backdrop / backdrop-issues

Issue tracker for Backdrop core.

Add a Telemetry module to (anonymously) collect useful data #285

Closed klonos closed 3 years ago

klonos commented 10 years ago

See new Issue Summary here: https://github.com/backdrop-ops/backdropcms.org/wiki/Telemetry-Initiative


Telemetry: (anonymously) collect useful data so that we can make better-informed decisions about what should go into (or be removed from) Backdrop core.

I remember the endless debates of whether a certain setting/module/feature should be on or off by default leading to 300+ long issues in d.o. Here are some related d.o issues:

Metrics collected in the initial implementation:

Other related d.o issues:

Recent d.org Telemetry initiative: https://www.drupal.org/project/ideas/issues/2940737

Gathering some key data from end-users about how Drupal is used can give the community and the Drupal Association insights that will help us improve the product roadmap, community programs and outreach efforts by the association, and more. Right now, the only data we receive is a very limited amount of data from the sites that call home to Drupal.org for updates information.

Telemetry initiative: Gathering data about Drupal usage

The goal is to gather data about who uses Drupal, what modules they use, what modules they don't use, maybe basic traffic/load information, what php/db versions are in common use, etc... All of this information could be tremendously helpful in setting direction for the project.

We would want to build a modular telemetry system so that we can gather different kinds of data with each major release, if we want to focus on certain areas of the project for improvement.


Use cases we are targeting to improve:

  • Providing aggregate, anonymized data to core developers to help them understand real-world usage patterns with Drupal.
  • Providing the Drupal Association with data about the scope of the Drupal community outside of Drupal.org - how many sites are there really? Who are they?
  • Making all of this telemetry sending opt-out.
  • Making sure that headless sites can be identified as Drupal sites underneath by crawlers, and by the d.o metrics system

PR by @docwilmot (based on @quicksketch's work): https://github.com/backdrop/backdrop/pull/3704

klonos commented 10 years ago

This would help us make more educated decisions in issues like #278, #279 etc. instead of having to estimate 80/20 cases (and have people disputing over the percentages).

mikemccaffrey commented 8 years ago

What is the status of this initiative? Has @quicksketch done any work to track configuration statistics for core yet?

I'd like to track the "Users may log in" setting that was introduced in issue #277. It seems like 95%+ of sites would be fine with users logging in with either their username or their email address, and have no need to restrict it to one or the other. Since it may create some confusion (see #1994), we may want to remove that setting in a future version if no one is using it.

klonos commented 8 years ago

@mikemccaffrey that use case, and the need to make educated decisions (instead of doing guesswork), is precisely why this issue was filed. It was a thing that greatly bothered me on d.org, where decisions were made based on what a group of people thought "most people need/use".

mikemccaffrey commented 7 years ago

I think that the first thing that we need to determine is what we are going to call this thing that we are building. It seems like when we are describing this functionality, you could use any combination of "statistics", "feedback", "analytics", "logging", or "reporting".

Maybe it would help if we thought about how we are going to present the feature to the end users. What should we ask next to the checkbox to turn it on and off? "Would you like to send anonymous data to backdropcms.org to help inform future product development?"

What do others think? Is there anything in the project module already that does reporting? Should we look there to see what it is called?

klonos commented 7 years ago

Well, if we get technical about it, then we are not "logging" anything. Not on the actual system where the data gathering is to be performed anyways. The logging part will be made on the b.org side, and even then it's not logging, but rather data storing.

Also, "feedback" to me implies user interaction and not something that is done automatically in the background.

The term "heuristics" was suggested over Gitter. (Ancient Greek: εὑρίσκω, "find" or "discover")

...any approach to problem solving, learning, or discovery that employs a practical method not guaranteed to be optimal or perfect, but sufficient for the immediate goals. Where finding an optimal solution is impossible or impractical, heuristic methods can be used to speed up the process of finding a satisfactory solution. Heuristics can be mental shortcuts that ease the cognitive load of making a decision. Examples of this method include using a rule of thumb, an educated guess, an intuitive judgment, stereotyping, profiling, or common sense.

...although it makes perfect sense etymologically, I'm not sure if most people are familiar with the word or what it means.

"statistics", "analytics" and "reporting" make more sense to me, but these words alone do not provide enough context. Something like "Feature Analytics" perhaps?

klonos commented 7 years ago

"Would you like to send anonymous data to backdropcms.org to help inform future product development?"

This sounds really good 👍. Perhaps lose the word "data" because people will start wondering what sorts of data. Better to say "statistics" instead I think.

"future product development" is very accurate, but people care more about "features" rather than the general product development, so how about adding that word into play in order to make it more "luring" to keep that checkbox ticked.

Also, change the order of the purpose and what we are asking, because when people reach half-way through that sentence and all they have read is "send data", they might skip reading the rest of it.

Something like this perhaps:

"Would you like to help us make better-informed decisions when adding new product features to Backdrop by sending anonymous statistics to backdropcms.org?"

Note that we are not telling them that we will also be using that information to remove certain features 😈 [evil laugh]

klonos commented 7 years ago

...it would also be a great idea to have a "more about this" link that explains what data is being transmitted, the fact that we do not share this information with 3rd parties and, more importantly, our privacy policy that ensures that the information collected is anonymous and cannot be traced back to the person/site that provides it.

olafgrabienski commented 7 years ago

not telling them that we will also (...) removing certain features

I guess, that's really a problem if we suggest it's (only) about "adding new product features".

jenlampton commented 7 years ago

"Would you like to send anonymous data to backdropcms.org to help inform future product development?"

I love this language. Product development doesn't limit us to adding features, but could include removing some, too.

Can we add a link from this issue to the one where we itemized the things we want to be tracking? (Maybe that one was in Project module?)

jenlampton commented 7 years ago

We're two weeks away from code freeze for 1.8, and with no code here yet to review or revise it's not likely this feature will get done in time. Bumping to 1.9.

ghost commented 6 years ago

This is something I noticed (in the recent CMS installation comparison video) that Joomla does. Not being at all familiar with Joomla, here's some information I've found that may help in deciding if/how we do this in Backdrop:

My personal opinion is that this would be a good idea, as long as it's done anonymously, and with the user's consent (maybe disabled by default?). I also support the idea of linking to a page on BDcms.org specifically discussing this, why we do it, why you can trust us, etc. Maybe even link to the code on GitHub showing what data we collect?

There's the potential to collect lots of useful information - not just PHP version, Backdrop version, etc., but things like if content revisions are enabled, the site timezone, how often cron runs, etc. (or is that getting too personal?). Also, I like how Joomla provides an API for developers to use that information, giving it back to the community as it were.

klonos commented 6 years ago

Here's what Joomla does:

[Screenshot, 2018-07-02]

Stats Collection in Joomla

Since version 3.5.0

Since Joomla! 3.5 a statistics plugin will submit anonymous data to the Joomla Project. This will only submit the Joomla version, PHP version, database engine and version, and server operating system.

This data is collected to ensure that future versions of Joomla can take advantage of the latest database and PHP features without affecting significant numbers of users. The need for this became clear when a minimum of PHP 5.3.10 was required when Joomla! 3.3 implemented the more secure Bcrypt passwords.

In the interest of full transparency and to help developers this data is publicly available. An API and graphs will show the Joomla version, PHP versions and database engines in use.

If you do not wish to provide the Joomla Project with this information you can disable the plugin called System - Joomla Statistics.

klonos commented 6 years ago

...and here's what their publicly available page with the collected stats looks like:

[Screenshot, 2018-07-02]

docwilmot commented 6 years ago

This sounds like a job for @Gormartsen

dyrer commented 6 years ago

Before statistics are collected and sent back to Backdrop, admins or site owners must be asked whether they want to share their data.

klonos commented 6 years ago

@dyrer yep, that is the point of #3168

As it is now, during installation, we ask people if they want to check for available updates. If they say yes, we also collect data. We should not be doing that.

The current proposal is to:

dyrer commented 6 years ago

WordPress applies security updates without asking (all 4.9.x versions). So I agree: always check for updates, without collecting data. Administrators should have the option to change their mind after installation. So in my opinion you can offer the option during installation and also in the settings. This option could be located with the update options.

klonos commented 6 years ago

I have added a link to the d.org Telemetry initiative in the issue summary: https://www.drupal.org/project/ideas/issues/2940737

klonos commented 5 years ago

...from https://forums.classicpress.net/t/classicpress-1-0-0-aurora-release-notes/910

Admin dashboard. WordPress-specific features like community events and featured plugins have been removed and/or replaced with ClassicPress equivalents. For example, we’ve added a “Featured Petitions” widget to encourage community participation in our development process.

klonos commented 5 years ago

[Image]

klonos commented 5 years ago

Include an anonymous site identifier when communicating with the ClassicPress updates API (details). ClassicPress can use this to count active sites, but not to identify them individually.
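
To illustrate how an anonymous identifier lets a server count sites without knowing who they are, here is a minimal sketch in plain PHP. It assumes each check-in is stored as a site key plus a Unix timestamp; the function name `example_count_active_sites()` and the sample data are hypothetical, and this is not taken from ClassicPress or Backdrop code.

```php
<?php
/**
 * Hypothetical helper: count distinct site keys seen since a cutoff time.
 *
 * Each ping is assumed to be an array with 'site_key' and 'time' entries.
 */
function example_count_active_sites(array $pings, int $since): int {
  $seen = array();
  foreach ($pings as $ping) {
    if ($ping['time'] >= $since) {
      // Using the key as an array index de-duplicates repeat check-ins.
      $seen[$ping['site_key']] = true;
    }
  }
  return count($seen);
}

// Sample data: site "aaa" checks in twice, "ccc" only before the cutoff.
$pings = array(
  array('site_key' => 'aaa', 'time' => 1000),
  array('site_key' => 'aaa', 'time' => 2000),
  array('site_key' => 'bbb', 'time' => 2500),
  array('site_key' => 'ccc', 'time' => 500),
);
$active = example_count_active_sites($pings, 900);
```

The server only ever compares opaque keys, so it can tell "the same site again" from "a new site" without learning a URL or owner.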

jenlampton commented 5 years ago

There's no PR here yet, should we bump this issue to the next milestone, or is this something we can get done in less than one month?

I think we may be able to move the current info into a separate module (without changing any functionality) in the next month. But we may need to set aside the addition of any significant new features for 1.14.

docwilmot commented 5 years ago

The problem here is that we haven't decided how this data will be fetched, and I suspect we'll need the PMC to chime in here. This will need to be a specific service on B.org that either fetches this info or is sent this info from sites. Currently Update module just fetches a known feed; that won't work for this noble proposal here.

Once this has been decided then someone can start building the core mechanism to collect and package and send this data to the mothership.

jenlampton commented 5 years ago

we haven't decided how this data will be fetched

I don't think there's any decision to be made here. The data will be collected in the same way project module collects usage statistics now: each site will send information to a service at backdropcms.org.
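
As a rough illustration of "each site sends information to a service", the sketch below packages collected values into a JSON POST request description, and does nothing unless the site has opted in. Everything here is an assumption for illustration: the function name, the `$opted_in` flag, and the endpoint URL are hypothetical, and no request is actually sent.

```php
<?php
/**
 * Hypothetical helper: describe the request a site would send, or NULL.
 */
function example_build_telemetry_request(array $data, bool $opted_in): ?array {
  // Never build a request for sites that have not agreed to share data.
  if (!$opted_in) {
    return null;
  }
  return array(
    // Placeholder endpoint; the real service location was not decided here.
    'url' => 'https://telemetry.example.org/submit',
    'method' => 'POST',
    'headers' => array('Content-Type' => 'application/json'),
    'data' => json_encode($data),
  );
}

$sample = array('php' => '8.1', 'mysql_type' => 'MariaDB');
$request = example_build_telemetry_request($sample, true);
```

The returned array could then be handed to whatever HTTP client the implementation ends up using.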

Today in the meeting we discussed keeping the telemetry data separate from the project data on b.org, because project module is already quite complex, and also because it's unlikely that the telemetry data would be useful as a contrib project.

docwilmot commented 5 years ago

IIRC Update module doesn't send any data anywhere; it fetches an XML file from B.org for each project it wants to check for updates.

Project module on B.org simply counts how many times sites are fetching.

This project may end up using Project/Update to do this work, but the point of my last post is that we will need new (complex) code in multiple locations, contrib and custom on B.org (Project and Borg), along with core code changes, and we'll also need to decide and design how we do this efficiently. I suppose a single dev could build the Project code and the core code and test them talking to each other, but I think it's more reasonable for this to be a joint discussion led by the senior programming leads.

jenlampton commented 5 years ago

If this is the case then there's definitely too much work to do here for 1.13. Bumping milestone. Also related: https://github.com/backdrop/backdrop-issues/issues/3168

Oh, and I checked, we do have an open PMC issue about https://github.com/backdrop/backdrop-issues/issues/3168, I will add to that a discussion of this issue as well.

jenlampton commented 5 years ago

Removing the 1.14 milestone, and adding the milestone candidate label. If this issue gets an advocate who wants to push it through the 1.14 release, it can get the milestone back :)

herbdool commented 5 years ago

@docwilmot according to https://www.drupal.org/project/drupal/issues/1036780#comment-4970352 it also parses the URL for projects and, more recently, sub-modules and themes. So if we added things to that URL they would at least be in the logs. We would then need to add more parsing at the b.org end.

I'm interested in an MVP that adds a couple of key items to the URL, such as PHP version and web server. But I do agree that longer term it should be a separate module and not stuff a URL with all the data.
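
A minimal sketch of that MVP idea, in plain PHP: append a few environment values to the update-check URL as query parameters so they show up in the server logs. The function name `example_update_url()`, the parameter names `php` and `server`, and the URL are all hypothetical; this is not how Update module builds its URLs today.

```php
<?php
/**
 * Hypothetical helper: append extra query parameters to a URL.
 */
function example_update_url(string $base, array $extra): string {
  // Use "&" if the URL already has a query string, "?" otherwise.
  $separator = (strpos($base, '?') === false) ? '?' : '&';
  return $base . $separator . http_build_query($extra);
}

$url = example_update_url(
  // Placeholder feed URL, for illustration only.
  'https://updates.example.org/release-history/backdrop/1.x',
  array(
    // Major.minor only, to avoid sending more detail than needed.
    'php' => PHP_MAJOR_VERSION . '.' . PHP_MINOR_VERSION,
    // Not set on the command line, hence the fallback.
    'server' => $_SERVER['SERVER_SOFTWARE'] ?? 'unknown',
  )
);
```

This keeps the change small, but every new metric lengthens the URL, which is the reason given above for moving to a separate module in the longer term.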

docwilmot commented 5 years ago

@herbdool you're right, I didn't notice that; it's been in core for years, it seems. But the code to actually parse the URL isn't in Project.module though?

herbdool commented 5 years ago

@docwilmot not clear if d.org is using this patch https://www.drupal.org/project/project/issues/1274766 or if they haven't published it. Perhaps still useful as starting point.

docwilmot commented 5 years ago

If they are, it's a private patch, or maybe a custom module. That code isn't in Project.

serundeputy commented 5 years ago

I agree w/ @docwilmot the main thing needed to move this issue forward is a detailed architecture and a plan.

I'll put one possible architecture out there to get the ball rolling.

Elasticsearch

To me this whole thing screams Elasticsearch: https://www.elastic.co/products/elasticsearch?ultron=[B]-Elastic-US+CA-Exact&blade=bing-s&Device=c&thor=elasticsearch&msclkid=5325aa35318615d0f4aef72de0066aba

So one possible implementation could be:

This approach would also take the datastore and data analysis off the plate of b.org and allow b.org to keep being a good Backdrop site w/out overburdening it.

Other Options

I'm/we're interested in other architectures, but until an architecture is decided on, planning is almost futile. Once we have the tech stack we can create tasks and assign them to interested developers (I count myself amongst that group).

stpaultim commented 5 years ago

@klonos just shared this interesting link/chart in gitter. https://wordpress.org/about/stats/

klonos commented 5 years ago

...and because the internet never stays the same, here's what that looks like at the moment:

[Screenshot of wordpress.org/about/stats, 2019-09-18]

ghost commented 5 years ago

In my non-Backdrop work, I've spent the last month setting up a system whereby Behat tests can be run automatically when code is pushed to a repo (similar to how our PRs are tested automatically here on GitHub).

To do this, I essentially:

I'm no expert, but wouldn't this kind of thing work here too?

Or is this just a simplified version of what @serundeputy already suggested with ElasticSearch? (I'm not familiar with ES, but the name makes me think of Apache Solr, which makes me wonder how it's related to collecting telemetry data...)

klonos commented 5 years ago

which makes me wonder how it's related to collecting telemetry data

I believe it has to do more with displaying the data and allowing people to search it.

klonos commented 5 years ago

Just received this via email:

Dear GitLab users and customers,

On October 23, we sent an email entitled “Important Updates to our Terms of Service and Telemetry Services” announcing upcoming changes. Based on considerable feedback from our customers, users, and the broader community, we reversed course the next day and removed those changes before they went into effect. Further, GitLab will commit to not implementing telemetry in our products that sends usage data to a third-party product analytics service. This clearly struck a nerve with our community and I apologize for this mistake.

So, what happened? In an effort to improve our user experience, we decided to implement user behavior tracking with both first and third-party technology. Clearly, our evaluation and communication processes for rolling out a change like this were lacking and we need to improve those processes. But that’s not the main thing we did wrong.

Our main mistake was that we did not live up to our own core value of collaboration by including our users, contributors, and customers in the strategy discussion and, for that, I am truly sorry. It shouldn’t have surprised us that you have strong feelings about opt-in/opt-out decisions, first versus third-party tracking, data protection, security, deployment flexibility and many other topics, and we should have listened first.

So, where do we go from here? The first step is a retrospective that is happening on October 29 to document what went wrong. We are reaching out to customers who expressed concerns and collecting feedback from users and the wider community. We will put together a new proposal for improving the user experience and share it for feedback. We made a mistake by not collaborating, so now we will take as much time as needed to make sure we get this right. You can be part of the collaboration by posting comments in this issue: https://gitlab.com/gitlab-com/www-gitlab-com/issues/5672. If you are a customer, you may also reach out to your GitLab representative if you have additional feedback.

I am glad you hold GitLab to a higher standard. If we are going to be transparent and collaborative, we need to do it consistently and learn from our mistakes.

I would like us to have this in mind and not repeat any such mistakes with our implementation.

stpaultim commented 4 years ago

@serundeputy and I recently talked about putting some focus on this issue. This could be an initiative or I might advocate for this issue, if that would be helpful. But, I need help in figuring out what this means and how to approach it. I am thinking about setting up a special zoom meeting with anyone that wants to talk about this and how to move it forward.

docwilmot commented 4 years ago

I propose we break this down into the required parts to start, and make some sub-issues:

On backdropCMS:

(would Elastic Search be an option for 3 and 4 above?)

In Backdrop core:

Lots of decisions so far. The code would be secondary I suspect.

klonos commented 4 years ago

Adding this here as an example/idea of how things could look/work in the UI:

[Screenshot, 2020-04-10]

The "Read more" link goes to https://code.visualstudio.com/docs/supporting/faq#_how-to-disable-telemetry-reporting

PS: I like how they have a separate "Crash reporting": https://code.visualstudio.com/docs/supporting/faq#_how-to-disable-crash-reporting ...which in our case could be "PHP error and WSOD reporting" or something like that.

serundeputy commented 4 years ago

I've got a start on a new telemetry module for Backdrop core collecting:

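global $base_url;
$db_version = Database::getConnection()->version();
$data = [
  // Anonymous site key: HMAC of the base URL, keyed with the site's private key.
  'site_key' => backdrop_hmac_base64($base_url, backdrop_get_private_key()),
  'php' => phpversion(),
  // MariaDB identifies itself in the version string, e.g. "5.5.5-10.3.22-MariaDB".
  'mysql_type' => stripos($db_version, 'MariaDB') !== FALSE ? 'MariaDB' : 'MySQL',
  'mysql_version' => $db_version,
];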
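
For readers unfamiliar with the `site_key` line, here is a minimal standalone sketch of the idea in plain PHP. It assumes an HMAC-SHA256 of the base URL, keyed with a per-site secret and encoded as URL-safe base64. The function name `example_site_key()` and the sample values are hypothetical, and Backdrop's own `backdrop_hmac_base64()` may differ in detail.

```php
<?php
/**
 * Hypothetical helper: derive a stable, anonymous identifier for a site.
 */
function example_site_key(string $base_url, string $private_key): string {
  // Raw binary HMAC; without the secret key it cannot be recomputed from the URL.
  $raw = hash_hmac('sha256', $base_url, $private_key, true);
  // URL-safe base64: swap "+" and "/" for "-" and "_", drop "=" padding.
  return rtrim(strtr(base64_encode($raw), '+/', '-_'), '=');
}

// Same URL, different secrets: the resulting keys differ.
$key_a = example_site_key('https://example.com', 'secret-one');
$key_b = example_site_key('https://example.com', 'secret-two');
```

Because the secret never leaves the site, the receiving service can recognise repeat submissions from one site without being able to work out which URL produced the key.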

docwilmot commented 4 years ago

@serundeputy please see my comments https://github.com/backdrop/backdrop-issues/issues/285#issuecomment-591096635 and advise if this is realistic or necessary and how the rest of us can participate. I imagine an initiative like this would need multiple moving parts to work together, so we'd need a plan for the rest of the stuff that you're not personally working on. How are we approaching getting all the parts working here?

docwilmot commented 4 years ago

P.S. I assumed you were the initiative lead for this. Don't recall who is.

jenlampton commented 4 years ago

@serundeputy can we get an update on Telemetry for the weekly meetings?

klonos commented 4 years ago

I'm really interested in helping with this. Do let us know what you need help with @serundeputy 🙂

serundeputy commented 4 years ago

Thanks everyone!

I've not had any tangible progress on this since the initial PoC module. We need an:

jenlampton commented 4 years ago

@serundeputy can you explain what an ES server is, and why we need one? (Not everyone reading / contributing to these issues understands the acronyms.)

We also need some direction as to how the rest of us can contribute. Would it be helpful to have people start writing gathering code for all the issues linked in the top post, for example?

stpaultim commented 4 years ago

It's my understanding from emails with @serundeputy that @serundeputy may not have time to work on this in the near future and I have volunteered to assume some responsibility for moving this initiative forward. Still waiting to confirm this with @serundeputy.

We had a discussion about this with @quicksketch, @klonos, and myself at Backdrop LIVE. I have a Google Doc with a bunch of thoughts in it, and my goal is to add this to the issue queue quickly. I am thinking about starting a new meta issue with a clean history and a very complete summary of what has been discussed so far.

Here is a link to Google Doc.

stpaultim commented 4 years ago

I'm thinking about creating a new META issue in the BackdropCMS.org issue queue, since this initiative is broader than just changing core code. It also involves policy decisions and code on BackdropCMS.org or other locations. But for today, we'll stick with this issue.

But (for now) I started by creating a summary of this issue in the BackdropCMS.org repo wiki. This is a DRAFT summary, reflecting my understanding of where things stand after the meeting during Backdrop LIVE on Sept 17 and after reviewing this issue in detail. https://github.com/backdrop-ops/backdropcms.org/wiki/Telemetry-Initiative

My summary assumes that we are NOT using anything like ElasticSearch at this time. We can move in that direction in the future, but during the Backdrop LIVE discussion we decided to start simple and keep the data in BackdropCMS.org database for now.

This is something that may need additional discussion, but I'm working with that assumption for now.

Please, review my summary and ask questions, provide clarifications, and anything else.

Hopefully, we can talk about this at one of the next two dev meetings.

After bringing this up at a DEV meeting, I would like to plan another meeting to just work on this initiative. Please, let me know if you have time and interest to participate in this initiative.

ghost commented 4 years ago

I like this! I read through the summary, and one thing stood out to me as a great idea: hook_telemetry_data(). This was mentioned in reference to contrib integration, but I see this as being the way to implement this feature in core and contrib. Here are some thoughts:

Create a hook_telemetry_data_types() hook. This will allow a module (core or contrib) to define the types of data it's going to collect. For example:

/**
 * Implements hook_telemetry_data_types().
 */
function system_telemetry_data_types() {
  return array(
    'php_version' => array(
      'title' => t('PHP version'),
      'description' => t("The version of PHP used on this site. E.g. '7.2'."),
    ),
    'mysql_version' => array(
      'title' => t('MySQL version'),
      'description' => t("The version of MySQL/MariaDB used on this site. E.g. '5.7'."),
    ),
  );
}

Create a hook_telemetry_data() hook. This is what will be called on cron/update (or whenever data is collected) and will return the data appropriately. For example:

function system_telemetry_data($data_type) {
  $data = NULL;

  switch ($data_type) {
    case 'php_version':
      $data = phpversion();
      break;
    case 'mysql_version':
      $data = Database::getConnection()->version();
      break;
  }

  return $data;
}

Create a report page that lists all the data that's collected, and its current value. This'll give people an idea of exactly what's being shared, and helps with transparency. For example:

| Data | Description | Value |
| --- | --- | --- |
| PHP version | The version of PHP used on this site. E.g. '7.2'. | 7.2.34 |
| MySQL version | The version of MySQL/MariaDB used on this site. E.g. '5.7'. | 5.5.5-10.3.22-MariaDB |
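
To show how the two proposed hooks could fit together, here is a rough standalone sketch in plain PHP. It assumes hooks are ordinary functions named after their module, and uses a stand-in "example" module so nothing clashes with real code. The collector `example_collect_telemetry()` is hypothetical; in Backdrop itself the lookup would presumably go through the module system rather than `function_exists()`.

```php
<?php
// Stand-in implementations for a hypothetical "example" module.
function example_telemetry_data_types(): array {
  return array(
    'php_version' => array(
      'title' => 'PHP version',
      'description' => 'The version of PHP used on this site.',
    ),
  );
}

function example_telemetry_data(string $data_type) {
  switch ($data_type) {
    case 'php_version':
      return phpversion();
  }
  return null;
}

/**
 * Hypothetical collector: ask each module for its types, then their values.
 */
function example_collect_telemetry(array $modules): array {
  $report = array();
  foreach ($modules as $module) {
    $types_hook = $module . '_telemetry_data_types';
    $data_hook = $module . '_telemetry_data';
    // Skip modules that do not implement both hooks.
    if (!function_exists($types_hook) || !function_exists($data_hook)) {
      continue;
    }
    foreach ($types_hook() as $type => $info) {
      // Keep title and description next to the value, as on the report page.
      $report[$type] = $info + array('value' => $data_hook($type));
    }
  }
  return $report;
}

$report = example_collect_telemetry(array('example', 'missing_module'));
```

The same `$report` array could feed both the report page and the data that gets sent, so what users see is exactly what is shared.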

Other ideas: