IQSS / dataverse-metrics

Aggregate and visualize metrics for installations of Dataverse around the world
https://dataverse.org/metrics
Apache License 2.0

Scalable Metrics Visualizations using the Metrics Aggregator #4

Closed djbrooke closed 5 years ago

djbrooke commented 5 years ago

The metrics on dataverse.org/metrics pull from a Miniverse instance running on AWS. The visualizations themselves are fine, but the Miniverse installation queries the Harvard Dataverse DB directly for the reported numbers. This is not scalable beyond Harvard. We should show the same metrics, but instead pull numbers from the newish Metrics Aggregator (https://github.com/IQSS/metrics.dataverse.org), which can collect metrics from other installations. This reporting will represent the community and not just a single dataverse installation.

So, same metrics, new source.

pdurbin commented 5 years ago

As I mentioned at standup, I'm hacking away at http://metrics.dataverse.org if anyone would like to see the progress. @TaniaSchlatter stopped by (thanks!) and we noticed sorting was different in Firefox vs. Chrome but I believe I've fixed this. She also asked about responsiveness and I added ".resize(true)".

I believe we should fix these two issues as well:

I still have a lot of cleanup to do (and more colors to fix) but overall I've made decent progress. I'd like to shout out to Steve W. from DSS for giving me a d3plus csv example to look at https://dss.iq.harvard.edu/metrics

pdurbin commented 5 years ago

I just made pull request #5 and deployed the code as of bdc00b6 to http://metrics.dataverse.org

Below I'll paste some screenshots of old/current vs. new. Please note that the hover behavior is different so you should try out the live sites to experience it:

Please note that I included fixes for the following issues:

old/current whole page

old2019-03-12_14 46 38

new whole page

new2019-03-12_14 46 49

old/current hover example

Screen Shot 2019-03-12 at 2 48 28 PM

new hover example

Screen Shot 2019-03-12 at 2 48 12 PM

TaniaSchlatter commented 5 years ago

A few things to refine:

- Is there a way to scale the charts so that they are more similar to the old charts in terms of length of bars and proportions of bars?

pdurbin commented 5 years ago

@TaniaSchlatter thanks for the feedback.

One of the big differences between the old/current and the new is that data is coming from 13 installations of Dataverse instead of just Harvard Dataverse. Here are the 13 installations of Dataverse in the screenshots. I've actually been thinking that perhaps we could expose the list of installations (or at least the number of installations) on the page but that would increase the scope.

13 Dataverse installations polled

Datasets by Subject

Harvard Dataverse only uses these subjects:

Agricultural Sciences
Arts and Humanities
Astronomy and Astrophysics
Business and Management
Chemistry
Computer and Information Science
Earth and Environmental Sciences
Engineering
Law
Mathematical Sciences
Medicine, Health and Life Sciences
N/A
Other
Physics
Social Sciences

The other 13 installations of Dataverse have apparently added additional subjects by hacking on their databases. The list is much longer:

77483   Not specified
15133   Social Sciences
2628    Medicine, Health and Life Sciences
1824    Earth and Environmental Sciences
1073    Agricultural Sciences
1005    Physics
977     Arts and Humanities
888     Other
716     Computer and Information Science
406     Astronomy and Astrophysics
367     Engineering
351     Business and Management
217     Law
151     Mathematical Sciences
125     Chemistry
35      Biodiversity and Ecology
27      Soils and soil sciences
27      Omics
25      Microorganisms
25      Arts and Humanities (Ex: English, History, Foreign, Language)
17      Plant Breeding and Plant Products
15      Architecture
13      Farming Systems and Practices
12      Water resources
12      Plant Health and Pathology
10      Animal Breeding and Animal Products
9       Social Sciences (Ex: Education, Politics, Sociology, Economics, Psychology)
9       Forests and Forest Products
9       Food and food processing
9       Climate
9       Animal Health and Pathology
7       Fishes and Aquaculture
7       Computer science
6       Material Science and Engineering
6       Insects and Entomology
6       Human Nutrition and food security
6       Environmental Sciences
5       Fine and Performing Arts
4       Rural and Agricultural Sociology
4       Information management
3       Human Health and Pathology
3       Food Safety and Toxicology
3       Economics
3       Chemistry and chemical engineering
1       Business, Management, Leadership
1       Astronomy

"77483 Not specified" comes entirely from https://data.inra.fr/api/info/metrics/datasets/bySubject

Here's a screenshot:

Screen Shot 2019-03-13 at 11 07 46 AM

One solution would be to simply remove https://data.inra.fr from the list of 13 installations.

How do we feel about all the other subjects that were added by installations hacking on their databases?
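To make the effect of a single installation on the combined numbers concrete, here is a minimal Python sketch of summing per-subject counts across installations, with an optional set of installations to leave out. This is not the aggregator's actual code: the function name `aggregate_subjects`, the `exclude` parameter, and the `https://example.edu` data are hypothetical, and the response shape (a `data` list of objects with `subject` and `count`) is an assumption about what the `datasets/bySubject` endpoint returns.

```python
from collections import Counter


def aggregate_subjects(responses, exclude=()):
    """Sum per-subject dataset counts across installations.

    responses: dict mapping installation URL -> parsed JSON response
               (assumed shape: {"data": [{"subject": ..., "count": ...}]}).
    exclude:   installation URLs to leave out of the totals (hypothetical option).
    """
    totals = Counter()
    for url, body in responses.items():
        if url in exclude:
            continue
        for row in body.get("data", []):
            totals[row["subject"]] += row["count"]
    return totals


# Illustrative data only: the 77483 figure is from the thread, the rest is made up.
responses = {
    "https://data.inra.fr": {
        "data": [{"subject": "Not specified", "count": 77483}],
    },
    "https://example.edu": {
        "data": [
            {"subject": "Social Sciences", "count": 10},
            {"subject": "Physics", "count": 3},
        ],
    },
}

everything = aggregate_subjects(responses)
without_inra = aggregate_subjects(responses, exclude={"https://data.inra.fr"})

print(everything.most_common(1))    # the largest subject with all installations
print(without_inra.most_common(1))  # the largest subject once one is excluded
```

Excluding at aggregation time keeps the cached responses untouched, so an installation can be added back later without polling it again.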

Other feedback

Let's discuss!

pdurbin commented 5 years ago

@TaniaSchlatter I hacked together a dynamic list of installations if you'd like to use it as a starting point. I deployed the (uncommitted) code to http://metrics.dataverse.org and here's a screenshot:

Screen Shot 2019-03-13 at 12 53 37 PM

pdurbin commented 5 years ago

As I mentioned at standup, I still have some backend work to do to make the blacklist configurable (in 362ddba I started using "mute" but the list is hard coded). Below are screenshots for how it looks now and I'm happy to continue iterating as needed.

new whole page

Dataverse_Metrics_-_2019-03-18_12 39 47

new hover example

Screen Shot 2019-03-18 at 12 36 50 PM

pdurbin commented 5 years ago

In 2e6560e I decided to implement the suggestion by @mheppler to make the list of installations easier on the eyes by replacing the URLs with the names of the installations as they appear on the map at dataverse.org, like this:

Screen Shot 2019-03-18 at 4 32 54 PM

(I also noticed that I have unb.ca twice, so we actually are only talking about 12 installations, not 13. Whoops!)

At the design standup I mentioned that we're going to need a valid HTTPS cert to use it at https://dataverse.org/metrics which uses an iframe to include https://services.dataverse.harvard.edu/miniverse/metrics/basic-viz/last12-dataverse-org?iframe=true . @djbrooke and I talked this out a bit and decided the code will probably be deployed to that same services.dataverse.harvard.edu server that hosts the map. I had added metrics.dataverse.org to DNS on a whim but nothing says we'll be using it in production and I may well remove it. As such Danny and I decided to rename the repo from metrics.dataverse.org to dataverse-metrics, which I did in 8366d02.

While I was adjusting the README for the rename, I also noted that this repo can be configured for a single installation of Dataverse. Here's how it looks with just https://dataverse.unc.edu for example.

uncOnlyDataverse_Metrics_-_2019-03-18_16 08 25

I believe I've done everything on the list so I'm moving this issue to code review. More feedback is welcome, of course! I just deployed the latest to http://metrics.dataverse.org

pdurbin commented 5 years ago

Today I learned from @TaniaSchlatter that we plan to host two installations of dataverse-metrics:

In https://github.com/IQSS/dataverse/pull/5664 I explain that both of these modes (many or single) are possible. Some installations of Dataverse might also be interested in the "single" mode. I posted a UNC Dataverse example above.

My current to do list:

Screen Shot 2019-03-19 at 2 32 03 PM

pdurbin commented 5 years ago

Ask other developers about the 113K vs 104K discrepancy for datasets (screenshot below).

@TaniaSchlatter I asked about this at standup and @scolapasta explained that the "Total Datasets" number will not necessarily match the sum of the subjects from the "Datasets by Most Common Subject" plot because datasets can have multiple subjects.
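A tiny worked example may help show why the two numbers are counted differently. The sketch below uses made-up dataset identifiers and subjects (nothing here comes from a real installation) and simply compares the number of datasets with the sum of the per-subject counts when a dataset may carry more than one subject.

```python
# Each dataset lists one or more subjects (made-up sample data).
datasets = {
    "doi:10.5072/FK2/AAA": ["Social Sciences"],
    "doi:10.5072/FK2/BBB": ["Social Sciences", "Law"],
    "doi:10.5072/FK2/CCC": ["Physics", "Engineering", "Chemistry"],
}

# "Total Datasets" counts each dataset once.
total_datasets = len(datasets)

# A per-subject tally counts a dataset once for every subject it has.
by_subject = {}
for subjects in datasets.values():
    for subject in subjects:
        by_subject[subject] = by_subject.get(subject, 0) + 1

sum_of_subject_counts = sum(by_subject.values())

print(total_datasets)         # 3
print(sum_of_subject_counts)  # 6
```

A plot that shows only the most common subjects would also leave some subjects out of the sum, so the gap can go in either direction.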

As I mentioned to @mheppler and @TaniaSchlatter yesterday, I was finally able to figure out how to feed a palette of colors to a d3plus tree map using "heatmap" in 2ee4720, and now it should be trivial to factor those colors (and the ones from the bar graphs) out of the Javascript and into the config.json file. @mheppler and I couldn't think of any sane way to put them in CSS.

Yesterday I also added a 1.3 multiplier in 8eb26bb after showing Tania a few variations. The before (no multiplier) is on the left and the after (1.3 multiplier) is on the right:

Screen Shot 2019-03-21 at 12 23 14 PM

I mentioned that Steve Worthington from DSS suggested switching from bar charts to scatter plots. He wasn't necessarily recommending http://d3plus.org/examples/d3plus-plot/shapeSort/ but that's the one I found quickly that we looked at. The key thing is that with a scatter plot you don't have to start with zero on the Y axis. Here's a screenshot:

Screen Shot 2019-03-21 at 12 26 07 PM

He definitely agreed that it's good that we're addressing the Y axis truncation issue reported at https://github.com/IQSS/miniverse/issues/59

Finally, I tried out dataverse-metrics with just Harvard Dataverse as the only installation ("Installation 2" above) and here's how it looks with the latest code:

Dataverse_Metrics_-_2019-03-21_12 28 40

pdurbin commented 5 years ago

In 38739d1 I fixed the link colors to match the Style Guide and in 6194224 I made the colors in the plots configurable.

I moved this to code review. I don't write a lot of Javascript so I don't know if I'm doing things wrong. There are some warnings in https://jshint.com

Also, there's still a todo in the README about how there's no license. I don't know if that's important to add now or not. GitHub has a workflow where you can add a license file via a new branch and pull request so maybe we could treat that as a separate small chunk.

pdurbin commented 5 years ago

I made a few commits to resolve what I perceive to be the more egregious Javascript sins. One can copy and paste "plots.js" into https://jshint.com to see the remaining warnings.

TaniaSchlatter commented 5 years ago

I'm noticing that the treemap colors are not appearing as expected for a heatmap, and that there is a pair of duplicate colors in the code. Please try using the first and last colors in the string to generate the treemap shades (B22200, 282F6B), and see if that provides results that are more in line with what is expected. Thanks!

pdurbin commented 5 years ago

@TaniaSchlatter thanks for talking this out with me and @djbrooke

Good catch on the duplicate color. I removed that in 2df37b3

Then we decided to use just two colors in the heatmaps, so I updated the sample file in f1486b1

As we discussed, this is just a sample config file so anyone installing dataverse-metrics is welcome to use any color in the rainbow. 😄 🌈
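As a rough illustration of how a range of shades can be derived from just two endpoint colors, here is a Python 3 sketch that interpolates linearly in RGB between the two values mentioned above (B22200 and 282F6B). This is a generic technique shown for intuition only; d3plus may compute its heatmap scale differently, and the helper names `hex_to_rgb`, `rgb_to_hex`, and `shades` are hypothetical.

```python
def hex_to_rgb(value):
    """Convert 'RRGGBB' (with or without a leading '#') to an (r, g, b) tuple."""
    value = value.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple back to an uppercase 'RRGGBB' string."""
    return "".join("{:02X}".format(channel) for channel in rgb)


def shades(start_hex, end_hex, steps):
    """Return `steps` colors evenly spaced from start_hex to end_hex, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    start, end = hex_to_rgb(start_hex), hex_to_rgb(end_hex)
    result = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at the first color, 1.0 at the last
        rgb = tuple(round(s + (e - s) * t) for s, e in zip(start, end))
        result.append(rgb_to_hex(rgb))
    return result


palette = shades("B22200", "282F6B", 5)
print(palette)
```

The first and last entries of `palette` are the two configured colors; the entries in between are the generated shades.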

Here's a screenshot of how the "just Harvard Dataverse" installation looks with just two colors in the heatmap. Again, these colors come from the existing colors on https://dataverse.org/metrics

Dataverse_Metrics_-_2019-03-22_14 33 03

@djbrooke said to go ahead and move this to QA. I would suggest testing both installation types. Again, they are:

Please be aware that if you change the config file, the easiest way to make your browser pick it up is to navigate directly to https://example.com/dataverse-metrics/config.json as in the screenshot below (Firefox) and hit refresh:

Screen Shot 2019-03-22 at 2 39 04 PM

kcondon commented 5 years ago

@pdurbin

I've seen the following when running the aggregator:

[root@dvn-vm4 metrics.dataverse.org]# python metrics.py
Traceback (most recent call last):
  File "metrics.py", line 9, in <module>
    main()
  File "metrics.py", line 6, in main
    aggregate.main()
  File "/usr/local/metrics.dataverse.org/aggregate.py", line 18, in main
    process_monthly_itemized_endpoints(monthly_itemized_endpoints, api_response_cache_dir, aggregate_output_dir)
  File "/usr/local/metrics.dataverse.org/aggregate.py", line 99, in process_monthly_itemized_endpoints
    for name_and_count in json_data['data']:
KeyError: 'data'

I am running Python 2.6.6, but according to the project, Python 2 or 3 is OK. I renamed config.json.sample but did not edit the file because the values seemed OK. It is likely something obvious that I'm missing. I did clone the repo this way: git clone -b 4-d3plus https://github.com/IQSS/metrics.dataverse.org.git

There is data in the cache dir from today.

pdurbin commented 5 years ago

@kcondon thanks for the heads up about this. I didn't touch any of the Python code in my pull request but I can definitely reproduce the error you're seeing on another one of our test servers. Below I'm including the output, with timestamps before and after as a rough estimate of how long it took to run (5-6 minutes):

[root@dvn-vm2 metrics.dataverse.org]# date
Mon Mar 25 11:59:47 EDT 2019
[root@dvn-vm2 metrics.dataverse.org]# python metrics.py 
Traceback (most recent call last):
  File "metrics.py", line 9, in <module>
    main()
  File "metrics.py", line 6, in main
    aggregate.main()
  File "/usr/local/metrics.dataverse.org/aggregate.py", line 18, in main
    process_monthly_itemized_endpoints(monthly_itemized_endpoints, api_response_cache_dir, aggregate_output_dir)
  File "/usr/local/metrics.dataverse.org/aggregate.py", line 99, in process_monthly_itemized_endpoints
    for name_and_count in json_data['data']:
KeyError: 'data'
[root@dvn-vm2 metrics.dataverse.org]# date
Mon Mar 25 12:05:16 EDT 2019
[root@dvn-vm2 metrics.dataverse.org]#

pdurbin commented 5 years ago

@kcondon removing the following line from "config.json" that was added in pull request #2 fixes it:

"datasets/bySubject/toMonth"

This new endpoint produces the 7th tsv file that we decided not to try to visualize yet. (There are only 6 plots.) I'll go chat with @djbrooke about scope.
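For context, a KeyError on 'data' means the parsed JSON had no top-level "data" key. One plausible cause is a cached response that is an error body from an installation that does not offer the endpoint, though the thread does not confirm the root cause. The sketch below is not the code in aggregate.py: the function name `iter_name_and_count` and the shape of the error body are assumptions. It only shows one way to skip such responses instead of crashing.

```python
import json


def iter_name_and_count(raw_json, source="<unknown>"):
    """Yield items from the "data" list of a cached API response.

    Responses without a "data" key are skipped with a warning
    instead of raising KeyError.
    """
    json_data = json.loads(raw_json)
    data = json_data.get("data")
    if data is None:
        print("Skipping %s: no 'data' key (status: %s)"
              % (source, json_data.get("status", "unknown")))
        return
    for name_and_count in data:
        yield name_and_count


# Hypothetical cached responses: one good, one error body.
ok = '{"status": "OK", "data": [{"subject": "Law", "count": 217}]}'
error = '{"status": "ERROR", "message": "endpoint not found"}'

good_rows = list(iter_name_and_count(ok, "installation-a"))
skipped_rows = list(iter_name_and_count(error, "installation-b"))

print(good_rows)
print(skipped_rows)
```

Whether to skip silently, warn, or fail fast is a design choice; warning keeps one unsupported endpoint on one installation from stopping the whole run.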

djbrooke commented 5 years ago

Thanks @pdurbin. If we're not currently visualizing this, I'm OK excluding it right now. We can revisit if we decide to aggregate and visualize this.

pdurbin commented 5 years ago

@djbrooke great. In 093cf68 I pulled "datasets/bySubject/toMonth" out of the sample config and the README. As we discussed, I reopened #3 to track the fact that we do intend to add this 7th endpoint to the aggregator and create a 7th plot some day. I'll pass this back to QA.

kcondon commented 5 years ago

It was decided that some doc showing how to configure this on Apache was needed for testing and deployment.

pdurbin commented 5 years ago

I added some docs for running dataverse-metrics on Apache in 3887c9e

kcondon commented 5 years ago

@pdurbin Ok, thanks for the doc. I followed it again but missed that the clone repo line named the directory dataverse-metrics, that's helpful.

  1. Should we change the sample output directories in config.json.sample to reflect the instructions/Apache config?

Currently they point to:

"api_response_cache_dir": "/var/www/metrics.dataverse.org/cache",
"aggregate_output_dir": "/var/www/metrics.dataverse.org",

Maybe they should point to: /var/www/html/dataverse-metrics/cache and /var/www/html/dataverse-metrics.

  2. So far aggregated metrics look good with the exception of file downloads. They seem to be off by a bit but still in the same ballpark (6.69M agg, 7.24M browse). Finishing categories and subjects now. I will post a spreadsheet with aggregate versus browsed stats.

pdurbin commented 5 years ago

  1. Should we change the sample output directories in config.json.sample to reflect the instructions/Apache config?

Great suggestion. Fixed in b2cc387

kcondon commented 5 years ago

Thanks

pdurbin commented 5 years ago

I pinged @donsizemore in IRC yesterday and he was kind enough to try installing dataverse-metrics for UNC. Here's how it looks:

Dataverse_Metrics_-_2019-03-27_07 49 19

pdurbin commented 5 years ago

@juancorr tried it out too (thanks!) and here's how his installation looks:

Dataverse_Metrics_-_2019-03-27_09 56 39

djbrooke commented 5 years ago

We want to reflect that aggregation can result in some minor discrepancies. At the bottom let's add:

“Metrics are aggregated from multiple Dataverse installations running different versions, with different caching schedules, with some metrics endpoints enabled and others disabled. Minor discrepancies in these metrics can be expected.”

Consider adding render logic that only displays this when there's more than one installation being visualized.
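The display rule can be sketched in a few lines. The page itself is rendered with JavaScript, so the Python below is only an illustration of the logic, not the project's implementation; the function name `aggregation_note` and the URL normalization are hypothetical. Counting distinct installations also guards against a URL being listed twice.

```python
AGGREGATION_NOTE = (
    "Metrics are aggregated from multiple Dataverse installations running "
    "different versions, with different caching schedules, with some metrics "
    "endpoints enabled and others disabled. Minor discrepancies in these "
    "metrics can be expected."
)


def aggregation_note(installations):
    """Return the note only when more than one distinct installation is shown."""
    # Naive normalization (assumption): ignore case and a trailing slash so the
    # same installation listed twice is counted once.
    distinct = set(url.rstrip("/").lower() for url in installations)
    return AGGREGATION_NOTE if len(distinct) > 1 else None


single = aggregation_note(["https://dataverse.unc.edu"])
duplicate = aggregation_note(["https://dataverse.unc.edu",
                              "https://dataverse.unc.edu/"])
many = aggregation_note(["https://dataverse.unc.edu",
                         "https://dataverse.harvard.edu"])

print(single is None)     # True: one installation, no note
print(duplicate is None)  # True: the same installation listed twice
print(many is not None)   # True: two installations, note is shown
```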

pdurbin commented 5 years ago

@djbrooke thanks, I added that note in 9069835 and hide it when there is only one installation. Here's how it looks:

Screen Shot 2019-03-27 at 12 36 05 PM

@TaniaSchlatter also requested that the blue in the treemaps match the blue from "Total Files" and as of 9069835 they match. Here's a screenshot:

Screen Shot 2019-03-27 at 12 31 46 PM

As I mentioned at standup, once pull request #4 has been merged, I'm planning on adding a license (Apache, same as Dataverse) and cutting a release. Having a new tagged version will make https://github.com/IQSS/dataverse-ansible/issues/51 easier.

pdurbin commented 5 years ago

I just added a license and cut a release: https://github.com/IQSS/dataverse-metrics/releases/tag/v0.2.0