Closed djbrooke closed 5 years ago
As I mentioned at standup, I'm hacking away at http://metrics.dataverse.org if anyone would like to see the progress. @TaniaSchlatter stopped by (thanks!) and we noticed sorting was different in Firefox vs. Chrome but I believe I've fixed this. She also asked about responsiveness and I added ".resize(true)".
I believe we should fix these two issues as well:
I still have a lot of cleanup to do (and more colors to fix) but overall I've made decent progress. I'd like to shout out to Steve W. from DSS for giving me a d3plus csv example to look at https://dss.iq.harvard.edu/metrics
I just made pull request #5 and deployed the code as of bdc00b6 to http://metrics.dataverse.org
Below I'll paste some screenshots of old/current vs. new. Please note that the hover behavior is different so you should try out the live sites to experience it:
Please note that I included fixes for the following issues:
A few things to refine:
Remove the "not specified" group from the Datasets by subject. This may necessitate changing the header on the chart to "Datasets by subjects specified" for accuracy.
Possibly the same for "Datasets by Category" – we can discuss.
There is a lot more information in the rollovers in the old version. Add the specific type of count that applies to each rollover. It would also be very nice to have the quantity of the change (e.g. the number of new dataverses).
Is there a way to scale the charts so that they are more similar to the old charts in terms of length of bars and proportions of bars?
@TaniaSchlatter thanks for the feedback.
One of the big differences between the old/current and the new is that data is coming from 13 installations of Dataverse instead of just Harvard Dataverse. The 13 installations are shown in the screenshots. I've actually been thinking that perhaps we could expose the list of installations (or at least the number of installations) on the page, but that would increase the scope.
Harvard Dataverse only uses these subjects:
Agricultural Sciences
Arts and Humanities
Astronomy and Astrophysics
Business and Management
Chemistry
Computer and Information Science
Earth and Environmental Sciences
Engineering
Law
Mathematical Sciences
Medicine, Health and Life Sciences
N/A
Other
Physics
Social Sciences
The other 13 installations of Dataverse have apparently added additional subjects by hacking on their databases. The list is much longer:
77483 Not specified
15133 Social Sciences
2628 Medicine, Health and Life Sciences
1824 Earth and Environmental Sciences
1073 Agricultural Sciences
1005 Physics
977 Arts and Humanities
888 Other
716 Computer and Information Science
406 Astronomy and Astrophysics
367 Engineering
351 Business and Management
217 Law
151 Mathematical Sciences
125 Chemistry
35 Biodiversity and Ecology
27 Soils and soil sciences
27 Omics
25 Microorganisms
25 Arts and Humanities (Ex: English, History, Foreign, Language)
17 Plant Breeding and Plant Products
15 Architecture
13 Farming Systems and Practices
12 Water resources
12 Plant Health and Pathology
10 Animal Breeding and Animal Products
9 Social Sciences (Ex: Education, Politics, Sociology, Economics, Psychology)
9 Forests and Forest Products
9 Food and food processing
9 Climate
9 Animal Health and Pathology
7 Fishes and Aquaculture
7 Computer science
6 Material Science and Engineering
6 Insects and Entomology
6 Human Nutrition and food security
6 Environmental Sciences
5 Fine and Performing Arts
4 Rural and Agricultural Sociology
4 Information management
3 Human Health and Pathology
3 Food Safety and Toxicology
3 Economics
3 Chemistry and chemical engineering
1 Business, Management, Leadership
1 Astronomy
"77483 Not specified" comes entirely from https://data.inra.fr/api/info/metrics/datasets/bySubject
Here's a screenshot:
One solution would be to simply remove https://data.inra.fr from the list of 13 installations.
How do we feel about all the other subjects that were added by installations hacking on their databases?
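To make the discussion concrete, here is a rough sketch of the aggregation step (function and variable names are hypothetical, not the actual aggregator code): merge the per-installation bySubject counts and drop unwanted subjects such as "Not specified".

```python
from collections import Counter

def merge_subject_counts(per_installation_counts, drop=("Not specified",)):
    """Sum 'datasets by subject' counts across installations,
    optionally dropping subjects such as 'Not specified'."""
    totals = Counter()
    for counts in per_installation_counts:
        totals.update(counts)  # adds counts for shared subjects
    for subject in drop:
        totals.pop(subject, None)
    return dict(totals)

# Two made-up installations (numbers are illustrative):
harvard = {"Social Sciences": 15000, "Law": 200}
inra = {"Not specified": 77483, "Social Sciences": 133}

merged = merge_subject_counts([harvard, inra])
# "Not specified" is dropped; shared subjects are summed.
```

This is just one way to slice it; the open question above (which hacked-in subjects to keep) is a policy decision, not a code one.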
Let's discuss!
@TaniaSchlatter I hacked together a dynamic list of installations if you'd like to use it as a starting point. I deployed the (uncommitted) code to http://metrics.dataverse.org and here's a screenshot:
As I mentioned at standup, I still have some backend work to do to make the blacklist configurable (in 362ddba I started using "mute" but the list is hard coded). Below are screenshots for how it looks now and I'm happy to continue iterating as needed.
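For reference, the blacklist idea could be sketched like this. The key name "mute" comes from 362ddba, but the exact config structure and function names here are assumptions, not the shipped code:

```python
import json

def active_installations(config):
    """Filter muted installations out of the list before aggregating.
    Assumes a config with 'installation_urls' and an optional 'mute' list."""
    muted = set(config.get("mute", []))
    return [url for url in config["installation_urls"] if url not in muted]

config = json.loads("""
{
  "installation_urls": ["https://dataverse.harvard.edu", "https://data.inra.fr"],
  "mute": ["https://data.inra.fr"]
}
""")
# active_installations(config) keeps only the unmuted URLs.
```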
In 2e6560e I decided to implement the suggestion by @mheppler to make the list of installations easier on the eyes by replacing the URLs with the names of the installations as they appear on the map at dataverse.org, like this:
(I also noticed that I have unb.ca twice, so we actually are only talking about 12 installations, not 13. Whoops!)
At the design standup I mentioned that we're going to need a valid HTTPS certificate to use it at https://dataverse.org/metrics , which uses an iframe to include https://services.dataverse.harvard.edu/miniverse/metrics/basic-viz/last12-dataverse-org?iframe=true . @djbrooke and I talked this out a bit and decided the code will probably be deployed to the same services.dataverse.harvard.edu server that hosts the map. I had added metrics.dataverse.org to DNS on a whim, but nothing says we'll be using it in production and I may well remove it. As such, Danny and I decided to rename the repo from metrics.dataverse.org to dataverse-metrics, which I did in 8366d02.
While I was adjusting the README for the rename, I also noted that this repo can be configured for a single installation of Dataverse. Here's how it looks with just https://dataverse.unc.edu for example.
I believe I've done everything on the list so I'm moving this issue to code review. More feedback is welcome, of course! I just deployed the latest to http://metrics.dataverse.org
Today I learned from @TaniaSchlatter that we plan to host two installations of dataverse-metrics:
In https://github.com/IQSS/dataverse/pull/5664 I explain that both of these modes (many or single) are possible. Some installations of Dataverse might also be interested in the "single" mode. I posted a UNC Dataverse example above.
My current to-do list:
Ask other developers about the 113K vs 104K discrepancy for datasets (screenshot below).
@TaniaSchlatter I asked about this at standup and @scolapasta explained that the "Total Datasets" number will not necessarily match the sum of the subjects from the "Datasets by Most Common Subject" plot because datasets can have multiple subjects.
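A tiny illustration of @scolapasta's point, with made-up data: a dataset with two subjects contributes to two subject buckets, so the by-subject sum can exceed the dataset total.

```python
datasets = [
    {"id": 1, "subjects": ["Law"]},
    {"id": 2, "subjects": ["Law", "Chemistry"]},  # counted under both subjects
]

total_datasets = len(datasets)  # 2 datasets

# Tally each dataset once per subject it declares.
subject_counts = {}
for d in datasets:
    for s in d["subjects"]:
        subject_counts[s] = subject_counts.get(s, 0) + 1

sum_of_subject_counts = sum(subject_counts.values())  # 3, which is > 2
```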
As I mentioned to @mheppler and @TaniaSchlatter yesterday, I finally figured out how to feed a palette of colors to a d3plus tree map using "heatmap" in 2ee4720, and now it should be trivial to factor those colors (and the ones from the bar graphs) out of the JavaScript and into the config.json file. @mheppler and I couldn't think of any sane way to put them in CSS.
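As a sketch of what the factored-out colors might look like in config.json (the key names and hex values here are placeholders I made up, not the actual config schema):

```json
{
  "colors": {
    "bar_charts": ["#336699", "#993333", "#339966"],
    "heatmap": ["#336699", "#993333"]
  }
}
```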
Yesterday I also added a 1.3 multiplier in 8eb26bb after showing Tania a few variations. The before (no multiplier) is on the left and the after (1.3 multiplier) is on the right:
I mentioned that Steve Worthington from DSS suggested switching from bar charts to scatter plots. He wasn't necessarily recommending http://d3plus.org/examples/d3plus-plot/shapeSort/ but that's the one I found quickly and we looked at it. The key thing is that with a scatter plot you don't have to start at zero on the Y axis. Here's a screenshot:
He definitely agreed that it's good that we're addressing the Y axis truncation issue reported at https://github.com/IQSS/miniverse/issues/59
Finally, I tried out dataverse-metrics with just Harvard Dataverse as the only installation ("Installation 2" above) and here's how it looks with the latest code:
In 38739d1 I fixed the link colors to match the Style Guide and in 6194224 I made the colors in the plots configurable.
I moved this to code review. I don't write a lot of JavaScript, so I don't know if I'm doing things wrong. There are some warnings when the code is run through https://jshint.com
Also, there's still a todo in the README about how there's no license. I don't know if that's important to add now or not. GitHub has a workflow where you can add a license file via a new branch and pull request so maybe we could treat that as a separate small chunk.
I made a few commits to resolve what I perceive to be the more egregious Javascript sins. One can copy and paste "plots.js" into https://jshint.com to see the remaining warnings.
I'm noticing that the treemap colors are not appearing as expected for a heatmap, and that there is a pair of duplicate colors in the code. Please try using the first and last colors in the string to generate the treemap shades (B22200, 282F6B), and see if that provides results that are more in line with what is expected. Thanks!
@TaniaSchlatter thanks for talking this out with me and @djbrooke
Good catch on the duplicate color. I removed that in 2df37b3
Then we decided to use just two colors in the heatmaps, so I updated the sample file in f1486b1
As we discussed, this is just a sample config file so anyone installing dataverse-metrics is welcome to use any color in the rainbow. 😄 🌈
Here's a screenshot of how the "just Harvard Dataverse" installation looks with just two colors in the heatmap. Again, these colors come from the existing colors on https://dataverse.org/metrics
@djbrooke said to go ahead and move this to QA. I would suggest testing both installation types. Again, they are:
Please be aware that if you change the config file, the easiest way to make your browser pick it up is to navigate directly to https://example.com/dataverse-metrics/config.json as in the screenshot below (Firefox) and hit refresh:
@pdurbin
I've seen the following when running the aggregator:
[root@dvn-vm4 metrics.dataverse.org]# python metrics.py
Traceback (most recent call last):
File "metrics.py", line 9, in <module>
I am running Python 2.6.6, but according to the project, Python 2 or 3 is OK. I renamed config.json.sample but did not edit the file because the values seemed OK. It is likely something obvious that I'm missing. I cloned the repo this way: git clone -b 4-d3plus https://github.com/IQSS/metrics.dataverse.org.git
There is data in the cache dir from today.
@kcondon thanks for the heads up about this. I didn't touch any of the Python code in my pull request, but I can definitely reproduce the error you're seeing on another one of our test servers. Below I'm including the output; the date commands show it took roughly 5-6 minutes to run:
[root@dvn-vm2 metrics.dataverse.org]# date
Mon Mar 25 11:59:47 EDT 2019
[root@dvn-vm2 metrics.dataverse.org]# python metrics.py
Traceback (most recent call last):
File "metrics.py", line 9, in <module>
main()
File "metrics.py", line 6, in main
aggregate.main()
File "/usr/local/metrics.dataverse.org/aggregate.py", line 18, in main
process_monthly_itemized_endpoints(monthly_itemized_endpoints, api_response_cache_dir, aggregate_output_dir)
File "/usr/local/metrics.dataverse.org/aggregate.py", line 99, in process_monthly_itemized_endpoints
for name_and_count in json_data['data']:
KeyError: 'data'
[root@dvn-vm2 metrics.dataverse.org]# date
Mon Mar 25 12:05:16 EDT 2019
[root@dvn-vm2 metrics.dataverse.org]#
@kcondon removing the following line from "config.json" that was added in pull request #2 fixes it:
"datasets/bySubject/toMonth"
This new endpoint produces the 7th tsv file that we decided not to try to visualize yet. (There are only 6 plots.) I'll go chat with @djbrooke about scope.
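For anyone hitting a similar KeyError, a defensive sketch of the parsing step (function names and the "id"/"count" item keys are illustrative, not the actual aggregate.py code) would skip responses lacking a "data" key instead of crashing:

```python
import json

def counts_from_response(raw_json, endpoint):
    """Parse a cached metrics API response; skip it (with a note)
    if the expected 'data' key is missing."""
    json_data = json.loads(raw_json)
    if "data" not in json_data:
        print("Skipping %s: no 'data' key in response" % endpoint)
        return []
    return [(item["id"], item["count"]) for item in json_data["data"]]

good = '{"status": "OK", "data": [{"id": "Law", "count": 217}]}'
bad = '{"status": "ERROR", "message": "endpoint disabled"}'
# counts_from_response(good, "…") yields the pairs;
# counts_from_response(bad, "…") yields an empty list.
```

That said, removing the endpoint from config.json (as above) is the right fix for now since nothing visualizes its output yet.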
Thanks @pdurbin. If we're not currently visualizing this, I'm OK excluding it right now. We can revisit if we decide to aggregate and visualize this.
@djbrooke great. In 093cf68 I pulled "datasets/bySubject/toMonth" out of the sample config and the README. As we discussed, I reopened #3 to track the fact that we do intend to add this 7th endpoint to the aggregator and create a 7th plot some day. I'll pass this back to QA.
It was decided that some doc showing how to configure this on Apache was needed for testing and deployment.
I added some docs for running dataverse-metrics on Apache in 3887c9e
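As a rough sketch of the idea (the ServerName and paths here are assumptions, not the exact contents of the docs in 3887c9e), an Apache vhost serving the static output might look like:

```apache
<VirtualHost *:80>
    ServerName metrics.dataverse.org
    DocumentRoot /var/www/html/dataverse-metrics
    <Directory /var/www/html/dataverse-metrics>
        Require all granted
    </Directory>
</VirtualHost>
```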
@pdurbin Ok, thanks for the doc. I followed it again, but I had missed that the clone command names the directory dataverse-metrics. That's helpful.
Currently they point to:
"api_response_cache_dir": "/var/www/metrics.dataverse.org/cache",
"aggregate_output_dir": "/var/www/metrics.dataverse.org",
Maybe they should point to: /var/www/html/dataverse-metrics/cache and /var/www/html/dataverse-metrics.
- Should we change the sample output directories in config.json.sample to reflect the instructions/Apache config?
Great suggestion. Fixed in b2cc387
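For clarity, the updated sample values suggested above would look like this fragment of config.json.sample (other keys omitted):

```json
{
  "api_response_cache_dir": "/var/www/html/dataverse-metrics/cache",
  "aggregate_output_dir": "/var/www/html/dataverse-metrics"
}
```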
Thanks
I pinged @donsizemore in IRC yesterday and he was kind enough to try installing dataverse-metrics for UNC. Here's how it looks:
@juancorr tried it out too (thanks!) and here's how his installation looks:
We want to reflect that aggregation can result in some minor discrepancies. At the bottom let's add:
“Metrics are aggregated from multiple Dataverse installations running different versions, with different caching schedules, with some metrics endpoints enabled and others disabled. Minor discrepancies in these metrics can be expected.”
Consider adding render logic that only displays this when there's more than one installation being visualized.
@djbrooke thanks, I added that note in 9069835, and it is hidden when there is only one installation. Here's how it looks:
@TaniaSchlatter also requested that the blue in the treemaps match the blue from "Total Files" and as of 9069835 they match. Here's a screenshot:
As I mentioned at standup, once pull request #4 has been merged, I'm planning on adding a license (Apache, same as Dataverse) and cutting a release. Having a new tagged version will make https://github.com/IQSS/dataverse-ansible/issues/51 easier.
I just added a license and cut a release: https://github.com/IQSS/dataverse-metrics/releases/tag/v0.2.0
The metrics on dataverse.org/metrics pull from a Miniverse instance running on AWS. The visualizations themselves are fine, but the Miniverse installation queries the Harvard Dataverse DB directly for the reported numbers, which is not scalable beyond Harvard. We should show the same metrics but instead pull numbers from the newish Metrics Aggregator (https://github.com/IQSS/metrics.dataverse.org), which can collect metrics from other installations. This reporting will represent the community and not just a single Dataverse installation.
So, same metrics, new source.