Spike: finalize the plan for transition to Make Data Count, how to display the metrics, how to handle legacy counts

pdurbin commented 4 years ago

Established Dataverse installations that have been operating for years might be reluctant to turn on Make Data Count (MDC) because download counts will be reset to zero unless the "classic" download counts are copied into the new "datasetmetrics" database table that powers MDC download metrics. For example, Harvard Dataverse has over 10 million "classic" downloads:

[Image: Screen Shot 2020-02-06 at 11 47 41 AM]

Many Dataverse installations probably don't have all the Apache (or Glassfish or whatever) access logs from years ago lying around, but the database table "filedownload" could be used as a source for timestamps of downloads from the "classic" system. After standup on 2020-02-05, @djbrooke and @kcondon talked about this, and I made the following diagram (best to open it in a new window since the text is so small):

[Image: make-data-count]

source for the image above: make-data-count.uml.txt

This is what I added to the diagram, which is based on http://guides.dataverse.org/en/4.19/admin/make-data-count.html#architecture

== Historical Logging ==
sysadmin --> exportLogsApi : GET /api/admin/mdc/exportLogs
exportLogsApi --> log : all history from database
main.py --> log : read historical logs
main.py --> datasetMetrics : write metrics to datasetmetrics table (using SUSHI, as below)
main.py --> reports : send metrics to DataCite

This is a bit hand-wavy because we'd still use SUSHI, as indicated by the Log Processing part of the diagram.

Roughly, the idea is this:

Create a new Dataverse API for sysadmins to use to export from Dataverse a series of logs that are compatible with Counter Processor (one per month for 10 years, for example).
Use Counter Processor to populate the new "datasetmetrics" table used by MDC by processing those logs that were exported.
Use Counter Processor to send the historical data to DataCite.
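
As a rough illustration of the first step, here is a minimal Python sketch of how download timestamps taken from the "filedownload" table could be bucketed into one log per month. Everything in it is an assumption made for illustration: the function names are hypothetical, the rows are stand-ins for a database query result, and the two-column tab-separated layout is only a placeholder. The real log format that Counter Processor expects has more fields, and the actual export would live in the proposed Dataverse API rather than in a script like this.

```python
from collections import defaultdict
from datetime import datetime


def group_downloads_by_month(rows):
    """Bucket (downloaded_at, file_id) rows by calendar month.

    `rows` is a hypothetical stand-in for the result of a query against
    the "filedownload" table; the tuple layout is an assumption.
    """
    buckets = defaultdict(list)
    for downloaded_at, file_id in rows:
        buckets[downloaded_at.strftime("%Y-%m")].append((downloaded_at, file_id))
    return dict(buckets)


def render_month_log(entries):
    """Render one month's entries as tab-separated lines, oldest first.

    Placeholder layout only: this is NOT the Counter Processor log format.
    """
    lines = [
        f"{downloaded_at.isoformat()}\t{file_id}"
        for downloaded_at, file_id in sorted(entries)
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample_rows = [
        (datetime(2015, 1, 3, 10, 0, 0), 7),
        (datetime(2015, 1, 1, 9, 30, 0), 8),
        (datetime(2015, 2, 1, 0, 0, 0), 7),
    ]
    for month, entries in sorted(group_downloads_by_month(sample_rows).items()):
        print(f"--- log for {month} ---")
        print(render_month_log(entries), end="")
```

Each monthly bucket would then be written to its own file and handed to Counter Processor, as in steps two and three above.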

See also pull request IQSS/dataverse#6543

classic table: http://phoenix.dataverse.org/schemaspy/latest/tables/filedownload.html
MDC table: http://phoenix.dataverse.org/schemaspy/latest/tables/datasetmetrics.html

landreev commented 1 year ago

Linking the issue IQSS/dataverse#9025 in the main project - a lot of the more recent discussion concerning this issue was happening there. As we revisit it during this spike, let's make sure to take any potentially useful information there into consideration.

jggautier commented 1 year ago

I'm listing some of the collections in the Harvard Dataverse whose admins either rely on download counts now or have told us that they are very interested in being able to rely on them, such as for measuring the impact of their data and data sharing efforts. We/I can talk to the admins of these collections so that how we implement Make Data Count in Harvard Dataverse is informed by a better understanding of the needs of users in the Harvard Dataverse:

Admins of the collections in the IFPRI Dataverse
Admins of the MIT Library Dataverse
Admins of the collections in the China Data Lab Dataverse
Admins of the #metoo Digital Media Collection

landreev commented 1 year ago

It does sound to me like it has been established that there are local users/collections who value their existing download counts. While there may be some value in further investigating their needs, I'm not sure it's really necessary for the purposes of deciding how to proceed with the dev. plan. (We already know we can't afford to drop the existing counts.) It does sound like we have a degree of consensus that we want to implement what we've been referring to as the "QDR solution": an option to display both the old-style counts, collected prior to the start of the MDC records, and the MDC metrics. This will of course still be optional; an installation will still have the option to stick with the "classic", non-MDC counts, or with the MDC counts exclusively. This is also explained in some detail in the linked issue https://github.com/IQSS/dataverse/issues/9025. So let's "finalize" the plan by prioritizing either merging the existing IQSS/dataverse#6543 or, if that one is too old, pulling in the QDR changes via a new PR.
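
To make the three display options concrete, here is a small Python sketch of the selection logic. It is only an illustration under stated assumptions, not the code in IQSS/dataverse#6543 or in QDR's fork: the function name, the mode strings, the parameter names, and the returned keys are all hypothetical.

```python
from datetime import date


def counts_to_display(mode, legacy_download_dates, mdc_start, mdc_total):
    """Pick which download counts to show for a dataset.

    mode: "classic", "mdc", or "both" (hypothetical option names).
    legacy_download_dates: dates of downloads recorded by the classic system.
    mdc_start: date on which MDC records begin for this installation.
    mdc_total: download total reported by the MDC metrics.
    """
    if mode == "classic":
        # Classic counts only: every recorded download, regardless of date.
        return {"downloads": len(legacy_download_dates)}
    if mode == "mdc":
        # MDC metrics exclusively.
        return {"mdc_downloads": mdc_total}
    if mode == "both":
        # Old-style counts from before the MDC start date, shown
        # alongside the MDC metrics collected since then.
        pre_mdc = sum(1 for d in legacy_download_dates if d < mdc_start)
        return {"legacy_downloads": pre_mdc, "mdc_downloads": mdc_total}
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    dates = [date(2019, 12, 30), date(2020, 1, 15), date(2020, 3, 1)]
    print(counts_to_display("both", dates, date(2020, 1, 1), 5))
```

In the "both" case, only classic downloads dated before the MDC start are counted as legacy, so a download is not shown in both numbers.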

pdurbin commented 1 year ago

Discussed at standup. No objection to showing both counts.

The next step is probably to see if we can update and merge Jim's pull request:

https://github.com/IQSS/dataverse/pull/6543

pdurbin commented 1 year ago

We gave the PR a 10

This one: https://github.com/IQSS/dataverse/pull/6543

jggautier commented 11 months ago

For the sake of posterity, I should say that although I wrote that I'd like to learn from admins of collections in Harvard Dataverse who generally rely on metrics, I wasn't able to talk with them about this. One of the things I wanted to learn was if it was necessary to show both counts.

Unfortunately, @landreev let me know that my comment was taken to mean that these users would like both counts to show, which may or may not be true, and that it was taken as support for a solution where both counts are shown.

IQSS / dataverse.harvard.edu

Spike: finalize the plan for transition to Make Data Count, how to display the metrics, how to handle legacy counts #75