Spike: finalize the plan for transition to Make Data Count, how to display the metrics, how to handle legacy counts

pdurbin commented 4 years ago

Established Dataverse installations that have been operating for years might be reluctant to turn on Make Data Count (MDC) because the download counts will be reset to zero unless something is done to somehow copy the "classic" download counts into the new "datasetmetrics" database table that powers MDC download metrics. For example, Harvard Dataverse has over 10 million "classic" downloads:

Screen Shot 2020-02-06 at 11 47 41 AM

Many Dataverse installations probably don't have all the Apache (or Glassfish or whatever) access logs from years ago lying around but the database table filedownload could be used as a source for timestamps of downloads from the "classic" system. After standup on 2020-02-05 @djbrooke @kcondon talked about this and I made the following diagram (best to open it in a new window since the text is so small):

make-data-count

source for the image above: make-data-count.uml.txt

This is what I added to the diagram, which is based on http://guides.dataverse.org/en/4.19/admin/make-data-count.html#architecture

== Historical Logging ==
sysadmin --> exportLogsApi : GET /api/admin/mdc/exportLogs
exportLogsApi --> log : all history from database
main.py --> log : read historical logs
main.py --> datasetMetrics : write metrics to datasetmetrics table (using SUSHI, as below)
main.py --> reports : send metrics to DataCite

This is a bit hand wavy because we'd still use SUSHI as indicated by the Log Processing part of the diagram.

Roughly, the idea is this:

Create a new Dataverse API for sysadmins to use to export from Dataverse a series logs that are compatible with Counter Processor (one per month for 10 years, for example)
Use Counter Processor to populate the new "datasetmetrics" table used by MDC by processing those logs that were exported.
Use Counter Processor to send the historical data to DataCite.

See also pull request IQSS/dataverse#6543

classic table: http://phoenix.dataverse.org/schemaspy/latest/tables/filedownload.html
MDC table http://phoenix.dataverse.org/schemaspy/latest/tables/datasetmetrics.html

djbrooke commented 4 years ago

Does COUNTER care if there are missing fields from these counts we move over?
What we currently missing? What would we miss out on if we don't comply with COUNTER?
We don't expect to do any log scraping as part of this

qqmyers commented 4 years ago

FWIW: I think the primary thing Counter is doing that affects the aggregate counts being shown today is to remove robot hits. If robots have been being excluded by a robots.txt file, then this may not be too important. However, the way it works is to review ip addresses versus a blacklist, so not having ip addresses would stop that. I haven't found any ip address info in the database - not sure if there's some other field that would at least indicate a human/non-robot access. I'm not sure if having a null ip will cause it to break or not - I suspect breaking as one of the errors I previously found (logging un-published downloads) was something I caught because counter threw an error with a null field.

In terms of finer breakdowns, counter also uses ip address to find countries, and I think it looks for sessioncookie info to assess unique counts. The sessioninfo in the filedownload table might be of use there. The human/machine separation is from the user-agent - not sure there's anything in the tables to guess at that (or at least break browser from curl api calls).

pdurbin commented 4 years ago

@qqmyers thanks for your comment. Yes, there are a lot of unknowns here, which is why this issue got estimated as a large yesterday in sprint planning.

Heads up that I just updated the description to have more than just a diagram and a link to your pull request. I hope it helps clarify what we talked about.

Further questions and comments are welcome from all, of course!

jggautier commented 4 years ago

1. What are we currently missing [regarding downloads]?

It's not clear to me, and perhaps this investigation can include figuring out if and how the following information about pre-MDC counts can be determined:

Legitimate machine access (as opposed to internet robots and crawlers)
Regular versus machine downloads/explores. Regular means "typical user behavior" and machine means "typical [non-robot and crawler] machine behavior"
Downloads/explores that exclude double clicks. If a file is downloaded/explored in the same session within at most 30 seconds, that must count as only one download/explore.
Unique downloads/explores. How many times did a download/explore happen in a session?
Countrycodes. In what country was the user/machine when she/it downloaded the file?
Dataset versions. What dataset version was the user/machine on when she/it downloaded the file?
Dates of downloads. For many file downloads in the Harvard Dataverse database, there is no timestamp, so we don't know when (what year, month, day, time of day) the file was downloaded. Is this the case for any other Dataverse installation?

2. Does COUNTER care if there are missing fields from these counts we move over?

"Section 3.3.9: Missing and Unknown Field Values" of the COUNTER guide says that COUNTER will accept missing/unknown values for fields, and indicates how to express missing values for optional and required fields.

Which fields are optional and required? The COUNTER guide says to refer to "Section 8" of the "Research Data SUSHI API Specification". This article cites this API documentation as the Research Data SUSHI API Specification. I can't tell what section 8 is, but there's a page called "dataset usage" model that has red asterisks next to a few fields, like dataset title, type, and platform. Are these the required fields, and is everything else optional?

3. What would we miss out on if we don't comply with COUNTER?

Are the auditors checking on a pass/fail basis? Or are they grading implementations on a scale? (The bottom of the COUNTER guide lists companies doing audits.)

From the previous comments, it sounds like we're limiting ourselves to info in the database because it's the source that Dataverse-based repositories are most likely to have access to. Not every Dataverse-based repository will have access to access logs, which is the only other option.

@djbrooke, for questions 2 and 3 should we schedule a meeting with the COUNTER team? If we can confirm that many of the types of metrics in the first question can be missing from monthly reports, e.g. machine counts, unique counts, dataset versions, it might make the conversion easier (or possible).

@qqmyers has already done some digging in the database for parts of the first question. It seems like the big question is if the database can help us determine human (regular) downloads/explores versus machine downloads/explores that exclude bots and crawlers. I also don't remember a way to do this using info in the database, but I can look. I also plan to explore the database to see how we might determine when a download/explore happened within a session and on what dataset version a file download happened.

If anyone else with time or familiar with the database can help answer these questions, please feel free :)

qqmyers commented 4 years ago

FWIW: If the access logs are available for some instances and they do a better job than the db, it might be worth considering a split - use access logs or do the dual counts as I proposed earlier. They should co-exist well - the cut-over date for dual counts could be the start of time (1970!) by default so it wouldn't trigger if the access logs had been mined.

jggautier commented 4 years ago

I agree @qqmyers.

Just ran into a Google doc that lists methods for getting info for MDC counts. It's targeted to Harvard Dataverse, but I think it supports the idea that the access logs would be the only way for Dataverse-based repositories to distinguish between "regular" and "machine" downloads (and exclude bots by filtering by IP address), which is required for converting "classic" downloads. We can't send request counts that combine "regular" and "machine" downloads.

We also can't send counts unless we can exclude double clicks, which Harvard Dataverse can't do because ~630k older download counts don't have timestamps.

So even if we can send reports that are missing most info about requests (e.g. unique counts, dataset versions, dates of requests), it seems that repositories that want to provide MDC counts but (1) cannot use access logs or (2) have download counts without timestamps cannot convert their "classic" download counts. For those repositories, they can either:

not provide pre-MDC counts or
offer those counts somehow in addition to MDC counts (e.g. the dual count proposal)

That Google doc also brings up another question about sessions: When determining unique and total counts, can we use the session cookie ID (instead of the user cookie ID) to define a session?

That Google doc defines session cookie IDs and user cookie IDs. The article (page 27) says that session cookie IDs (which Dataverse's database does store) can be used to filter double clicks, although it's not the most reliable way. This filtering also must be done for MDC compliant metrics.

But when determining unique versus total counts, can Dataverse also define a session by the session cookie ID instead of a user cookie ID? That Google doc says we don't have user cookie IDs.

More questions to ask the COUNTER team.

jggautier commented 4 years ago

Best effort (instead of all-or-nothing)

@djbrooke shared a confirmation from Daniella Lowenberg that no fields are required and that a best effort should be made to convert "classic", non-standard counts to standard counts. The guides and specs say counts need to exclude bot/crawler activity and double clicks.

If Dataverse repositories believe or are not sure if their download/explore counts include bot/crawler activity, they should make their best effort to exclude as much as possible by:

matching the ip addresses in their access logs to lists of known bot/crawler ip addresses
?

Dataverse repositories should also make their best effort to exclude double clicks from their "classic" counts, which can be done by using the timestamp and session cookie ID of each download/explore recorded. Timestamps and session cookie IDs may be found in the Dataverse repository's database and/or its access logs.

After this best effort for filtering counts, counts can be converted to MDC counts, even if the repository believes some counts may represent bot/crawler activity or double clicks.

This line of reasoning seems different than the thinking behind the proposal that @qqmyers supports for providing "classic counts" in addition to MDC counts, but...

Counts with unknown month/year

The standardized reports are made monthly, so I'm assuming that Dataverse needs to report the month and year when each download/explore occurred. For example, in this example JSON report, the "Period" object on line 70 must be included and must contain a "begin-date" key with a date value that is the first day of the report's month.

This requirement seems justified since, just like excluding bots and double clicks, knowing when some activity happened improves the quality of the metric.

Are there repositories that have downloads/explores whose months and years can't be determined? If yes, what should be done about those counts?

Harvard Dataverse's database doesn't have timestamps for around 600k download/explore counts. I think the GitHub issue about this (https://github.com/IQSS/dataverse/issues/3324) doesn't mention trying to find this information from access logs. But @landreev or @scolapasta, would you know if this was already considered?

I queried the database to see which datasets in which dataverses would lose download/explore counts if Harvard Dataverse removed the counts with no timestamps, and can share the results if needed.

scolapasta commented 4 years ago

@jggautier we did consider it, but unfortunately, these downloads are from before we tracked downloads individually, i.e. we just had a count. When we went to the current model, we created a row for each, but didn't know if the specific date. Access logs from that time are no longer available. (discussed with @landreev to confirm, as well)

jggautier commented 4 years ago

Great, thanks @scolapasta and @landreev!

I think we could contact installations to find out:

Which installations have counts that can't be converted because the month/year is unknown.
What those installations want to do about the counts where the month/year is unknown. (Discarding counts that can't be converted would mean that many of the older datasets in Harvard Dataverse would lose the majority of their counts.)

Would the "best effort" instructions we provide to installations for converting the timestamped counts include what @pdurbin wrote earlier?:

Create a new Dataverse API for sysadmins to use to export from Dataverse a series logs that are compatible with Counter Processor (one per month for 10 years, for example)

Use Counter Processor to populate the new "datasetmetrics" table used by MDC by processing those logs that were exported.

Use Counter Processor to send the historical data to DataCite.

djbrooke commented 4 years ago

@jggautier thanks! All sounds good.

For converting counts, instead of contacting other installations we could make a decision on our own for Harvard Dataverse and then provide the guidance/infrastructure to installations so they can do what we did (or use some parts of the guidance/infrastructure to make their own decisions about old counts).

Generally, what @pdurbin suggested makes sense as well.

jggautier commented 4 years ago

Some questions and observations that might help with the decision:

Already noted that if we don't make available (in any way) the download counts that can't be converted, that would mean many of the older datasets in Harvard Dataverse would lose the majority of their existing download counts. Could the decision to not make the "classic" counts available in any way be based on how these dataset owners are using these counts (such as reporting them to funders) and what would happen if the classic counts were unavailable to depositors/people who manage those datasets?
Assuming that when MDC is turned on, we'll continue to store and collect "classic" download counts in the Dataverse database and we could somehow make those "classic" counts available (not in the UI), if there are people or systems who will get dataset metrics only from DataCite in the MDC standard, those people/systems wouldn't know about the download counts that couldn't be converted to the standard. For example, someone using DataCite's APIs to programmatically get activity counts of datasets or another platform using DataCite's APIs to display activity counts wouldn't know about a particular dataset's 12k "classic" downloads that couldn't be converted. For a better indication of that dataset's popularity/quality, those people/systems would need to get the "classic" download counts from Harvard Dataverse.
If the "classic" download counts are somehow provided to dataset depositors/curators and others, will making those download counts harder to find, because they won't be displayed in the UI, make them less effective at indicating popularity/quality? For example, will the counts be less effective if they're only in a tool tip or available only using the API and not as easy to find as the MDC counts?
When MDC is turned on, what if Harvard Dataverse never reports/displays MDC download counts for the 1446 datasets whose "classic" download counts can't be converted? For those datasets, only the non-MDC download counts would be available. Harvard Dataverse would still provide pageview counts for those datasets, and the MDC reports sent to DataCite for those datasets would always include only the pageviews. (That is, in the MDC reports, "requests" would never be included for those datasets, and the "investigations" would include only pageviews.) People/systems couldn't get download counts from DataCite for these datasets.
Or what if Harvard Dataverse does report displays only MDC download counts, including for those 1446 datasets, but also indicates that the dataset has download counts that couldn't be converted to the MDC standard and are available in some other way (like using the Dataverse API)?
@djbrooke asked what if we add dates to downloads that don't have them? The date might be some date before Dataverse started tracking when files were downloaded. @scolapasta said that was before guestbooks were introduced. Or maybe the date could be 2008-06-01, the first day of the month before 2008-07-31, which is the earliest download date (guestbookresponse.responsetime) recorded in Harvard Dataverse.

jggautier commented 4 years ago

We decided first to work on converting counts that have timestamps and exclude counts that don't. While that work continues, we'll keep looking into what to do with the counts that don't have timestamps (including asking other repositories if they ran into this issue and how they handled it, and if the MDC leaders have suggestions).

Continuing from what @pdurbin wrote about converting timestamped counts:

Create a new Dataverse API for sysadmins to use to export from Dataverse a series of logs that are compatible with Counter Processor (one per month for 10 years, for example)

Does that mean that this sysadmin API will take the timestamped counts and use information in the database and access logs to try to:
- remove download counts caused by bot/crawler activity?
- distinguish between duplicate and unique counts?
The API would export a series of logs. Are the logs the JSON reports that are sent to DataCite (like this example), mentioned in the third step below?

Use Counter Processor to populate the new "datasetmetrics" table used by MDC by processing those logs that were exported.

Use Counter Processor to send the historical data to DataCite.

This will include all counts up to and including the previous month, right? For example, if on June 15 the Counter Processor is used to populate the new "datasetmetrics" table and send the historical data to DataCite, the historical data would be activity (downloads/explores) that happened up to and including May.

Then when the repository starts sending regular monthly reports to DataCite, it'll continue with the following complete month? In the example, the report would be sent sometime in July and include activity in June.

qqmyers commented 4 years ago

counter_processor should be able to remove bot and duplicate counts if the info is available, so if the sysadmin has a way of determining these things, it might be easiest if they just include that info in the made-up log rather than trying to process themselves. (I'm not sure I understand what info is available but suppose, for example that only one person has requested access to restricted files in a dataset and there are three quick downloads of them. If there isn't any info about the session or IP address to give counter-processor, one could still infer that they are from the same person. If so, rather than trying to calculate whether the downloads should all be counted, one could just add made-up info about the session to the log and let counter-processor do the math. Similar for other things - rather than making the decision in new code, one could just make sure the choice being made about whether its the same person, or whether it was a robot is conveyed in the made-up log file and counter-processor can handle it from there.)

counter-processor nominally runs daily and reports in the current month, up to yesterday. Once it's processed, it reports to Dataverse, which clears the table for the current month and takes the new aggregate info, and it updates with DataCite - again reporting new aggregate numbers that are used in place of the old ones.

To process old months, you can tell counter-processor the year-month combo to process and will process all the logs for that month and send one report to Dataverse and DataCite, i.e. no need to run it for each day in an old month. Similarly, if the cron breaks and you skip some days, as long as you process again in that month, counter-processor will catch-up and report all the days it missed.

landreev commented 4 years ago

Aside from the question of how much, and what kind of extra info can be extracted from the logs, I just want to make sure we don't count on it too much. Specifically when the downloads that happened in the past are concerned. In our own production environment we only keep the access logs around for some number of months. We have never assumed that it was something we want to archive and store permanently. We should also assume that the way other installations handle this may vary, spanning the full range from permanently archiving all their access logs, to not having the access logs enabled at all.

For the purposes of accounting for ongoing downloads, it is safe to assume that the logs for, say, the past week are always going to be present. (And we can instruct other installations that this is required). But it would probably be worth to ensure that we store all the information needed for the MDC metrics right away. So that going back to reading access logs isn't going to be necessary (this may already be the case, going forward, I'm just not 100% sure from reading the above).

landreev commented 4 years ago

@qqmyers (et al) On the issue of recognizing "bot/crawler activity": In the context of our own prod. environment here at Harvard, when we talk about "bot activity" we often mean a very specific type of activity: A very large portion of our downloads here are "partial downloads" from some automated archival network - where a user publishes a (large, multi-GB) file and then armies of these automated clients keep making calls for very small portions of these files, often just a few bytes at a time, with the byte range argument added to the download urls. These clients make thousands of these calls; that, in the Dataverse accounting system result in full GuestBookResponse entries; so for all practical purposes there's no way to tell that those were not real, full downloads of that multi-GB file! I opened IQSS/dataverse#5957 for this last year; would be super easy, to add some simple way for marking these downloads (or to not create "guest book responses" for these at all!) but for whatever reason it ended up on the back burner. (for the record, the only reason this even works is that we redirect to S3 - the extra offset/length parameter is passed to AWS which recognizes it. Dataverse, on the other hand, does not. So for a locally stored file it would actually make an honest attempt to serve the entire file! - another potential easy fix). In our production here I can go back and recognize these partial downloads in the access logs - but only because I added "%header.range%" to the Glassfish access log configuration. But, again, this can only be done for as long as the logs are kept around.

I guess the bottom line is, this would be something very specific to individual installations, and their use patterns. And subject to their admins looking at the logs and recognizing some use patterns as bot activity. (the word "bot" in the logged %header.user-agent% field can be a giveaway too). But I don't think we'll be able to provide scripts that would reliably do this for other installations. (And I understand that this is somewhat along the lines of what you were saying above too).

qqmyers commented 4 years ago

@landreev Once mdc logging is enabled, all the info needed is in the new logs - no need for access logs anymore FYI - I'm @qqmyers (one -e) so I'm not getting your @qqmeyers For bots, counter-processor uses dynamic lists for bots and machine counts (see https://raw.githubusercontent.com/CDLUC3/Make-Data-Count/master/user-agents/lists/robot.txt for the list of words/regexs it looks for) . As long as the user agent info is available, the range info may not be needed. (FWIW - with range queries - that's what's needed to be efficient with mime-type recognition (#6762 ) - you usually only need the first few to ~1K bytes, so making range queries possible in the StorageIO classes (file included) will be needed there. That should help the non-redirect download case above if range handling is allowed in the api.)

landreev commented 4 years ago

I'm @qqmyers (one -e) so I'm not getting your @qqmeyers

lol, yes - but do note that I eventually noticed, and corrected it!

landreev commented 4 years ago

As long as the user agent info is available, the range info may not be needed.

True. Most of our byte range access requests are from this user agent "reqwest": (but do note almost just as many with no user agent supplied)

# grep -v 'NULL-HEADER-RANGE' access/server_access_log.2020-0[45]* |
> grep 'GET /api/access/datafile' |
> awk '{print $9}' |
> sort | uniq -c | sort -nr +0 -1
 266606 "NULL-USER-AGENT"
 203893 "reqwest/0.9.24"
  94405 "reqwest/0.9.22"
   3744 "reqwest/0.9.19"
   1729 "Mozilla/5.0
    489 "Mozilla/4.0
     17 "Core/1.63.5221.400
     14 "okhttp/3.9.0"
     14 "facebookexternalhit/1.1
     12 "reqwest/0.9.20"
      5 "FDM/5.1.38.7312"
      2 "Python-urllib/3.7"
      1 "Wget/1.19.4
      1 "HUAWEI_SCL-TL00_TD/5.0

jggautier commented 4 years ago

Thanks. This is great! I'd like to make sure we're using the same terms for things, preferably the terms used in the COUNTER spec, to reduce confusion (maybe just my own confusion as the least technical person here :) ). For example, @qqmyers you wrote that "counter-processor uses dynamic lists for bots and machine counts." These dynamic lists are used for recognizing activity that should be excluded, right? But the COUNTER spec allows for counting "legitimate" machine activity, like scripts run on a user's computer. I wouldn't want anyone else reading this thread to think that machine counts and bot counts are the same thing. It's important that any method for excluding activity doesn't exclude legitimate machine counts.

I've been keeping a table (just moved it to a Google Sheet) so that I understand how the decisions we make, and the decisions made by CDL's counter-processor, affect counts in any situation I'm aware of, including the instance of partial download counts reported in IQSS/dataverse#5957. If this might be helpful for anyone else, please feel free to contribute or question an interpretation. (I know that for this issue's scope, only the situations involving downloads/explores/requests matter, since Dataverse installations haven't been counting other types of activity, like pageviews (outside of third-party analytics tools)).

@landreev. You wrote in IQSS/dataverse#5957 that the group that downloaded only a part of many files did that "to keep their distributed copies in sync". I just want to make sure I understand why we should label that activity as bot/crawler activity. The COUNTER spec defines "Internet Robots and Crawlers" as agents whose actions are "intended for search indexing and related applications".

So do we agree that machine activity related to keeping distributed copies in sync would fall under "related applications", or that it's a process of that group's search indexing, and should be considered bot/crawler activity?
Are we suggesting that any machine activity where only part of the file is downloaded should be considered bot/crawler activity? When someone tells a machine to download only part of a file, is it only for search indexing/related purposes?

Since there's no "one-size-fits-all" method for recognizing "bot/crawler activity", are there a number of methods we can recommend to installations to ensure that they exclude bots/crawlers? It sounds like installations can't always rely on user agents or the dynamic lists that counter-processor uses (we couldn't have anticipated that "reqwest" would start download parts of thousands of files). But if I understand what @qqmyers wrote, if an installation discovers activity that was reported as legitimate but should have been excluded as bot/crawler activity, that installation can re-send corrected reports for the months of that activity.

landreev commented 4 years ago

@jggautier Re: all those automated partial downloads: I can look up the actual conversation with the user who published the original file that started the whole thing. I don't remember all the details tbh; that big 2.5GB file is an archive of some TeX macros. Why the individual clients need to make thousands of partial downloads at regular intervals exactly, I can't necessarily explain. But I do think it's not a stretch to say that it probably falls under "search and indexing and related applications".

My main arguments in favor of NOT counting these as regular downloads: It just feels wrong. I'm not talking about the MDC metrics - just the fact that these thousands of downloads of just a few bytes each are shown as if somebody actually downloaded a multi-GB file thousands of times. It does likely distort our download statistics at this point. Also, it's kind of a waste of resources on the Dataverse side - each download count is actually a database table entry (in the GuestBookResponse table).

Another way of looking at this, maybe we should not focus on the "partial", but instead on the fact that they are repeated downloads from the same IP address. So, perhaps we should count the first one as a regular download; but ignore the consecutive hundreds of them if they are happening within some number of hours; or within the same day etc. This way we don't need to pay attention to whether they are requesting just a few select bytes, or the whole file.

landreev commented 4 years ago

(Just want to emphasize that I don't have a strong opinion on how we should be treating these "partial downloads". My only strong opinion is that what we are doing now feels wrong. Not counting any partial downloads would be a bit over-the-top. Asking for 1GB from a 2GB file, for example, does feel like a "real" download... So maybe my other idea, not counting repeated downloads on the same file over a short period of time is a better bet?)

jggautier commented 4 years ago

Ah, thanks @landreev. I'm leaning toward thinking that the partial downloads reported in https://github.com/IQSS/dataverse/issues/5957 should be excluded from the MDC downloads, too (should not be populated in the datasetmetrics table so shouldn't be reported to DataCite).

I've always assumed that https://github.com/IQSS/dataverse/issues/5957 wasn't prioritized because this MDC work would take care of problems like these, by excluding this type of activity or differentiating unique from total counts, and "regular" counts from machine counts.

Does that make sense? If it's not clear from the conversation with the user who published the original file that started the whole thing in https://github.com/IQSS/dataverse/issues/5957, if you like I could reach out to them to ask.

There might be Dataverse installations that continue to provide non-MDC counts, which also contain repeated downloads of the same file over a short period of time. But I think it's better to encourage and help Dataverse repositories to follow a standard other platforms are following, too.

Also, it's kind of a waste of resources on the Dataverse side - each download count is actually a database table entry (in the GuestBookResponse table).

Maybe this could be resolved as the GuestBookResponse table is further separated from the download stats? (Of course I'm thinking this is out of scope for this issue/spike.)

I think the next steps are:

Determine if the partial downloads described in https://github.com/IQSS/dataverse/issues/5957 should be excluded as bot/crawler activity or counted as legitimate machine activity (does that involve adding a user agent to a blacklist?)
Test populating the datasetmetrics table (also excluding downloads/explores in the guestbookresponse and/or filedownload tables that don't have timestamps)
Compare the MDC counts in the datasetmetrics table to the "classic" counts in the guestbookresponse and/or filedownload tables to make sure everything's working as we expect (Seeing the differences will also help us determine how to communicate the changes in counts once they're later displayed in the Harvard Dataverse UI)
Once we're satisfied, populate the datasetmetrics table in Harvard Dataverse's production database and send the reports to DataCite

landreev commented 4 years ago

@jggautier @qqmyers @pdurbin @scolapasta @djbrooke Back to this issue. A piece of good news: we have significantly more saved access logs than I realized/reported yesterday. Turns out I squirreled away and saved the logs from the LTS-maintained servers back in 2018, as we were moving to AWS. Continuing the squirrel analogy, somewhat gratuitously, it appeared that I had forgotten where exactly those nuts were buried. But I found them after all, so we now have the apache logs for the entire, or most of the, lifespan of Dataverse 4 - beginning April 2015. Or the time period during which 90% of all the "legacy" downloads were served. This does NOT necessarily mean that we'll be able to reliably track every single GuestBookResponse record from the database to its access log entry (there may be some gaps, or logging failures). And some log entries are still missing at least some of the info that MDC likes to have. But we should have enough of that information for this whole idea of cooking up MDC logs retroactively to be looking like a viable option again; as we were having some doubts about it yesterday.

I still don't fully know how to get it done. But I now understand what we are trying to achieve, enough to know what questions to ask. So I'm going to be doing that, below.

landreev commented 4 years ago

@jggautier @qqmyers @pdurbin @scolapasta @djbrooke OK, for most (?) of you, the summaries below is something you already know. This is for my benefit - let me know if I'm still not getting it right.

The reason we want to convert these "legacy" downloads into MDC is that otherwise the current counts will reset to 0 when we turn MDC on.
QDR/@qqmyers's solution (#6543, show legacy and MDC metrics, both) was considered, but rejected. (IMHO - I feel like it's worth reconsidering).
The things we are hoping to extract from the logs, that our GuestBookResponse records do not have: USER AGENT: (browser, software client, etc.) important because Counter Processor relies on it to identify "machine access" requests. (Although, in an apparently rather crude/primitive way!) None of our GuestBookResponse entries have that; some of our logs will have that. IP ADDRESS: SESSION ID: Same as above; CP uses this info to filter out repeated ("double click") requests. The access logs entries will have this info for some downloads, but not for all. Questions:
1. Is it ok to cook up MDC logs for the downloads for which we don't have all the info above? Rather than dropping them? This may have been actually answered, in that "best effort" discussion way above; but I'm just making sure. Sounds like yes, I am cooking up MDC entries, for them with "-" in place of the missing values above for these, correct?
2. What exactly am I doing for the downloads for which we don't have ANY information at all? These are the prehistoric, DVN-period downloads for which we only know the counts per file, not even dates. There's no way to retrieve that information. These constitute only ~5% of all the counts (~700K out of 15M). But there are probably specific files for which these legacy^2 downloads constitute most of the counts. So I don't think we want to drop them outright. But then the only way to get them counted w/ MDC would be to make up some fake dates for them. Up in the issue there's a quote from the MDC spec that says that everything in that CP log file is "optional". But then CP logs themselves are dated (there's one log file per day); so basically I would need to pick some fake date in that "pre-historic" range, and place that download into the CP log for that day. This feels pretty... sketchy - but I can't think of a better way. And I am open to suggestions.
3. Again, this may have been clear to everybody else but me, but to confirm: Even for the downloads for which all the extra information can be obtained from the logs, the MDC counts may end up lower than the previously maintained legacy counts. Especially so, actually - since MDC uses that extra information to identify some downloads as "double clicks" and such, and drop them from the counts. Provided the above is correct, is everybody ok with this? It kinda sounds like at least some people were actually viewing these numbers as meaningful and important... are they going to be ok seeing them go down?
4. Please note that the issue of the "Tectonic" downloads, that was discussed at great lengths above is NOT an open question anymore; that's been resolved and I know what to do with them exactly.

Sorry that was so long. I'm not personally convinced all that MDC business is worth this amount of attention we've already put into it. But it sounds like it's important to at least some people - so here we are.

landreev commented 4 years ago

@jggautier @qqmyers @pdurbin @scolapasta @djbrooke What exactly is the time frame for this anyway? Is there any need to start showing MDC metrics in our production starting any specific date? Per @qqmyers' advice, I have enabled MDC logging on our prod servers. Without switching to displaying MDC metrics just yet.

landreev commented 4 years ago

@jggautier @qqmyers @pdurbin @scolapasta @djbrooke Unrelated to the legacy counts, but a couple of questions about the MDC logs:

Is that by design that we are saving the storage identifier (for ex., s3://dvn-cloud:1523be...) in the filename column? Shouldn't it be the real filename, from the FileMetadata, under which the user is actually downloading it? (is it ever used for any meaningful purpose?)
Not a question, but - yes, it does look like even going forward there will still be some required pre-processing that we'll need to be doing on these MDC log files before feeding them to Count Processor; consulting the access log file in the process. Specifically, the ip address in the counter*.log file on our prod. servers is NOT the real ip, but the ip of the AWS ELB proxy. I think in practical terms, that would result in Counter Processor potentially mis-typing many guest downloads from different users as repeated downloads coming from the same address. So we'll need to either get the real address from the access log file; or add code to the application to (optionally) get it from the x-forwarded-for, or another, configurable header or cookie. Modifying the "user agent" field for what we know are "machine downloads" is another case.

qqmyers commented 4 years ago

1) Not sure why this choice was made - seems like the real filename makes more sense unless there's something in the spec that specifies you want something more unique (e.g. than readme.txt) 2) BUMMER! Looks like QDR has this issue as well and the apache logs we have also capture the LB IP addresses as far as I can tell (still checking). Aside from potentially causing collisions, where I think the session Ids will still help distinguish, this will completely break geo-spatial assignments which is another important add of MDC. Regardless of how fast back logs are processed, getting good info for new counts, whether through local config of AWS LBs or code change should probably be prioritized (and shared with the community assuming others use load balancing.)

qqmyers commented 4 years ago

W.r.t. your earlier comment: 3) I think one point of MDC is to get realistic counts (from everyone, some systems have more robot counts than others), so the decrease is probably expected/acceptable as long as it is really due to getting rid of bad counts and not missing good ones due to lack of logs. For 1 and 2) Those are good policy questions - to decide what 'best effort' is and whether best effort far back in time is good enough to report things as MDC. My guess is that when you have no dates at all, doing something like assigning everything to your starting date or smoothing them over the whole prior time period, etc., or assuming a typical rise/decay pattern after publication, or something else could be done, but I'd guess avoiding too much fidelity when its all simulation is best and that a simple solution like everything on one date makes the most sense. Again - judgement call.

qqmyers commented 4 years ago

One other note - I don't think Dataverse keeps any detail except the final counts and that may be all that gets to DataCite as well (I think that's true) so the filename being storageidentifier may be moot/just used internally by CP if at all and as long as its unique it should be OK.

landreev commented 4 years ago

@qqmyers BTW, I kept saying "apache logs", but they are really glassfish access logs. It is the same log file format as Apache, I believe. Both Apache and Glassfish are going to log the proxy address(es) - because that's where the requests are coming from. The only reason we have the real address logged is because we have added the x-forwarded-for header explicitly: <access-log format="%client.name% %datetime% %request% %status% %response.length% %header.user-agent% %header.referer% %cookie.JSESSIONID% %header.x-forwarded-for% %header.range%"></access-log>

landreev commented 4 years ago

@qqmyers

... Aside from potentially causing collisions, where I think the session Ids will still help distinguish,

That's why I said "guest downloads" - I was assuming anonymous downloads of public files are not going to have session ids... - but I have checked and it looks like by the time an MDC log entry is created, even an anonymous call is already issued a guest session with its own id... So, is it a reverse problem then? If I keep downloading the same public file with curl every second, without supplying an api token, or a session cookie, I get a new session every time - and MDC will treat them as legitimately distinct downloads? OK, I don't think this is a "problem" per se, that we should be losing sleep over.

jggautier commented 4 years ago

Looks like your timeframe related questions were answered in a Slack conversation today. I thought I'd try to reply to the questions that I'm able to.

Is it ok to cook up MDC logs for the downloads for which we don't have all the info above? Rather than dropping them? This may have been actually answered, in that "best effort" discussion way above; but I'm just making sure. Sounds like yes, I am cooking up MDC entries, for them with "-" in place of the missing values above for these, correct?

Yes, the only technical requirement is knowing the month in which the usage happened.

What exactly am I doing for the downloads for which we don't have ANY information at all? These are the prehistoric, DVN-period downloads for which we only know the counts per file, not even dates. There's no way to retrieve that information. These constitute only ~5% of all the counts (~700K out of 15M). But there are probably specific files for which these legacy^2 downloads constitute most of the counts. So I don't think we want to drop them outright. But then the only way to get them counted w/ MDC would be to make up some fake dates for them. Up in the issue there's a quote from the MDC spec that says that everything in that CP log file is "optional". But then CP logs themselves are dated (there's one log file per day); so basically I would need to pick some fake date in that "pre-historic" range, and place that download into the CP log for that day. This feels pretty... sketchy - but I can't think of a better way. And I am open to suggestions.

I think @djbrooke and I agreed that the downloads with no dates would be dropped (they wouldn't be reported to MDC and wouldn't populate the datasetmetrics table), instead of assigning dates to them. At least, I was convinced by an email conversation with Martin Fenner not to assign dates to them; he wrote that knowing usage dates is especially important. The more accurate the usage information available to the MDC folks (and the rest of the community), the better research they can do into how datasets are being used. It's already not ideal that we'll be sending "legacy" counts that will include double clicks (among other things that ideally would be excluded or labelled). But we were especially discouraged from faking or estimating the dates.

Again, this may have been clear to everybody else but me, but to confirm: Even for the downloads for which all the extra information can be obtained from the logs, the MDC counts may end up lower than the previously maintained legacy counts. Especially so, actually - since MDC uses that extra information to identify some downloads as "double clicks" and such, and drop them from the counts. Provided the above is correct, is everybody ok with this? It kinda sounds like at least some people were actually viewing these numbers as meaningful and important... are they going to be ok seeing them go down?

Agreed, we know of at least a handful of groups using Harvard Dataverse who are viewing these numbers as meaningful and important. We spoke with MDC/California Digital Library folks about communicating with depositors who ask about their decreased numbers, and they said the people who've contacted them to ask about the change were okay with the explanation that it's part of an effort to get more accurate and standardized counts. How best to communicate that, especially to the groups we know pay attention the counts a lot, was something we'd figure out later (maybe informed by what the changes look like).

QDR/@qqmyers's solution (#6543, show legacy and MDC metrics, both) was considered, but rejected. (IMHO - I feel like it's worth reconsidering).

I think learning more about how people use these numbers now and how different solutions would affect that use would help. Maybe that's been done and the grant driven deadlines meant there hasn't been time for communicating the reasoning behind the solution we're moving forward with?

djbrooke commented 4 years ago

Thanks @jggautier - regarding the multiple counts, I understand it's a solution, but it's confusing to show multiple counts. Places like Dryad, DataOne, and Zenodo* show one count and I'm sure we can figure this out.

I think, but it's down right now ;)

landreev commented 4 years ago

@qqmyers Please scratch what I said earlier, about guest/anonymous downloads, and session ids. I was actually right the first time around: when you download a public file without supplying either an API token, or jsession, there is no session id in the MDC log. The session id is only there when it's a web browser guest user session - where each guest user gets assigned a jsession id.

So yes, CP does need the real ip address in the log file, to be able to tell that these anonymous downloads are not all coming from the same host.

landreev commented 4 years ago

@qqmyers @pdurbin I'm seeing some slightly inconsistent behavior in how file downloads are logged in the MDC log: If I download the file from the UI page, the value in therequest_url and target_url fields in the log file end up being "/api/access/datafile/NNN" But if I access the same file via the API, it's the full URL that's logged in these fields: "https://dataverse.host.edu/api/v1/access/datafile/NNN" or "https://dataverse.host.edu/api/v1/access/datafile/:persistentId?persistentId=doi:xxx/yyy/..."

Once CP runs on these logs and the resulting JSON is imported - is Dataverse going to be ok with all these different ways of referencing the same file? (is that why we are logging the storage identifier there, and not the filename btw?)

I could try and read the code of that addUsageMetricsFromSushiReport api, but figured I would ask first.

landreev commented 4 years ago

@qqmyers @pdurbin I know this was discussed in the past - but I can't find any references to it: since we have 2 prod. servers, there are two sets of counter_* log files. Do they need to be merged, before Counter Processor is run on them? Or can CP be run separately on the two servers, and can the resulting SUSHI files be separately imported?

qqmyers commented 4 years ago

I don't know why the api and UI should be different w.r.t. including the host, but CP processes based on regexs and the default one from the guides has ^.*/api/v1/datasets as one entry - which I think gets both. If not, this can be managed with a change to CPs config file.

Wr.t. two logs, I think they'll need to be merged. One issue is that when the json in imported to DV, any entries for that month are deleted/over written so only one log will survive. I also don't think CP would have any way to detect duplicates across those two logs without merging - CP does keep state, but I don't know that it can aggregate state from two logs. There's also probably an issue with DataCite (if that's enabled) - CP calls that directly and while I think it handles daily updates (overwriting partial versions of the same month), I don't think it will handle two logs. If there is a way to get CP to aggregate state across two logs, you'd still need to assure that only the second run writes to DV and DataCite. One thing I don't know is whether CP can only handle chronological entries - it is creating a db so if you just appended the two logs for the same day into one, it might be OK with that - it may read and store the info and then do it's processing to detect duplicates. Not sure, but it might be worth a test before trying to sequence log entries.

landreev commented 4 years ago

@qqmyers

One issue is that when the json in imported to DV, any entries for that month are deleted/over written ...

Bummer - I was hoping it could be added to what's already there. Not a problem at all to merge the 2 logs. Just another extra step.

landreev commented 4 years ago

@qqmyers I'll merge them sorted by the time stamp, to be safe.

landreev commented 4 years ago

@qqmyers @pdurbin Wanted to run one other quick thing by you:

There's also probably an issue with DataCite (if that's enabled) - CP calls that directly ...

I've been operating under the assumption that even on an important (production) server, if anything gets screwed up with this MDC processing setup, there's an easy way to undo everything. By just dropping the datasetMetrics database table. And then rebuilding from scratch. Is this correct by itself? And, is it still correct with DataCite in the mix? - We would not need to "undo" or fix anything on their side, if we want to rebuild all the metrics. It will simply be overwritten in their database when we submit the regenerated stats - right? Thank you.

qqmyers commented 4 years ago

I think so. DataCite allows overwriting to update with new daily counts, but it relies on some report id being kept. I think DataCite will also replace a month if you start over with reprocessing, but I'm not certain. Regardless - I think the API exists to delete reports, etc. So there shouldn't be anything you can't undo - just a question if there's a manual step. I think Phil had some developer notes about MDC/CP processing that talks about it keeping state and what to do about DataCite if you reprocess (could be from the CP readme, but I think its in the DV dev guide.)

qqmyers commented 4 years ago

One other note - for the mdc logs themselves, in 4.17 and fixed somewhere later there was a problem that the mdc logs were getting entries for draft access/downloads and those entries lacked a doi in one of the columns which makes CP fail. For QDR, I removed all entries with ':draft' in them and then had to look for libcurl entries where there was no doi (not too many so I didn't try to come up with a regex pattern - the libcurl entries that were problematic had three blank fields before them instead of 2 which I could check visually - same underlying issue - draft accesses reported but no doi sent to the log). For any processing of the mdc logs themselves, you'll need to remove those.

Relatedly, I think the plan is to not report draft entries, so you should try to get rid of those in processing the other logs.

landreev commented 4 years ago

Thank you. The issue in the last comment - I'm not going to encounter any of these broken draft access entries in the MDC logs, because I've only enabled MDC now, under 4.20 (correct?)

qqmyers commented 4 years ago

Yes - that's right.

pdurbin commented 4 years ago

Sorry, I don't have any notes about reprocessing. One thing to remind ourselves of is that in our architecture, Counter Processor (not Dataverse) sends data (SUSHI reports) to DataCite. Here's a diagram:

What we're doing on the Dataverse side is process these same SUSHI reports (JSON format) in order to populate the datasetmetrics table. I've always felt like it should be safe to clear out this table if need be.

State is maintained in Counter Processor in a couple different formats. Here's a note Matthew left in the dev guide at http://guides.dataverse.org/en/4.20/developers/make-data-count.html ...

A developer running Counter Processor alongside Dataverse for development or testing purposes will notice that once the raw Dataverse logs have been processed, there is no straightforward way to re-test those same logs.

The first thing to fix is to clear two files from Counter Processor state folder, statefile.json and counterdb[yyyy-mm].sqlite3

... so it's good to be aware of these "state" files.

jggautier commented 1 year ago

Just an update based on recent discussion in the GREI "Use Metrics" committee: DataCite is working on a simpler method for sending metrics (requests and investigations).

During a recent GREI Open Metrics committee meeting, a DataCite representative recommended that those repositories that haven't been reporting these use metrics, such as Harvard's repository, wait until that simpler method is finished being tested and has been released for use.

This new method is being called a usage tracker and it involves adding a snippet of code on each dataset page. I don't know much else, including how it would be involved in converting a Dataverse repository's "classic" download counts to COUNTER compliant metrics.

@pdurbin do you know if the development of this usage tracker is described anywhere online yet?

pdurbin commented 1 year ago

@jggautier from the notes you may have noticed https://portal.productboard.com/71qotggkmbccdwzokuudjcsb/c/62-usage-stats-processing-as-a-service which I believe is the item on DataCite roadmap to watch for development of the usage tracker.

I believe the code is here: https://github.com/datacite/datacite-tracker

As you know, this issue is related, and (I think) more about setting up MDC for Harvard Dataverse (the old way at least, with log processing... maybe we need a fresh issue for the usage tracker):

IQSS/dataverse#3

(This issue IQSS/dataverse#75 is a bit more of ensuring that the download counts aren't zero when we switch from the database counts to the MDC counts. This is something to get right for either solution, log processing or usage tracker. Another option is to show both counts, which is what https://github.com/IQSS/dataverse/issues/6542 is about.)

mreekie commented 1 year ago

Priority review with stefano:

Added to the ordered backlog
Moved from NIH Deliverables Backlog to Ordered Backlog

pdurbin commented 1 year ago

Which MDC option do we plan to use for Harvard Dataverse? Backend log processing? Frontend/Javascript tracker?

mreekie commented 1 year ago

sizing:

this will involve a discussion with the right people to put a stake in the ground on which of the three options to take.
The right people include: Julian, Leonid, Phil.
Julian is aware of the right people to talk to and can bring them to the table.
- Daniella from Dryad - they started using the new metrics.
- sized at a 10

IQSS / dataverse.harvard.edu

Spike: finalize the plan for transition to Make Data Count, how to display the metrics, how to handle legacy counts #75