IQSS / dataverse

Open source research data repository software
http://dataverse.org

Implement Backend Support for Make Data Count use and citation metrics #4821

Closed adam3smith closed 5 years ago

adam3smith commented 6 years ago

Following up on the google group discussion here: https://groups.google.com/forum/#!topic/dataverse-community/rQWNllAyTu0

Dataverse should support and display Make Data Count (https://makedatacount.org/) standardized usage metrics.

Slides and QA from recent (July 2018) webinar here: https://makedatacount.org/presentations/

Here are detailed guidelines for implementation: https://github.com/CDLUC3/Make-Data-Count/blob/master/getting-started.md

Here are the steps to implement from the project, taken from an earlier presentation.

mfenner commented 6 years ago

Feel free to ask any questions DataCite staff can help with here.

pdurbin commented 6 years ago

Here are some notes taken during a meeting on 2018-10-18: https://docs.google.com/document/d/1eM4rAuhmR4ZQxJC_PTE0rq2x7N3aNEjMN7QVvpkY1os/edit?usp=sharing

djbrooke commented 6 years ago

We'll determine how these metrics appear on the page as #3404 moves through our design process, but there's an opportunity to get the backend pieces in place. Some proposed steps for discussion and estimation:

This will position us well for implementation once we have the designs further along and validated.

pdurbin commented 6 years ago

I met @mbjones at Whole Tale Workshop on Tools and Approaches for Publishing Reproducible Research and he mentioned he'd be happy to field technical questions we have about DataONE's implementation of Make Data Count.

Meanwhile, DataONE put out a blog post at https://www.dataone.org/news/new-usage-metrics that has some nice screenshots of a dataset at https://search.dataone.org/view/doi:10.5063/F1Z899CZ which I'll put below:

dataone_implements_new_usage_and_citation_metrics_to_make_your_data_count_dataone_-_2018-11-13_14 47 21

mbjones commented 6 years ago

Happy to help, @pdurbin. The time series graphs you cited were made much faster by caching results locally and then enabling group by at various levels of aggregation. The d3-charts we build and other visualizations are all part of our open source MetacatUI data portal frontend, so you might find some of that reusable.

pdurbin commented 6 years ago

@mbjones thanks. Is there any reusable Java we might be interested in as well?

All, at standup today I said I was close to pushing some docs that capture my understanding of what we're trying to implement. These docs are in 4dd10bd but I'll add them as a screenshot below as well. I also stubbed out some API tests but nothing has been implemented yet. It's all just stubs. Feedback is welcome.

make_data_count_ _dataverse org_-_2018-11-19_16 31 47

pdurbin commented 6 years ago

Here's a to do list of tasks that are top of mind for me.

I also wanted to note that I set up a Jenkins job to build the guides from the branch I'm using to http://guides.dataverse.org/en/4821-make-data-count/admin/make-data-count.html

I asked the Dataverse community for feedback at https://groups.google.com/d/msg/dataverse-community/rQWNllAyTu0/RMD0GEFzAgAJ

mbjones commented 6 years ago

@pdurbin We didn't implement this in Java, so no Java code to share there. We have an index processor and metrics service in python if you have interest in that.

Also, reading your document, I see one little difference from the DataONE interpretation when defining Views and Downloads. Like us, you are using the terminology “Views” and “Downloads” over “Investigations” and “Requests”. So, we should be sure we are using those the same way. I think our implementation is Downloads == Requests, and Views = Investigations - Requests, whereas you seem to state that Views == Investigations. We made Views be the difference so that they were independent metrics -- Views basically represent how many times the landing page has been looked at or the metadata was accessed, whereas Downloads is how many times all or part of the data was accessed. Does that make sense to you?
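The two interpretations Matt describes can be made concrete with a little arithmetic. A hypothetical sketch (function names are illustrative, not from either codebase):

```python
def dataone_metrics(investigations: int, requests: int) -> dict:
    """DataONE interpretation: Views and Downloads are independent.

    Views count only landing-page/metadata accesses; Downloads are the
    data retrievals (Requests)."""
    return {"views": investigations - requests, "downloads": requests}


def overlapping_metrics(investigations: int, requests: int) -> dict:
    """Alternative reading: Views == Investigations, so every Download
    is also counted as a View."""
    return {"views": investigations, "downloads": requests}


# 100 total investigations, 30 of which were data requests:
print(dataone_metrics(100, 30))      # {'views': 70, 'downloads': 30}
print(overlapping_metrics(100, 30))  # {'views': 100, 'downloads': 30}
```

The difference only matters for display; both start from the same COUNTER Investigations and Requests counts.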

pdurbin commented 6 years ago

@mbjones I'm not sure what an index processor is but I know we both use Solr so I guess I'll take a link to that code as well as the metrics service if it's not too much trouble.

I understand what you're saying about the meaning of "views" but let me read the specs and talk to others on my team before I respond. I'm also curious what DASH does. We've always shown downloads in Dataverse but the idea of showing views and citations is new to us. Thanks for the feedback!

mbjones commented 6 years ago

Sorry about the confusion over 'index processor' -- that is our component that takes our raw usage logs from Apache and other sources and processes them to insert usage events into our ElasticSearch index, which we then use to send stats to DataCite. It's pretty well customized to DataONE, so probably not a lot of general utility except as an example.

djbrooke commented 6 years ago

Thanks @pdurbin - the process in the doc makes sense to me. I added a small comment/question and I'm interested in the thoughts from @mfenner and the rest of MDC team and also @scolapasta and others on the technical implementation.

@mbjones thanks for the feedback here as well!

mbjones commented 6 years ago

@pdurbin @djbrooke in the list above, you asked:

If Dataverse can express data citations, can the DataCite hub receive them?

The answer is yes, but not the same way as usage metrics. DataCite already supports linking to publications in the DOI-related metadata that you submit with your DOIs using the <relatedIdentifier> element. See the DataCite EventData Guide. These publication linkages are parsed and added to the EventData source. So, the "hub" is used for reporting Investigations and Requests as counts, whereas every individual citation event is reported in the DOI metadata.
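For reference, a citation link expressed via the `<relatedIdentifier>` element in DataCite metadata looks roughly like this (a hand-written fragment following the DataCite schema; the DOI value is made up):

```xml
<relatedIdentifiers>
  <relatedIdentifier relatedIdentifierType="DOI" relationType="IsCitedBy">
    10.1234/example-article-doi
  </relatedIdentifier>
</relatedIdentifiers>
```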

The only problem we have with this approach is that it is completely DOI-centric, and we have many data sets that are not identified with DOIs. I think you also have some with Handles, right? In any case, I'd love to have an API for collating citations for any identifier type, including Handles, ARKs, UUIDs, CURIEs, etc.

jggautier commented 6 years ago

Question: Can Dataverse express data citations? Can "Related Publications" be used?

Dataverse has a related dataset field (mentioned in https://github.com/IQSS/dataverse/issues/5277), although DataCite expects identifiers, and the related dataset field is a free text field. We expect "Related Publication" to be used mostly for text-based publications (and that's how it's mapped to the DDI exports).

If we use Related Publications for datasets as well:

If Dataverse can express data citations, can the DataCite hub receive them? In 4dd10bd I only talk about sending views/investigations and downloads/requests. Ask @mfenner

Discussions about Dataverse sending <relatedIdentifier> metadata to DataCite are in https://github.com/IQSS/dataverse/issues/2917 and https://github.com/IQSS/dataverse/issues/2778.

(I think these important questions are more about Dataverse being able to contribute to the quality of Make Data Count's citation metrics, and less about implementing backend support for sending/receiving usage counts and receiving citation counts.)

pdurbin commented 6 years ago

Before Thanksgiving I was telling @djbrooke that I hoped the "counter-processor" Python code from CDL would "just work" with our Apache logs, but it turns out that the logs must contain data repository-specific fields like title, publisher, author, etc. Example logs look like this on a single line:

2018-05-08T00:00:40-07:00 128.195.188.234 - - - http://dash.lib.uci.edu/stash/dataset/doi:10.7280/D1H01B doi:10.7280/D1H01B - - uci-google-search-appliance (Enterprise; T4-CGX5LF9EL8JCP; eus@uci.edu,jkreuzig@uci.edu,mehrenbe@uci.edu) Mustard Removal Experiment at Bayview Slope UC Irvine grid.266093.8 Riley Pratt|Jessica Pratt|Jenny Talbot|Stephanie Kivlin|Margaret Royall-Reed|Steven D. Allison 2015-04-14T11:00:46Z 1 - https://dash.lib.uci.edu/stash/dataset/doi:10.7280/D1H01B 2015

And they look like this as key/value pairs:

event_time: '2018-05-08T00:00:40-07:00'
client_ip: 128.195.188.234
session_cookie_id: '-'
user_cookie_id: '-'
user_id: '-'
request_url: http://dash.lib.uci.edu/stash/dataset/doi:10.7280/D1H01B
identifier: doi:10.7280/D1H01B
filename: '-'
size: '-'
user-agent: uci-google-search-appliance (Enterprise; T4-CGX5LF9EL8JCP; eus@uci.edu,jkreuzig@uci.edu,mehrenbe@uci.edu)
title: Mustard Removal Experiment at Bayview Slope
publisher: UC Irvine
publisher_id: grid.266093.8
authors: Riley Pratt|Jessica Pratt|Jenny Talbot|Stephanie Kivlin|Margaret Royall-Reed|Steven
  D. Allison
publication_date: '2015-04-14T11:00:46Z'
version: '1'
other_id: '-'
target_url: https://dash.lib.uci.edu/stash/dataset/doi:10.7280/D1H01B
publication_year: '2015'
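Assuming the raw log line is tab-separated in the field order shown in the key/value listing above (an assumption on my part; counter-processor's own docs are authoritative), parsing a line back into named fields might look like:

```python
# Field order as shown in the key/value listing above; assumed to match
# the order of fields in the tab-separated raw log line.
FIELDS = [
    "event_time", "client_ip", "session_cookie_id", "user_cookie_id",
    "user_id", "request_url", "identifier", "filename", "size",
    "user-agent", "title", "publisher", "publisher_id", "authors",
    "publication_date", "version", "other_id", "target_url",
    "publication_year",
]


def parse_log_line(line: str) -> dict:
    """Split one raw usage-log line into named fields ('-' means empty)."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))


sample = "\t".join([
    "2018-05-08T00:00:40-07:00", "128.195.188.234", "-", "-", "-",
    "http://dash.lib.uci.edu/stash/dataset/doi:10.7280/D1H01B",
    "doi:10.7280/D1H01B", "-", "-", "some-user-agent",
    "Mustard Removal Experiment at Bayview Slope", "UC Irvine",
    "grid.266093.8", "Riley Pratt|Jessica Pratt", "2015-04-14T11:00:46Z",
    "1", "-",
    "https://dash.lib.uci.edu/stash/dataset/doi:10.7280/D1H01B", "2015",
])
record = parse_log_line(sample)
print(record["identifier"])  # doi:10.7280/D1H01B
```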

I opened https://github.com/CDLUC3/counter-processor/issues/3 to provide a brain dump of my thinking as a potential new user, left a new comment at https://github.com/CDLUC3/Make-Data-Count/issues/99#issuecomment-441779513, and added another to-do item above to capture what I said at standup this morning: that we need to decide on the approach we'd like to take. The assumption is that we'll be parsing logs in a specific format that we teach Dataverse to write. Another approach could be to extend our guestbook model of recording downloads to also record views in the database, but I'm concerned about how much space that would take up.

By the way, thanks to all who have made comments above. I've read them but I'm a little focused on understanding counter-processor at the moment. 😄

pdurbin commented 6 years ago

@scolapasta @landreev @sekmiller @kcondon @mheppler @matthew-a-dunlap and I just had a nice discussion in tech hours about Make Data Count. Here's what I drew on the whiteboard and I'm sorry for the chicken scratch:

img_20181127_160337515

img_20181127_160330387

The most important thing to me was getting consensus on some decisions:

Other items:

Oh, I found a typo in the COUNTER Code of Practice for Research Data and emailed @mfenner about it. I would have made a pull request but it isn't on GitHub. 😄 I'm on page 8. Still reading.

pdurbin commented 5 years ago

I just had a nice chat with @pameyer after standup and I'll make it clear that downloads via rsync will not be reported. He reminded me that in his installation of Dataverse, the download count (part of the "metrics" block) is not present. (This is tied to the :DownloadMethods database setting.)

Pete also reminded me that direct access to data (bypassing Glassfish) is also available in Swift installations. I believe that in the screenshot below, the download count only reflects downloads via Glassfish, not via Swift.

screen shot 2018-11-28 at 12 10 29 pm

In short, I'll update the docs in my branch to indicate that downloads won't be counted for rsync or direct access via Swift.

TaniaSchlatter commented 5 years ago

What I understand from this is that Dataverse won’t be able to track download activity for files accessed via rsync or other methods that bypass Glassfish. This is fine for Pete/SBGrid, which isn’t concerned with tracking anyway, and doesn’t plan to display metrics. When datasets have files that can be accessed via rsync and http, only access via http (or via methods that go through Glassfish?) will be counted.

If so, this implies that we need to be careful about displaying metrics in scenarios where not all access activity is counted equally. The representation of the activity will need to communicate clearly which activity is being counted. This adds UI complexity - more to explain/display - and implies the system needs to track activity in a way that is granular enough that it can be displayed clearly. I don't know if putting views and downloads together as @pdurbin describes will enable the capture of the information needed.


pdurbin commented 5 years ago

I don’t know if putting views and downloads together as @pdurbin describes will enable the capture of the information needed.

@TaniaSchlatter sorry, I wasn't being clear. I'm really only talking about where to record each view or download: in the database or on the filesystem. The decision was to use the filesystem, in a dedicated log. From this log we will be aggregating views and downloads into reports (JSON format) that we send to the DataCite hub. The dedicated log will be rotated and deleted after a year or whatever, so it doesn't take up too much space on disk. What I'm trying to say is that you shouldn't have to worry too much about what goes in this log. I hope this makes sense. I'll swing by to make sure we're on the same page.
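A minimal sketch of what writing to that dedicated log could look like (the path, field layout, and function name here are my own illustration, not the actual Dataverse implementation):

```python
import datetime

LOG_PATH = "/tmp/counter_events.log"  # hypothetical location


def record_event(event_type: str, identifier: str, client_ip: str,
                 user_agent: str, path: str = LOG_PATH) -> None:
    """Append one view/download event as a tab-separated line.

    Aggregation into SUSHI JSON reports for the DataCite hub happens
    later, in a separate pass over this file."""
    timestamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    line = "\t".join([timestamp, event_type, identifier, client_ip, user_agent])
    with open(path, "a") as log:
        log.write(line + "\n")


record_event("view", "doi:10.5072/FK2/EXAMPLE", "127.0.0.1", "curl/7.64")
record_event("download", "doi:10.5072/FK2/EXAMPLE", "127.0.0.1", "curl/7.64")
```

An append-only log like this is cheap to rotate and keeps event recording out of the database entirely.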

You're absolutely correct that we'll only be able to track downloads that are initiated through Dataverse/Glassfish. We won't be tracking rsync downloads. We won't be tracking direct downloads from Swift ("Cloud Storage Access" in the screenshot above). I talked to @jonc1438 in #5213 and it sounds like TRSA downloads will be initiated through Dataverse, so we should be able to track them once the TRSA pull request gets merged.

kcondon commented 5 years ago

@pdurbin Any traction on the idea of an API for remote downloads to record downloads back to Dataverse?

pdurbin commented 5 years ago

@kcondon I chatted with @pameyer about it and it doesn't sound like he's personally interested in implementing it for his rsync server, but we talked about how customer number two of all the rsync stuff might want to work on this. I wasn't planning on adding the API to the Dataverse side until we have someone who's interested in using it. It would basically involve parsing rsync logs, from what I understand. I guess the same would be true of Swift logs? I don't know.

pdurbin commented 5 years ago

After discussion with @mheppler and @jggautier and the comment from @mbjones above, I pushed 17cbf37ee to clarify that we plan to send citations to DataCite as part of the Make Data Count effort. It has been emphasized that citations are the most important thing, and since we can express these in Dataverse under our "Related Dataset" field and DataCite is ready to receive them, we should try to send them. While we're in this part of the code we should also endeavor to send citations for publications as well (details in #2917 and #2778).

jggautier commented 5 years ago

What we think of as citations seems to be one of the three relationship types that the Event Data service is collecting, called "linking events". The other two relationship types are versioning (this dataset I'm depositing IsNewVersionOf/IsPreviousVersionOf another dataset) and granularity (this file is part of this dataset, which Dataverse already sends to DataCite).

Will the "citation counts" that Dataverse receives include all three types of relationships or only certain types? Can Dataverse determine which types of counts it displays? For example, when Dataverse reports "citation counts," it shouldn't include the number of links between a dataset and its files. Could Dataverse exclude that? (I tried seeing what Dash and DataONE do, but haven't found a dataset with a citation count, yet.)

Update: It looks like you can filter certain relationTypes (https://support.datacite.org/v1.1/docs/eventdata-query-api-guide#section-filtering-events-links-by-type), and it recommends certain types to exclude from a citation count.
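To illustrate the filtering described above, a sketch that excludes non-citation relation types from a list of Event Data records before counting (the exclusion list here is my guess at sensible defaults, not the official recommendation from the query-API guide):

```python
# Relation types that express structure/versioning rather than citation;
# these should not inflate a "citation count". (Illustrative list.)
NON_CITATION_TYPES = {
    "IsPartOf", "HasPart",                     # granularity
    "IsNewVersionOf", "IsPreviousVersionOf",   # versioning
}


def citation_count(events: list) -> int:
    """Count linking events whose relationType looks like a citation."""
    return sum(1 for e in events
               if e.get("relationType") not in NON_CITATION_TYPES)


events = [
    {"relationType": "IsCitedBy"},
    {"relationType": "HasPart"},         # file-in-dataset link, excluded
    {"relationType": "IsNewVersionOf"},  # versioning, excluded
    {"relationType": "References"},
]
print(citation_count(events))  # 2
```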

pdurbin commented 5 years ago

At standup yesterday I indicated that I've finished reading the "COUNTER Code of Practice for Research Data" which brings more questions to my mind. Below I'll put the latest to do list. At standup I talked through open items from the original to do list above at https://github.com/IQSS/dataverse/issues/4821#issuecomment-440327789 and I just updated that comment to indicate the latest status. I'm going to duplicate any open items in the list above so that we can have a single new list in this comment. Here goes.

Questions for Make Data Count:

Questions for Dataverse tech hours (or sooner):

To do:

mfenner commented 5 years ago

@pdurbin happy to help answer these questions, but would it make sense to break this down into several issues? Not one issue per item, but the list is so long that it might become difficult to track the responses.

pdurbin commented 5 years ago

@mfenner hi! If @djbrooke hasn't emailed you and @dlowenberg already, he plans to do so soon. I'm sorry that I didn't have a lot of questions back during our meeting on 2018-10-18 (notes) but back then I hadn't watched two webinars, hadn't read the CoP, hadn't read the "getting started" guide. (I still haven't read the SUSHI spec and I suspect that I should.) Now that I'm feeling more up to speed, I think the next meeting will be more productive. You've seen that I now have a list of questions above. 😄 If it's easier for you and I to have a quick separate call, that's fine too. Please let me know what makes the most sense to you. Thanks!

mfenner commented 5 years ago

A call makes a lot of sense and can happen soon (schedule via email). Be aware that we don't have the answers to all your questions.

mfenner commented 5 years ago

Does Dataverse really need to become a harvesting server for reports in SUSHI format?

In the ideal world yes, but the MDC pilot partners are also not doing this yet. So nothing to worry about right now, but keep this in the back of your head.

mfenner commented 5 years ago

Does Dataverse really need to become a harvesting server for reports in TSV format?

Again, this can happen at some point in the future. DataCite will do a CSV conversion of the reports sent to us in JSON format.

mfenner commented 5 years ago

Why does the CoP refer to SUSHI (JSON) and TSV formats but the "getting started" guide links to DataONE examples in XML?

SUSHI reporting is in JSON and/or TSV. The XML is specific to DataONE, they can explain the reasoning behind it.

mfenner commented 5 years ago

I've emailed Martin about two typos in the CoP and they've been fixed (thanks!) but what's the process for giving more extensive feedback on the CoP?

We haven't sorted out the formal process since COUNTER officially took over maintenance of the Code of Practice a few months ago. It is a good question, we will get back to you about this.

mfenner commented 5 years ago

How do you plan to measure non-HTTP downloads such as via rsync?

The Code of Practice is really agnostic about the protocol, as long as you have log entries with timestamp and useragent information.

mfenner commented 5 years ago

What's the likelihood that there will be audits in the future? Is this something we should warn Dataverse installations about?

This is something that is central to COUNTER for journal articles, but nothing is planned yet for dataset usage stats. My guess is that we will not have that discussion until there is more uptake of the Code of Practice, and until we have figured out a way to do audits that are not too resource-intensive.

pdurbin commented 5 years ago

@mfenner thanks for all the answers above! Sorry, but we have even more questions that we'd like to go over with you and @dlowenberg during the call that starts in 10 minutes. Here's the updated list:

https://docs.google.com/document/d/1MlJqQmPMUJyJn_fGMzmL2WjvcJQeu7146FfqFkzEAlg/edit?usp=sharing

pdurbin commented 5 years ago

We just had a meeting after the meeting and here are the notes: https://docs.google.com/document/d/16zURrRqNVdMQ3hQHc3MrcxNq7dRDQC3lSNWMX8t28WM/edit?usp=sharing

matthew-a-dunlap commented 5 years ago

I have generated two stories to move us forward on the work supporting Make Data Count:

Furthermore, there are some additional bits of investigation that can be done to support this work:

pdurbin commented 5 years ago

The architecture drawing on our whiteboard is so messy I decided to make a diagram of our current direction:

make-data-count

ac5e29d7b is the initial commit of this diagram but I'm sure we'll iterate on it.

Also, I sent an email to DataCite with a subject of "Make Data Count, SUSHI, dashboard, JSON Web tokens" to ask for the JSON Web Token we need to start sending SUSHI JSON to the test DataCite hub.

matthew-a-dunlap commented 5 years ago

For what it's worth, I tried switching over our publisher from "client-id" to "dataverse" when creating raw logs for processing by counter-processor. The records were processed fine by counter-processor but rejected by Make Data Count. @mfenner Do you have any guidance on what we should put for our records?

Dataverse: ... "dataverse", "publisher-id": [{"type": "", "value": ""}], ... error: 422 {'Date': 'Fri, 18 Jan 2019 22:03:20 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Status': '422 Unprocessable Entity', 'Cache-Control': 'no-cache', 'Vary': 'Accept-Encoding, Origin', 'Content-Encoding': 'gzip', 'X-Runtime': '0.045917', 'X-Credential-Username': 'datacite.harvard', 'X-Request-Id': 'd8e44670-f640-43ef-bb3b-8f6d7de23658', 'X-Powered-By': 'Phusion Passenger 6.0.0', 'Server': 'nginx/1.15.7 + Phusion Passenger 6.0.0'} application/json; charset=utf-8 {"errors":[[{"#/report-datasets/0/publisher-id/0/type":"The property '#/report-datasets/0/publisher-id/0/type' value \"\" did not match one of the following values: isni, orcid, grid, urn, client-id in schema 7757177d-ae02-5888-8cdf-d748b3fb8616#"}]]}

client-id ... "publisher": "client-id", "publisher-id": [{"type": "", "value": ""}], ...

matthew-a-dunlap commented 5 years ago

Hot off the PIDapalooza presses, I had a good conversation with @kjgarza from DataCite answering a few questions outstanding about our MDC integration.

publisher / publisher-id: The client-id option for publisher is the DataCite client id, which each installation should have (though I'm not sure how we store this in Dataverse). We should be able to pass the same publisher info for each dataset. That being said, this information is not actually used at this point (MDC gets the info out of its own system instead of trusting ours), so we could spoof this as well.

Updating sushi logs during the month: Currently MDC needs the information for all days that have passed in the month on each submission, even if you are submitting daily. In other words, on Jan 9th we have to pass info from Jan 1-9, and then on Jan 10th from Jan 1-10. This is how counter-processor supports the calls. The issue with MDC not updating the log entry each day seems to be due to how we are hacking counter-processor for testing. The first submission to MDC per month should be a POST, and each subsequent call should be a PUT. When we wipe out the state of Counter Processor to test, it (likely) sees the log as a new one and does a POST, which MDC takes but does not actually use to update. The sashimi readme has more info: https://github.com/datacite/sashimi/blob/master/README.md
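The POST-then-PUT behavior could be sketched like this (a toy helper under my reading of the sashimi readme; the hub URL is an assumption and real submission goes through counter-processor):

```python
def plan_submission(report_id, hub_url: str) -> tuple:
    """Return the (HTTP method, URL) for a SUSHI report submission.

    First submission of a month: POST the report to the collection URL.
    Subsequent submissions: PUT the cumulative report (days 1..today)
    to the id the hub assigned on the first POST."""
    if report_id is None:
        return ("POST", hub_url)
    return ("PUT", f"{hub_url}/{report_id}")


hub = "https://api.test.datacite.org/reports"  # assumed test-hub URL
print(plan_submission(None, hub))      # ('POST', '.../reports')
print(plan_submission("abc123", hub))  # ('PUT', '.../reports/abc123')
```

Wiping counter-processor's state resets `report_id` to None, which explains why testing kept producing fresh POSTs instead of updates.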

Log Processing info There is another available processor for raw logs, written by members of the DataCite team https://github.com/datacite/shiba-inu . It looks like we could pipe our logs into this system as well. I think Counter Processor is a better choice for our production flavor and requires less infrastructure.

pdurbin commented 5 years ago

I just pushed 916bd87a7 to stub out a new "Dataset Metrics" heading in the User Guide (and various other reorg): http://guides.dataverse.org/en/4821-make-data-count/user/dataset-management.html#dataset-metrics

Here's a screenshot:

screen shot 2019-01-28 at 1 32 50 pm

I showed a draft to @dlmurphy and he and others are welcome to make improvements.

pdurbin commented 5 years ago

Latest todo list after some whiteboarding with @matthew-a-dunlap and @sekmiller

matthew-a-dunlap commented 5 years ago

Note: this issue is blocked by https://github.com/IQSS/dataverse/issues/4832 or whatever new issue we find to capture the need to convert all PIDs in Harvard Dataverse to DOIs.

pdurbin commented 5 years ago

I just moved pull request #5329 to code review. Kudos to @sekmiller and @matthew-a-dunlap for all the great work on it!

Questions for reviewers to ponder:

Documentation to review:

dlmurphy commented 5 years ago

I noticed something on http://guides.dataverse.org/en/4821-make-data-count/developers/make-data-count.html that looks like someone meant to go back later and add more detail

Under "Testing Make Data Count and Dataverse":

"The first thing to fix is to clear two files from Counter Processor ..."

Was the idea to mention which two files?

kcondon commented 5 years ago

Issues found: [x] 1. Missing single quote at end of command: http://guides.dataverse.org/en/4821-make-data-count/admin/make-data-count.html

curl -X POST 'http://localhost:8080/api/admin/makeDataCount/:persistentId/addUsageMetricsFromSushiReport?reportOnDisk=/tmp/sushi_sample_logs.json

[x] 2. Should use actual report file name in example above:

curl -X POST 'http://localhost:8080/api/admin/makeDataCount/:persistentId/addUsageMetricsFromSushiReport?reportOnDisk=/tmp/make-data-count-report.json'

[x] 3. Clean up Counter Processor installation instructions to suggest installation directory to agree with admin/dev guide suggestion, and potentially correct geoip db location instructions, since both reference counter-processor-0.0.1 sub dir: http://guides.dataverse.org/en/4821-make-data-count/installation/prerequisites.html Change to the Counter Processor directory. cd /home/counter/counter-processor-0.0.1

[x] 4. Following prereq instructions, pip3 was not installed, needed to be installed separately http://guides.dataverse.org/en/4821-make-data-count/installation/prerequisites.html Decision was this would work for many and an admin would figure it out if not.

[x] 5. Dataverse API to extract citation is not working, per Phil.

[x] 6. In some cases, ip addresses (aws) that cannot be resolved to a country by counter processor and by this site: https://www.ip2location.com/demo result in a blank country code in cp report. It appears this can happen when requests are made from the same machine such as on an aws box that has a private, non routable ip address, not the same as loopback 127.0.0.1. Decision was we can ignore this.

[x] 7. Machine access stats are not imported into db from json report, when no country code present in cp json report. Decision was we can ignore this.

[x] 8. Multiple views or downloads in a short time, either via browser or curl, get counted only once, though three were performed. This includes total and unique. This is due to the 30-second double-click detection threshold: all were considered one click.

[x] 9. File downloads are counted both as dataset view and file downloads. This is by design, as described in the spec: The dataset (a collection of data published or curated by a single agent) is the content item for which we report usage in terms of investigations (i.e. how many times metadata are accessed) and requests (i.e. how many times data are retrieved, a subset of all investigations).

[x] 10. Multiple different file downloads from the same dataset result in a correct total but a single unique download. This is because file downloads, even of different files, are all considered to come from the same dataset with respect to uniqueness. Discussed with Matthew; as designed.

[x] 11. Multiple different file downloads from the same dataset result in a single file download uri in the json report file. It appears to be grabbing the first URL; it is not meaningful to us and we can ignore it.

[x] 12. Downloading exported metadata from the ui (metadata tab on dataset page) is not logged as an event.

[x] 13. Fetching dataset export metadata via api does not log as an event, using any download metadata api.

[x] 14. Datacite accepts counts that have empty country-counts, dv db does not. We've decided to record events in dv that do not have an identifiable country to be consistent with Datacite.

[x] 15. Download multiple files by checkbox when processed by cp throws error and does not complete. I'm told this path uses a different api: processing sample_logs/counter_2019-02-27.log Traceback (most recent call last): File "/usr/local/lib/python3.6/site-packages/peewee.py", line 2484, in execute_sql cursor.execute(sql, params or ()) sqlite3.IntegrityError: metadataitem.identifier may not be NULL

[x] 16. Export metadata, either via UI or API logs events but does not result in counts in cp json report.

[x] 17. Export UI metadata formats OAI_ORE and schema.org JSON-LD are not logged.

[x] 18. Download Dataset Metadata as Json native api endpoint is not logged.

[x] 19. Native API List files in a dataset endpoint does not log event.

  20. Cannot upload json to hub, get expected to upload, but got code 500 @pdurbin @matthew-a-dunlap suggested you might help on this one? Tried again today, still 500 on connect to datacite using secrets.yaml. Tried command line and env variable method too but fails silently. Here is cp trace:
    Writing JSON report to /tmp/make-data-count-report.json
    expected to upload, but got code 500
    expected to upload, but got code 500
    expected to upload, but got code 500
    expected to upload, but got code 500
    expected to upload, but got code 500
    expected to upload, but got code 500
    expected to upload, but got code 500
    expected to upload, but got code 500
    ^CTraceback (most recent call last):
    File "main.py", line 45, in <module>
    upload.send_to_datacite()
    File "/home/counter/counter-processor-0.0.1/upload/upload.py", line 50, in send_to_datacite
    response = retry_if_500(method='post', url=my_url, data=data, headers=headers)
    File "/home/counter/counter-processor-0.0.1/upload/upload.py", line 33, in retry_if_500
    time.sleep(1)
    KeyboardInterrupt

[x] 21. Calling export api using localhost logs twice, both as regular and as machine. If using the DNS name, it logs correctly as machine only. This appears to be a bug in Counter Processor. Note: I did not see this behavior when calling download dataset metadata as json with localhost. Update: Cannot reproduce; appears to work correctly.

[x] 22. Need cron jobs for operational config, with notification on error and steps to fix. This may be out of scope for this issue and need a separate ticket. Update, opened as a separate issue: https://github.com/IQSS/dataverse.harvard.edu/issues/3

Left to test:

[x] Check whether unpublished access or draft access is counted. No on both.

[x] Check whether a blank country sent to datacite fails the entire report. It works, and the blank country is present in the report from datacite.

[x] Check whether blank-country events count toward the total count in dv db metrics. No, they do not; tested alone and mixed with country entries.

[x] Test hdls with post to hub off.

[x] Load test data flow with lots of data.

[x] Check metrics api options, e.g. country, date, other.

[x] Retest: ui vs api, view vs. download, regular vs machine, for dataset, metadata, file, multi file, export. Also account for double click (30sec), unique user (1hr), country/no country, localhost/private ip.
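Several of the findings above (items 8 and 10 in particular) come down to COUNTER's deduplication rules. A toy sketch of the 30-second double-click filter, under my reading of the Code of Practice (the real logic lives in counter-processor):

```python
from datetime import datetime, timedelta


def dedupe_double_clicks(timestamps: list,
                         window: timedelta = timedelta(seconds=30)) -> int:
    """Count events for one user+dataset, skipping repeats that occur
    within `window` of the last *counted* event (double-click rule,
    as I understand it)."""
    count = 0
    last_counted = None
    for t in sorted(timestamps):
        if last_counted is None or t - last_counted > window:
            count += 1
            last_counted = t
    return count


t0 = datetime(2019, 2, 27, 12, 0, 0)
clicks = [t0, t0 + timedelta(seconds=10), t0 + timedelta(seconds=20)]
print(dedupe_double_clicks(clicks))  # 1 -- three clicks within 30s count once
```

This matches item 8: three rapid views or downloads collapse to a single counted event, for both the total and the unique metric.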

pdurbin commented 5 years ago

@kcondon I made some of the doc improvements we talked about in 44acd971d

I'm still not quite sure where the JSON SUSHI report should be saved, so I didn't change anything there. Also, as we discussed, the log files are written by Glassfish but need to be read by Counter Processor. I'm open to suggestions about which directories to use. Maybe we can chat a bit more about it.

pdurbin commented 5 years ago

@kcondon ok in 51cfde87e I tried to reconcile the config with the guides so they match.

pdurbin commented 5 years ago

@djbrooke you asked me to leave a couple code comments of decisions made during tech hours and I just did in 1b527aa09 . Can you please take a look?

djbrooke commented 5 years ago

Looks good, thanks. @kcondon was re-verifying a few things and doing some further testing so I'm moving this back to QA. Thanks all for the discussion at tech hours.

pdurbin commented 5 years ago

As discussed with @kcondon @sekmiller and @djbrooke, we plan to revert 1b527aa and store views and downloads even when Counter Processor cannot determine a country based on the IP address (127.0.0.1, 192.168.0.1, 172.16.0.1, 10.0.0.1, etc.). Primarily we decided this because the DataCite hub accepts reports without countries and we don't want the metrics we store in Dataverse to be out of sync with the DataCite hub.

matthew-a-dunlap commented 5 years ago

We are waiting on a new api token to complete testing of this story.