QualitativeDataRepository / dataverse

A data repository framework to share and publish research data.
http://dataverse.org

Periodically scan access & resolution logs for requested but non-existent pages and DOIs #50

Closed adam3smith closed 3 years ago

adam3smith commented 4 years ago

We want to capture both unpublished/non-existent DOIs that are linked to and unpublished URLs, so we need two different data sources:

  1. For DOIs, rely on DataCite's resolution reports: https://stats.datacite.org/resolutions.html
  2. For URLs, query our Dataverse's log

Let's try to get something approximately along these lines:

qqmyers commented 3 years ago

As noted in https://github.com/QualitativeDataRepository/TechnicalTeam/wiki/11-23-2020-Tech-Team-Report, DataCite doesn't have an API for failed resolutions; it shows only the top 10 on the web page and is currently 4 months behind in showing them. That said, the underlying HTML file for the report isn't very complex, and it should be easy to find the table row for GDCC.SYR-QDR and extract the top-10 resolution failures.
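As a rough illustration of the "find the table row" idea, here is a minimal sketch using only Python's standard library. The report's real HTML layout is not described in this thread, so the sketch assumes only that the client ID appears somewhere inside a `<tr>`; the names `RowCollector` and `rows_mentioning` are hypothetical and are not taken from the actual script.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed HTML


def rows_mentioning(html_text, client_id):
    """Return every table row in which some cell contains client_id."""
    parser = RowCollector()
    parser.feed(html_text)
    parser.close()
    return [row for row in parser.rows
            if any(client_id in cell for cell in row)]
```

Calling `rows_mentioning(html_text, "GDCC.SYR-QDR")` would return the matching rows as lists of cell text. How the DOIs and hit counts are laid out within those cells would still have to be checked against the real file.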

For the Dataverse side, I think it will be easier to make Dataverse log resolution failures separately than to parse the logs.

qqmyers commented 3 years ago

Dev has the new reporting and I have a basic Python script to send an email with any failures. The script isn't yet filtering out 'obviously bad' PIDs and I wanted to check the rules:

Any other authorities to watch for?

The script is currently set to run on the first of the new month and send the report for the completed prior month. It could send an interim report, or I can adjust to send the prior ~30 days on the 15th, etc. if that's better (slightly more complex than just using the month boundaries).
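For reference, the month-boundary version is simple to compute. The following is a minimal sketch, not the deployed script; the function names `prior_month_range` and `report_label` are made up for illustration.

```python
from datetime import date, timedelta


def prior_month_range(today):
    """Return (first_day, last_day) of the calendar month before `today`."""
    first_of_this_month = today.replace(day=1)
    # Stepping back one day from the 1st always lands in the prior month.
    last_of_prior = first_of_this_month - timedelta(days=1)
    return last_of_prior.replace(day=1), last_of_prior


def report_label(today):
    """Label for the completed prior month, in the YYYY-MM form used in the subject line."""
    start, _ = prior_month_range(today)
    return start.strftime("%Y-%m")
```

Run on 2020-12-01, `report_label` gives `2020-11`, matching the subject line below. A "prior ~30 days on the 15th" variant would replace the month arithmetic with a fixed `timedelta` window.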

Basic format is below. Easy to change.

Subject: PID Failure Report for 2020-11

Hits    DOI URI
(Note: clicking links will record new failures unless these are drafts)

6   doi:10.33564/FK27U7YBVA https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.33564/FK27U7YBVA
2   doi:10.33564/FK2PP7U7Y  https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.33564/FK2PP7U7Y
1   doi:10.33564/FK27U7YBVQ https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.33564/FK27U7YBVQ

Details:

doi:10.33564/FK27U7YBVA
    GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T00:05:53+0000

    GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T00:05:53+0000

    GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T18:59:06+0000

    GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T18:59:06+0000

    GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T18:59:12+0000

    GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T18:59:12+0000

doi:10.33564/FK2PP7U7Y
    GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T18:59:26+0000

    GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T18:59:26+0000

doi:10.33564/FK27U7YBVQ
    GET /api/v1/datasets/:persistentId from 172.31.39.206 at 2020-11-25T00:05:39+0000
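
The layout above can be produced by grouping failure records by PID and sorting by hit count. This is a sketch under assumptions: the record layout `(pid, method, path, ip, timestamp)` and the name `build_report` are hypothetical, and the base URL is simply the dev URL seen in the sample.

```python
from collections import defaultdict

# Dev URL prefix copied from the sample report; prod would differ.
BASE_URL = "https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId="


def build_report(month, records):
    """Build the email text from (pid, method, path, ip, timestamp) tuples."""
    by_pid = defaultdict(list)
    for pid, method, path, ip, timestamp in records:
        by_pid[pid].append(f"    {method} {path} from {ip} at {timestamp}")

    # Most-hit PIDs first; ties keep their first-seen order (stable sort).
    ranked = sorted(by_pid.items(), key=lambda item: len(item[1]), reverse=True)

    lines = [
        f"Subject: PID Failure Report for {month}",
        "",
        "Hits\tDOI\tURI",
        "(Note: clicking links will record new failures unless these are drafts)",
        "",
    ]
    for pid, hits in ranked:
        lines.append(f"{len(hits)}\t{pid}\t{BASE_URL}{pid}")
    lines += ["", "Details:", ""]
    for pid, hits in ranked:
        lines.append(pid)
        lines.extend(hits)
    return "\n".join(lines)
```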

qqmyers commented 3 years ago

Parsing the DataCite file gives a report like the following examples (the first when there are one or more months of new info, the second when no new reports have been posted). Changes in their HTML will break the report, but the parsing is fairly easy: they keep the large HTML file representing the table of all results separate from the interactive part of the webpage. Note that the script reads the file, unzips it, and creates a temporary ~6MB file that it then parses, looking for the GDCC.SYR-QDR section.

Report for 05_2020

Hits    DOI URI
(Note: clicking links will record new failures unless these are drafts)

10  doi:10.5064/F6UURYON    https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6UURYON
6   doi:10.5064/FLAT    https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/FLAT
2   doi:10.5064/ABC https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/ABC
2   doi:10.5064/    https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/
2   doi:10.5064/F6TB14TB    https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6TB14TB
2   doi:10.5064/F6NNOI5C/   https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6NNOI5C/
2   doi:10.5064/------------------- https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/-------------------
2   doi:10.5064/F6UMRNAC).%20WHEN%20THE%20AMERICAN%20GOVERNMENT%20WAS%20SERIOUS%20ABOUT%20PROMOTING%20OUTWARD%20INVESTMENT,%20OFFICIALS%20DISCUSSED%20OTHER%20INITIATIVES,%20LIKE%20TARGETED%20TAX%20CONCESSIONS%20OR%20POLITICAL%20RISK%20INSURANCE.%3C/P%3E%3CP%3EYET%20A%20BELIEF%20THAT%20INVESTMENT%20TREATIES%20PROMOTED%20INVESTMENT%20WAS%20CULTIVATED,%20ESPECIALLY%20AMONG%20POTENTIAL%20TREATY%20PARTNERS.%20AFTER%20MEETINGS%20IN%20SIX%20EUROPEAN%20COUNTRIES%20IN%201977,%20AN%20AMERICAN%20OFFICIAL%20MARVELED%20AT%20THE%20DUAL%20MESSAGES:%20%E2%80%9CTHE%20%E2%80%98PROTECTION%E2%80%99%20FUNCTION%20OF%20THE%20%5BTREATIES   https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6UMRNAC).%20WHEN%20THE%20AMERICAN%20GOVERNMENT%20WAS%20SERIOUS%20ABOUT%20PROMOTING%20OUTWARD%20INVESTMENT,%20OFFICIALS%20DISCUSSED%20OTHER%20INITIATIVES,%20LIKE%20TARGETED%20TAX%20CONCESSIONS%20OR%20POLITICAL%20RISK%20INSURANCE.%3C/P%3E%3CP%3EYET%20A%20BELIEF%20THAT%20INVESTMENT%20TREATIES%20PROMOTED%20INVESTMENT%20WAS%20CULTIVATED,%20ESPECIALLY%20AMONG%20POTENTIAL%20TREATY%20PARTNERS.%20AFTER%20MEETINGS%20IN%20SIX%20EUROPEAN%20COUNTRIES%20IN%201977,%20AN%20AMERICAN%20OFFICIAL%20MARVELED%20AT%20THE%20DUAL%20MESSAGES:%20%E2%80%9CTHE%20%E2%80%98PROTECTION%E2%80%99%20FUNCTION%20OF%20THE%20%5BTREATIES
2   doi:10.5064/LEGGINGS    https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/LEGGINGS
2   doi:10.5064/F6TB14TP%E2%80%8B)  https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6TB14TP%E2%80%8B)

Report for 06_2020

Hits    DOI URI
(Note: clicking links will record new failures unless these are drafts)

6   doi:10.5064/F6HY9R1Z    https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6HY9R1Z
5   doi:10.5064/F6EOZGLB    https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6EOZGLB
5   doi:10.5064/F6C8VUHP    https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6C8VUHP
3   doi:10.5064/959284  https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/959284
2   doi:10.5064/ABCD    https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/ABCD
2   doi:10.5064/F6TLRTD9    https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6TLRTD9
2   doi:10.5064/F6G44N6%20S https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6G44N6%20S
2   doi:10.5064/20831862.1059210    https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/20831862.1059210
2   doi:10.5064/F68G8HMM.   https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F68G8HMM.
2   doi:10.5064/01.3001.0010.7539   https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/01.3001.0010.7539

or

No new monthly reports from DataCite. Next report expected: 07_2020
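
The choice between the two outputs above comes down to comparing the last month already reported with the months DataCite currently lists. Here is a minimal sketch of that bookkeeping, assuming the `MM_YYYY` labels seen in the examples. How the real script remembers the last reported month is not stated here, and the function names are hypothetical.

```python
def _sort_key(label):
    """Turn 'MM_YYYY' into (year, month) so labels sort chronologically."""
    month, year = label.split("_")
    return int(year), int(month)


def next_label(label):
    """Return the label of the month following 'MM_YYYY'."""
    year, month = _sort_key(label)
    if month == 12:
        year, month = year + 1, 1
    else:
        month += 1
    return f"{month:02d}_{year}"


def new_months(last_reported, available):
    """Labels in `available` that are newer than `last_reported`, oldest first."""
    newer = [lab for lab in available if _sort_key(lab) > _sort_key(last_reported)]
    return sorted(newer, key=_sort_key)


def status_line(last_reported, available):
    """Return the 'nothing new' message, or None when there are months to report."""
    if new_months(last_reported, available):
        return None
    return ("No new monthly reports from DataCite. "
            f"Next report expected: {next_label(last_reported)}")
```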

adam3smith commented 3 years ago

  • must start with "10.5064/F6" ? and
  • include exactly 6 characters after that?

seems right, yes. I don't think there's any need to make this unnecessarily restrictive -- I just want obvious junk filtered out to not be distracting
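
A filter along the lines of the two rules quoted above could be as small as one regular expression. This is only a sketch of the proposed rule: restricting the six trailing characters to uppercase letters and digits is an assumption, and the name `is_plausible_qdr_doi` is hypothetical.

```python
import re

# "10.5064/F6" followed by exactly six characters; the character class
# [A-Z0-9] is an assumption, not something stated in the rules.
PLAUSIBLE_QDR_DOI = re.compile(r"(?:doi:)?10\.5064/F6[A-Z0-9]{6}")


def is_plausible_qdr_doi(pid):
    """True if the whole string looks like a QDR DOI under the rules above."""
    return PLAUSIBLE_QDR_DOI.fullmatch(pid) is not None
```

Against the sample DataCite report, this keeps entries such as `doi:10.5064/F6UURYON` and drops `doi:10.5064/FLAT`, `doi:10.5064/F6NNOI5C/`, and the long URL-encoded junk.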

For the emails -- would this be a single email or two? (Either is fine.) For DataCite, could the report be preceded by an explanatory sentence?

Attempts to resolve plausible QDR DOIs as reported by DataCite

qqmyers commented 3 years ago

Reports are now running on prod monthly and sending email to multiple recipients. For now, there is no filtering and all DOIs are shown. DataCite still has not updated statistics since June, but the first report after they update should include all months that were added. Dataverse now creates a new monthly log reporting PID failures. After the initial deployment, changes were made to this log to escape \t\r\n chars (things that break parsing the log as a TSV file). I added the scripts to our dataverse repo (with emails/username/pass removed) in the conf/qdr/pidreporting dir. I have local copies with the sender and receiver emails, and we have encrypted copies of the AWS credentials needed.
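
To show why the escaping matters, here is a small Python illustration of the idea. The actual change lives in Dataverse itself, not in this snippet. Escaping the backslash as well is an added assumption (it keeps the encoding unambiguous), and `escape_field` and `to_tsv_line` are hypothetical names.

```python
# Characters that would split or break a TSV record, plus the backslash
# itself (an assumption) so escaped output can be decoded unambiguously.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\r": "\\r", "\n": "\\n"}


def escape_field(value):
    """Replace tab, CR, LF and backslash with two-character escape sequences."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def to_tsv_line(fields):
    """Join escaped fields with tabs so each record stays on one line."""
    return "\t".join(escape_field(field) for field in fields)
```

With this, a requested PID containing a raw tab or newline still yields exactly one line with the expected number of columns, so the monthly log can be split on tabs safely.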

Unless further changes are desired, this should run monthly without any intervention. We may want to manually check DataCite periodically to see if they've updated. (If they change their file format when they update new stats, presumably our script will break rather than just not reporting the new info, but some changes might just look like no new results.)