Closed by adam3smith 3 years ago
As noted in https://github.com/QualitativeDataRepository/TechnicalTeam/wiki/11-23-2020-Tech-Team-Report, DataCite has no API for retrieving failed resolutions, shows only the top 10 on its web page, and is currently 4 months behind in posting them. That said, the underlying HTML file for the report isn't very complex, and it should be easy to find the table row for GDCC.SYR-QDR and extract the top-10 resolution failures.
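To illustrate the row lookup described above, here is a minimal sketch using only the standard library. The layout of DataCite's HTML isn't documented in this thread, so the sketch assumes plain `<tr>`/`<td>` rows with the client ID in one of the cells; `RowFinder` and `find_rows` are hypothetical names, not the actual script.

```python
from html.parser import HTMLParser


class RowFinder(HTMLParser):
    """Collect the cell texts of each <tr> and keep rows that mention a marker."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.matches = []
        self._row = None   # cells of the row being read, or None outside a <tr>
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if any(self.marker in cell for cell in self._row):
                self.matches.append(self._row)
            self._row = None
            self._cell = None  # tolerate a missing </td> before </tr>


def find_rows(html, marker="GDCC.SYR-QDR"):
    """Return the cell lists of all table rows containing the marker text."""
    finder = RowFinder(marker)
    finder.feed(html)
    finder.close()
    return finder.matches
```

If the real file nests tables or puts the DOI list in a different structure, the cell handling would need to change, but the "scan rows, keep the one for our client ID" approach should carry over.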
For the Dataverse side, I think it will be easier to make Dataverse log resolution failures separately than to parse the logs.
Dev has the new reporting, and I have a basic Python script that sends an email with any failures. The script isn't yet filtering out 'obviously bad' PIDs, and I wanted to check the rules:
Any other authorities to watch for?
The script is currently set to run on the first of the new month and send the report for the completed prior month. It could send an interim report, or I can adjust to send the prior ~30 days on the 15th, etc. if that's better (slightly more complex than just using the month boundaries).
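For reference, the two windowing options can be sketched roughly as follows. This is an illustration under assumptions, not the script itself: `prior_month_window` and `trailing_window` are hypothetical names, and both return a half-open range (start inclusive, end exclusive).

```python
from datetime import date, timedelta


def prior_month_window(today):
    """Return (start, end) covering the completed month before `today`."""
    first_of_this_month = today.replace(day=1)
    last_of_prior = first_of_this_month - timedelta(days=1)
    return last_of_prior.replace(day=1), first_of_this_month


def trailing_window(today, days=30):
    """Return (start, end) covering the `days` days before `today`."""
    return today - timedelta(days=days), today


start, end = prior_month_window(date(2020, 12, 1))
subject = f"PID Failure Report for {start:%Y-%m}"
```

The month-boundary version needs no stored state; the trailing version would need the report label and log selection to span two monthly log files, which is the extra complexity mentioned above.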
Basic format is below. Easy to change.
Subject: PID Failure Report for 2020-11
Hits DOI URI
(Note: clicking links will record new failures unless these are drafts)
6 doi:10.33564/FK27U7YBVA https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.33564/FK27U7YBVA
2 doi:10.33564/FK2PP7U7Y https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.33564/FK2PP7U7Y
1 doi:10.33564/FK27U7YBVQ https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.33564/FK27U7YBVQ
Details:
doi:10.33564/FK27U7YBVA
GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T00:05:53+0000
GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T00:05:53+0000
GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T18:59:06+0000
GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T18:59:06+0000
GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T18:59:12+0000
GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T18:59:12+0000
doi:10.33564/FK2PP7U7Y
GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T18:59:26+0000
GET /dataset.xhtml from 172.31.39.206 at 2020-11-25T18:59:26+0000
doi:10.33564/FK27U7YBVQ
GET /api/v1/datasets/:persistentId from 172.31.39.206 at 2020-11-25T00:05:39+0000
Parsing the DataCite file gives a report like the following examples (the first when there are one or more months of new info, the second when no new reports have been posted). Changes in their HTML will break the report, but the parsing is fairly easy (they separate a large HTML file representing the table of all results from the interactive part of the webpage). Note that the script reads the file, unzips it, and creates a temporary ~6MB file that it then parses, looking for the GDCC.SYR-QDR section.
Report for 05_2020
Hits DOI URI
(Note: clicking links will record new failures unless these are drafts)
10 doi:10.5064/F6UURYON https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6UURYON
6 doi:10.5064/FLAT https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/FLAT
2 doi:10.5064/ABC https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/ABC
2 doi:10.5064/ https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/
2 doi:10.5064/F6TB14TB https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6TB14TB
2 doi:10.5064/F6NNOI5C/ https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6NNOI5C/
2 doi:10.5064/------------------- https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/-------------------
2 doi:10.5064/F6UMRNAC).%20WHEN%20THE%20AMERICAN%20GOVERNMENT%20WAS%20SERIOUS%20ABOUT%20PROMOTING%20OUTWARD%20INVESTMENT,%20OFFICIALS%20DISCUSSED%20OTHER%20INITIATIVES,%20LIKE%20TARGETED%20TAX%20CONCESSIONS%20OR%20POLITICAL%20RISK%20INSURANCE.%3C/P%3E%3CP%3EYET%20A%20BELIEF%20THAT%20INVESTMENT%20TREATIES%20PROMOTED%20INVESTMENT%20WAS%20CULTIVATED,%20ESPECIALLY%20AMONG%20POTENTIAL%20TREATY%20PARTNERS.%20AFTER%20MEETINGS%20IN%20SIX%20EUROPEAN%20COUNTRIES%20IN%201977,%20AN%20AMERICAN%20OFFICIAL%20MARVELED%20AT%20THE%20DUAL%20MESSAGES:%20%E2%80%9CTHE%20%E2%80%98PROTECTION%E2%80%99%20FUNCTION%20OF%20THE%20%5BTREATIES https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6UMRNAC).%20WHEN%20THE%20AMERICAN%20GOVERNMENT%20WAS%20SERIOUS%20ABOUT%20PROMOTING%20OUTWARD%20INVESTMENT,%20OFFICIALS%20DISCUSSED%20OTHER%20INITIATIVES,%20LIKE%20TARGETED%20TAX%20CONCESSIONS%20OR%20POLITICAL%20RISK%20INSURANCE.%3C/P%3E%3CP%3EYET%20A%20BELIEF%20THAT%20INVESTMENT%20TREATIES%20PROMOTED%20INVESTMENT%20WAS%20CULTIVATED,%20ESPECIALLY%20AMONG%20POTENTIAL%20TREATY%20PARTNERS.%20AFTER%20MEETINGS%20IN%20SIX%20EUROPEAN%20COUNTRIES%20IN%201977,%20AN%20AMERICAN%20OFFICIAL%20MARVELED%20AT%20THE%20DUAL%20MESSAGES:%20%E2%80%9CTHE%20%E2%80%98PROTECTION%E2%80%99%20FUNCTION%20OF%20THE%20%5BTREATIES
2 doi:10.5064/LEGGINGS https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/LEGGINGS
2 doi:10.5064/F6TB14TP%E2%80%8B) https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6TB14TP%E2%80%8B)
Report for 06_2020
Hits DOI URI
(Note: clicking links will record new failures unless these are drafts)
6 doi:10.5064/F6HY9R1Z https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6HY9R1Z
5 doi:10.5064/F6EOZGLB https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6EOZGLB
5 doi:10.5064/F6C8VUHP https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6C8VUHP
3 doi:10.5064/959284 https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/959284
2 doi:10.5064/ABCD https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/ABCD
2 doi:10.5064/F6TLRTD9 https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6TLRTD9
2 doi:10.5064/F6G44N6%20S https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F6G44N6%20S
2 doi:10.5064/20831862.1059210 https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/20831862.1059210
2 doi:10.5064/F68G8HMM. https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/F68G8HMM.
2 doi:10.5064/01.3001.0010.7539 https://dv.dev-aws.qdr.org/dataset.xhtml?persistentId=doi:10.5064/01.3001.0010.7539
or
No new monthly reports from DataCite. Next report expected: 07_2020
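The choice between the two outputs above could look something like this sketch. It assumes the months are tracked with the same `MM_YYYY` labels shown in the reports and that the script remembers the last month it reported; `plan_report` and the helper names are hypothetical.

```python
def parse_label(label):
    """Turn 'MM_YYYY' into a sortable (year, month) tuple."""
    month, year = label.split("_")
    return int(year), int(month)


def next_label(label):
    """Return the 'MM_YYYY' label for the month after `label`."""
    year, month = parse_label(label)
    if month == 12:
        year, month = year + 1, 1
    else:
        month += 1
    return f"{month:02d}_{year}"


def plan_report(available, last_reported):
    """Return (new month labels in order, fallback message or None)."""
    last = parse_label(last_reported)
    new = sorted((m for m in available if parse_label(m) > last), key=parse_label)
    if new:
        return new, None
    message = (
        "No new monthly reports from DataCite. "
        f"Next report expected: {next_label(last_reported)}"
    )
    return [], message
```

Because every month newer than the last reported one is returned, a DataCite update that adds several months at once would produce one "Report for ..." section per month.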
- must start with "10.5064/F6" ? and
- include exactly 6 characters after that?
seems right, yes. I don't think there's any need to make this unnecessarily restrictive -- I just want obvious junk filtered out to not be distracting
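A rough sketch of that filter, in case it helps pin the rule down. The two rules (prefix `10.5064/F6`, then exactly 6 more characters) come from this thread; restricting those 6 characters to letters and digits and comparing case-insensitively are my assumptions, and `is_plausible_qdr_doi` is a hypothetical name.

```python
import re

# Prefix "10.5064/F6" followed by exactly six characters.
# Assumption: the six characters are ASCII letters or digits.
PLAUSIBLE_QDR_DOI = re.compile(r"10\.5064/F6[A-Z0-9]{6}")


def is_plausible_qdr_doi(pid):
    """Return True if `pid` (with or without a 'doi:' prefix) passes the rule."""
    if pid.lower().startswith("doi:"):
        pid = pid[4:]
    return PLAUSIBLE_QDR_DOI.fullmatch(pid.upper()) is not None


reported = ["doi:10.5064/F6UURYON", "doi:10.5064/FLAT", "doi:10.5064/F68G8HMM."]
kept = [pid for pid in reported if is_plausible_qdr_doi(pid)]
```

Applied to the sample DataCite reports, this keeps entries like F6UURYON and F6HY9R1Z and drops FLAT, ABCD, the trailing-punctuation variants, and the long pasted-text entry.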
For the emails -- would this be a single email or two? (Either is fine.) For DataCite, could the report be preceded by an explanatory sentence?
Attempts to resolve plausible QDR DOIs as reported by DataCite
Reports are now running on prod monthly and sending email to multiple recipients. For now, there is no filtering and all DOIs are shown. DataCite still has not updated statistics since June, but the first report after they update should include all months that were added. Dataverse now creates a new monthly log reporting PID failures. After the initial deployment, changes were made to this log to escape \t, \r, and \n characters (which otherwise break parsing the log as a TSV file). The scripts have been added to our dataverse repo (with emails/username/password removed) in the conf/qdr/pidreporting dir. I have local copies with the sender and receiver emails, and we have encrypted copies of the AWS credentials needed.
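To show why the escaping matters, here is a small sketch of both sides. It is illustrative only: the real escaping happens inside Dataverse (Java), the exact column layout of the log isn't given in this thread, and the sketch simply assumes the PID is the first tab-separated column. `escape_field` and `count_failures` are hypothetical names.

```python
from collections import Counter


def escape_field(value):
    """Replace characters that would break one-record-per-line TSV parsing."""
    return (
        value.replace("\\", "\\\\")  # escape backslashes first to stay unambiguous
        .replace("\t", "\\t")
        .replace("\r", "\\r")
        .replace("\n", "\\n")
    )


def count_failures(lines):
    """Count hits per PID, most frequent first (assumes PID is column 1)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:
            counts[fields[0]] += 1
    return counts.most_common()
```

With the fields escaped on the way in, a junk PID containing tabs or newlines stays on one line and in one column, so a simple split like the one above is enough to build the "Hits / DOI" summary.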
Unless further changes are desired, this should run monthly without any intervention. We may want to manually check DataCite periodically to see if they've updated. (If they change their file format when they update new stats, presumably our script will break rather than just not reporting the new info, but some changes might just look like no new results.)
We want to capture both unpublished/non-existent DOIs that are linked to and unpublished URLs, so we need two different data sources:
Let's try to get something approximately along these lines:
10.5064/F68G8HMM.
and 10.5064/ABCD
in the DataCite resolution logs -- those aren't relevant to us)