NCEAS / metadig-checks

MetaDIG suites and checks for data and metadata improvement and guidance.
Apache License 2.0

resource.URLs.resolvable failing for private metadata documents #437

Open gothub opened 2 years ago

gothub commented 2 years ago

This check verifies whether the metadata identifier is a resolvable URL. If the identifier is a bare string (not starting with http: or doi:), the DataONE resolve service is called for the identifier. The check passes if the identifier resolves with the DataONE service.
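A rough sketch of how the identifier could be turned into the URL that gets requested. The function name `url_to_check` and the handling of `doi:` identifiers via doi.org are assumptions for illustration; the resolve endpoint shown is the DataONE coordinating node's `/v2/resolve/` path, and none of this is taken from the actual check code.

```python
from urllib.parse import quote

# Assumed endpoint for the DataONE coordinating node resolve service.
DATAONE_RESOLVE = "https://cn.dataone.org/cn/v2/resolve/"


def url_to_check(identifier: str) -> str:
    """Return the URL that would be requested for a metadata identifier."""
    ident = identifier.strip()
    lowered = ident.lower()
    if lowered.startswith(("http:", "https:")):
        # Already a URL: request it as-is.
        return ident
    if lowered.startswith("doi:"):
        # Assumption: a doi: identifier is checked through the doi.org resolver.
        return "https://doi.org/" + ident[4:].lstrip("/")
    # Bare string: percent-encode it and ask the DataONE resolve service.
    return DATAONE_RESOLVE + quote(ident, safe="")
```

For example, `url_to_check("ess-dive-7fc993dad587390-20220324T214014280")` would produce a URL under the resolve endpoint, while an `https://doi.org/...` identifier is returned unchanged.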

If the DataONE id is private, the check will fail. The metadig-engine has privilege to read the metadata and sysmeta for the pid, but the checks themselves are run by the engine without privilege, so the request is made as the public user and an HTTP 401 status is returned.

For DataONE ids only, should the check detect 401 status and return a message such as:

The identifier <id here> is resolvable using the DataONE resolve service, but is private

... or something else?

gothub commented 2 years ago

@mbjones @vchendrix @JEDamerow
what are your thoughts on this?

gothub commented 2 years ago

If the identifier is resolvable via the DataONE resolve service, but is not publicly readable, the message is printed:

The metadata identifier was found and is resolvable using the DataONE resolve service, but is not publicly readable.

This change was made in commit f1ac27e8ce280fb221503d224de0ac20279483ee
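A rough sketch of the behavior described above, not the code from that commit. Only the "not publicly readable" message comes from this thread; the function name `interpret_resolve_status` and the other two messages are hypothetical.

```python
from typing import Tuple

# Message quoted in this thread for private identifiers.
NOT_PUBLIC_MSG = (
    "The metadata identifier was found and is resolvable using the "
    "DataONE resolve service, but is not publicly readable."
)


def interpret_resolve_status(status: int) -> Tuple[bool, str]:
    """Map an HTTP status from the resolve service to (passed, message)."""
    if status == 401:
        # The pid exists but the unprivileged check cannot read it: still a pass.
        return True, NOT_PUBLIC_MSG
    if 200 <= status < 400:
        # Hypothetical wording for the ordinary success case.
        return True, "The metadata identifier is resolvable using the DataONE resolve service."
    # Hypothetical wording for any other status.
    return False, f"The metadata identifier did not resolve (HTTP status {status})."
```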

gothub commented 2 years ago

With the ess-dive-1.1.0 suite release in production:

It appears that this fix is not working in production, see https://data.ess-dive.lbl.gov/quality/ess-dive-7fc993dad587390-20220324T214014280. This pid does exist, but is not readable, as the following URL produces the msg shown below:

https://cn.dataone.org/cn/v2/meta/ess-dive-7fc993dad587390-20220324T214014280
<error detailCode="1040" errorCode="401" name="NotAuthorized">
<description>READ not allowed on ess-dive-7fc993dad587390-20220324T214014280 for subject[s]: public; </description>
</error>

So this check should pass, but it is marked as failed in the assessment report.
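One way to recognize this case from the response body rather than the status code alone is to parse the error document shown above. This is only a sketch: `is_not_authorized` is a made-up helper name, and it assumes the error body always has the shape of the sample in this comment.

```python
import xml.etree.ElementTree as ET

# Error body copied from the response shown in this comment.
SAMPLE = """<error detailCode="1040" errorCode="401" name="NotAuthorized">
<description>READ not allowed on ess-dive-7fc993dad587390-20220324T214014280 for subject[s]: public; </description>
</error>"""


def is_not_authorized(body: str) -> bool:
    """True if the body is a DataONE-style NotAuthorized (401) error document."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not XML at all, so it cannot be the error document we are looking for.
        return False
    return (
        root.tag == "error"
        and root.get("errorCode") == "401"
        and root.get("name") == "NotAuthorized"
    )
```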

Also, as this is the only 'Assessment' type report (from FAIR categories), the display of the assessment report does not show a 'progress bar' at the top of the report for the 'Assessment' category. (File a separate metacatui issue for this).

emilyarobles commented 2 years ago

The URL https://doi.org/10.1002/2017WR020471 included in the metadata of data package https://data.ess-dive.lbl.gov/view/ess-dive-c4d31960b81d845-20220406T213905525 is causing a failed metadata.URLs.resolvable check. The DOI is resolvable and correctly points to a dataset landing page, so this check should pass.

emilyarobles commented 2 years ago

Private dataset submitted for publication https://data.ess-dive.lbl.gov/quality/ess-dive-838a0b1a47f1695-20220414T225815875 is incorrectly failing the metadata.identifier.resolvable check. This dataset utilizes the "sameAs" external linking relationship. The Accessible check category is not showing on the assessment report.

gothub commented 2 years ago

@emilyarobles @val the URL mentioned above resolves to https://agupubs.onlinelibrary.wiley.com/doi/10.1002/2017WR020471. It appears that this website does not allow web user agents such as Python and R to send HTTP HEAD requests to see if a web page exists. This is what I think is happening when the assessment check tries to check the DOI:

I'm open to suggestions on how to proceed with resolving this issue.
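One possible approach, sketched here only for discussion and not something adopted in this thread, is to retry with a GET request and a browser-like User-Agent when the HEAD request is refused. The helper names `http_status` and `probe` and the `Mozilla/5.0` header value are assumptions, and some publishers may still block the request.

```python
import urllib.error
import urllib.request


def http_status(url, method="HEAD", user_agent="Mozilla/5.0", timeout=10):
    """Return the HTTP status code for one request (network errors propagate)."""
    req = urllib.request.Request(
        url, method=method, headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; keep just the code.
        return err.code


def probe(url, status_fn=http_status):
    """Try HEAD first, then fall back to GET if the server rejects it."""
    status = status_fn(url, method="HEAD")
    if status >= 400:
        status = status_fn(url, method="GET")
    return status
```

`status_fn` is injectable so the fallback logic can be exercised without touching the network.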

gothub commented 2 years ago

@emilyarobles @Val At the 2022-04-19 tech meeting, it was recommended to modify the check so that, if the metadata id is a DOI URL and a 503 status is returned, it just checks whether the URL is a valid registered DOI. I will check if doi.org supports this type of query.
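A minimal sketch of what such a registration lookup might look like. It assumes the doi.org proxy's handle lookup endpoint (`https://doi.org/api/handles/<doi>`) answering with JSON whose `responseCode` is 1 for a registered DOI; this thread does not confirm that the endpoint was used, and the helper names are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import quote, urlparse


def doi_from_url(url: str) -> Optional[str]:
    """Extract the DOI from a doi.org URL, or return None if it is not one."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("doi.org", "dx.doi.org"):
        return None
    doi = parsed.path.lstrip("/")
    return doi if doi.startswith("10.") else None


def handle_api_url(doi: str) -> str:
    """Build the (assumed) handle lookup URL for a DOI."""
    return "https://doi.org/api/handles/" + quote(doi, safe="/")


def is_registered(api_body: str) -> bool:
    """True if the lookup response reports success (assumed responseCode 1)."""
    try:
        return json.loads(api_body).get("responseCode") == 1
    except (ValueError, AttributeError):
        # Body was not JSON, or not a JSON object.
        return False
```

The check would call `handle_api_url` only after the landing page itself returned 503, then pass the response body to `is_registered`.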

The text returned by the check would then be:

gothub commented 2 years ago

Note that the check that is actually being run in the ESS-DIVE suite is named 'resource.URLs.resolvable'. The initial issue for this check is here.

emilyarobles commented 2 years ago

@gothub The following links are incorrectly being flagged as unresolvable:

Dataset 1: https://doi.org/10.1021/acsearthspacechem.2c00031

Dataset 2: https://earthdata.nasa.gov; https://daymet.ornl.gov

Dataset 3: https://doi.org/10.1890/12-1243.1; https://doi.org/10.1890/13-1313.1; https://doi.org/10.1038/ismej.2016.122

Dataset 4: https://doi.org/10.1890/09-0889.1

vchendrix commented 2 years ago

@emilyarobles @charuleka @JEDamerow We have discussed this issue in the ESS-DIVE/NCEAS tech meeting and it is recommended that we move the get URL checks to be warnings when they fail, as we cannot control how publishers respond to programmatic access of their publication pages. Ping me if you would like to discuss in person.

JEDamerow commented 2 years ago

Ok, sounds reasonable to me.

jeanetteclark commented 1 year ago

circling back to this issue based on some feedback from @JEDamerow. Unfortunately there isn't an easy fix for the DOI URLs. Based on the comment above, I'll change the check to optional, as well as add a few more response codes to the passing list.
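To illustrate the idea, here is a small sketch of a passing list combined with a warning outcome for everything else. The specific extra codes are placeholders (401 and 503 are simply the ones mentioned earlier in this thread); the actual list is not given here, and `summarize` is a hypothetical helper.

```python
from typing import Dict, FrozenSet

BASE_PASSING = {200}
# Placeholder extra codes; the real additions are not listed in this thread.
EXTRA_PASSING = {401, 503}
PASSING: FrozenSet[int] = frozenset(BASE_PASSING | EXTRA_PASSING)


def summarize(statuses: Dict[str, int], passing: FrozenSet[int] = PASSING) -> Dict[str, str]:
    """Label each URL 'pass' or 'warning'; an optional check never hard-fails."""
    return {
        url: ("pass" if code in passing else "warning")
        for url, code in statuses.items()
    }
```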