MaayanLab / FAIRshake

https://fairshake.cloud

Automated checks in FAIRshake #155

Closed karstenpeters closed 3 years ago

karstenpeters commented 3 years ago

Hi,

While running a number of tests on datasets hosted in our repository (WDCC, World Data Center for Climate), I have come across some issues with the automatic parts of the evaluation. I am using the "FAIRshake dataset rubric" for assessment.

I will take the following dataset as an example:

https://cera-www.dkrz.de/WDCC/ui/cerasearch/entry?acronym=ssp585_r2i1p1f1-eh6_rcm_c6

The json-ld part is seen here:

https://cera-www.dkrz.de/WDCC/ui/cerasearch/entry?acronym=ssp585_r2i1p1f1-eh6_rcm_c6&exporttype=json-ld

The json-ld metadata comply with the schema.org standard.

1) Dataset identification: The json-ld contains ample information identifying the dataset. However, the first test fails with the message "json-ld WebSite.identifier DataCatalog.identifier Dataset.identifier not found".

2) Dataset access: Access to datasets hosted in WDCC is free of charge. We do, however, require authentication to access our datasets, so for practical purposes we do not provide URLs to the datasets in the json-ld (some of the datasets are >100TB in size). The json-ld metadata do contain the information "isAccessibleForFree": true. Nevertheless, the FAIRshake test for "The dataset can be downloaded for free from the repository" fails. Is it strictly required that FAIRshake can actually download the data, or would the information we provide in the json-ld suffice to pass that test automatically?

3) The json-ld contains a long list of dataset creators (albeit with no email attached), and the landing page also contains a "contacts" tab. However, the FAIRshake test "Contact information is provided for the creator(s) of the dataset." does not automatically recognise this information.

4) We clearly provide citation information on the landing page of the dataset, although the citation only refers to the parent project. So in this case, the dataset can only be cited as part of a collection, which is a viable approach. This information is also somewhat contained in the json-ld: "isPartOf": [ { "@type": "Dataset", "@id": "https://doi.org/10.26050/WDCC/RCM_CMIP6_SSP585-HR_r2i1p1f1", "name": "CMIP6 ScenarioMIP DWD MPI-ESM1-2-HR ssp585_r2i1p1f1 - RCM-forcing data" } ]. However, the FAIRshake test "Information is provided describing how to cite the dataset." does not recognise this information.

5) Licensing: We also provide licensing information on the landing page of the dataset. On the landing page, this is called "use constraints", but it refers to CC-BY 4.0. In the json-ld, the syntax is as follows: "license": "https://creativecommons.org/licenses/by/4.0/". However, the FAIRshake test "Licensing information is provided on the dataset’s landing page." does not detect any licensing information and yields "No" as result.
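To make the points above concrete, the fragments quoted in this report can be put into one small document and inspected property by property. This is only a sketch: the sample contains just the fragments quoted here, not the full WDCC record, and the list of looked-for property names is an assumption based on the test message in point 1 and on schema.org naming. It is not FAIRshake's own code.

```python
import json

# Only the fragments quoted in this report; the real WDCC record has many more fields.
SAMPLE = json.loads("""
{
  "isAccessibleForFree": true,
  "isPartOf": [
    {
      "@type": "Dataset",
      "@id": "https://doi.org/10.26050/WDCC/RCM_CMIP6_SSP585-HR_r2i1p1f1",
      "name": "CMIP6 ScenarioMIP DWD MPI-ESM1-2-HR ssp585_r2i1p1f1 - RCM-forcing data"
    }
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
""")

# Hypothetical list of top-level properties an automated check might look for.
LOOKED_FOR = ("identifier", "isAccessibleForFree", "citation", "license")


def present_properties(doc, names=LOOKED_FOR):
    """Map each property name to whether it has a non-empty top-level value."""
    return {name: doc.get(name) not in (None, "", [], {}) for name in names}


def parent_ids(doc):
    """Collect the @id of every isPartOf entry (single object or list)."""
    parents = doc.get("isPartOf", [])
    if isinstance(parents, dict):
        parents = [parents]
    return [p["@id"] for p in parents if isinstance(p, dict) and "@id" in p]


if __name__ == "__main__":
    print(present_properties(SAMPLE))
    print(parent_ids(SAMPLE))
```

On these fragments alone, the license and isAccessibleForFree properties are present at the top level, while the citable DOI is only reachable through isPartOf.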

The good thing about FAIRshake is that all the answers which were not properly filled in by the algorithm can still be amended manually. But it would of course be more attractive if FAIRshake recognised the information provided on both the landing page and in the json-ld.

I hope this may help in improving FAIRshake.

Thanks very much, Karsten

u8sand commented 3 years ago

Karsten,

Thank you very much for your report; we'll certainly try to address these things. In the meantime, here are some 'answers' to the reasons some of these may be failing.

  1. Dataset identification: as you see from the comment -- we're currently looking for the identifier property; however, using the DOI in the @id field seems like it should be valid as well.
  2. We do check for the isAccessibleForFree property, so this will work shortly *.
  3. The contact check was originally looking for a field, contact, which actually seems to be outside the spec; it should probably be looking for a ContactPoint instead. It seems clear that certain information attached to an Organization should also be suitable for this metric. However, contact information on a webpage is not something that can be automatically asserted; the automated assessments are currently restricted to machine-readable contracts. This webpage might make sense attached to a ContactPoint (i.e. { "@type": "ContactPoint", "url": ... }).
  4. The citation check specifically looks for the citation property. Again, the DOI can be used for this purpose; I'm still unsure whether it makes sense to waive the citation requirement in the presence of a DOI, and may investigate e.g. Google Dataset Search to see if the lack of a citation has implications for Findability.
  5. This license information should also have been detected and will work shortly *.

*I suspect the checks that should be working but aren't fail because the @context is a list; this is valid json-ld syntax, so it will certainly be addressed.
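The ideas in the answers above could look roughly like the following sketch. It is not FAIRshake's implementation: the function names are hypothetical, and the checks are a simplified illustration of accepting an @context given as a string, object, or list, falling back from identifier to @id, and looking for ContactPoint objects attached to creators.

```python
def as_list(value):
    """Wrap a single value in a list; treat a missing value as empty."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def context_mentions_schema_org(doc):
    """True if @context references schema.org as a string, object, or list."""
    for entry in as_list(doc.get("@context")):
        if isinstance(entry, str) and "schema.org" in entry:
            return True
        if isinstance(entry, dict) and any(
            isinstance(v, str) and "schema.org" in v for v in entry.values()
        ):
            return True
    return False


def find_identifier(doc):
    """Prefer the identifier property; fall back to @id (e.g. a DOI URL)."""
    return doc.get("identifier") or doc.get("@id")


def contact_points(doc):
    """Collect ContactPoint objects attached to the dataset's creators."""
    found = []
    for creator in as_list(doc.get("creator")):
        if not isinstance(creator, dict):
            continue
        for point in as_list(creator.get("contactPoint")):
            if isinstance(point, dict) and point.get("@type") == "ContactPoint":
                found.append(point)
    return found


if __name__ == "__main__":
    # Placeholder document; all values here are made up for illustration.
    example = {
        "@context": ["https://schema.org/", {"ex": "https://example.org/"}],
        "@id": "https://example.org/dataset/1",
        "creator": {
            "@type": "Organization",
            "name": "Example Org",
            "contactPoint": {
                "@type": "ContactPoint",
                "url": "https://example.org/contact",
            },
        },
    }
    print(context_mentions_schema_org(example))
    print(find_identifier(example))
    print(contact_points(example))
```

A full solution would more likely expand the document with a JSON-LD processor before checking it, so that every valid @context form is handled uniformly; the string matching here is only a stand-in for that step.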

Update:

u8sand commented 3 years ago

With no further action items, I'll close this issue. Do let us know if you have any further questions.