Coleridge-Initiative / RCDatasets

Creative Commons Zero v1.0 Universal

Cross-link ADRF dataset identifiers #207

Closed ceteri closed 4 years ago

ceteri commented 4 years ago

We need to cross-link our ADRF dataset identifiers to the corresponding entries in datasets.json -- for example, see datasets that include adrf_title fields.
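As a minimal sketch of that lookup, assuming datasets.json is a JSON array of objects and that cross-linked entries carry an `adrf_title` key (the `id` and `title` keys and the sample values below are hypothetical, used only for illustration):

```python
import json

def entries_with_adrf_title(datasets):
    """Return the entries that already carry a non-empty `adrf_title` field."""
    return [d for d in datasets if d.get("adrf_title")]

# Hypothetical sample in the assumed shape of datasets.json;
# the real file would be read with json.load(open("datasets.json")).
sample = json.loads("""
[
  {"id": "dataset-001", "title": "Example Survey",
   "adrf_title": "Example Survey (ADRF extract)"},
  {"id": "dataset-002", "title": "Another Dataset"}
]
""")

linked = entries_with_adrf_title(sample)
print([d["id"] for d in linked])
```

Entries missing from that list are the ones that still need an ADRF identifier, or that have no ADRF counterpart at all.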

There is a potential blocker in terms of scope, since some of the ADRF datasets are subsets of the datasets described here.

Probably best to handle this in three steps:

  1. @ceteri and @claytonrsh work through a sample of edge cases, to determine what our policy will be for handling scope/overlap
  2. someone from the RC team takes this issue to get coverage for the other cases.
  3. confirm with @grahamulator that all of the datasets in the Data Stewardship module are available in datasets.json with the necessary identifiers.

ernestogimeno commented 4 years ago

> 1. @ceteri and @claytonrsh work through a sample of edge cases, to determine what our policy will be for handling scope/overlap

We discussed this with @ceteri and Julia first, and then with @ceteri and @grahamulator. We resolved to add a field to the ADRF database that groups "families" of datasets together using the RC dataset_id.
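A rough sketch of what that grouping could look like, assuming each ADRF record carries an optional reference to an RC dataset_id (the field names `adrf_id` and `rc_dataset_id` and the sample records are hypothetical, not the actual ADRF schema):

```python
from collections import defaultdict

def group_families(adrf_records):
    """Group ADRF dataset ids by the RC dataset_id they are linked to."""
    families = defaultdict(list)
    for rec in adrf_records:
        rc_id = rec.get("rc_dataset_id")
        if rc_id is not None:  # unlinked records belong to no family
            families[rc_id].append(rec["adrf_id"])
    return dict(families)

# Hypothetical records: two ADRF subsets of the same RC dataset, one unlinked.
records = [
    {"adrf_id": "adrf-a", "rc_dataset_id": "dataset-001"},
    {"adrf_id": "adrf-b", "rc_dataset_id": "dataset-001"},
    {"adrf_id": "adrf-c", "rc_dataset_id": None},
]
print(group_families(records))
```

With this shape, several ADRF datasets that are subsets of one RC dataset end up in the same family without needing separate RC entries.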

I'm working on step 2.

ernestogimeno commented 4 years ago

Step 2 finished. I uploaded the links to the Dataset Audit List spreadsheet as agreed with the ADRF team.

I linked 99 of the 135 ADRF datasets, plus another 17 lower-probability links that still need to be reviewed.

@ceteri Should I close this task, or should I wait until those 17 cases are reviewed and then try to import the missing datasets? Also, we have a JSON file with 184 publications linked to ADRF datasets. Should we try to import them?

ceteri commented 4 years ago

Nice work! If I understand correctly, there are ~35 items in the spreadsheet labeled Unknown and 17 of those are unique cases?

If so, yes let's add those to datasets.json so that Graham has full coverage in the next index. And yes, it's great to add the 184 publications as a new partition.

ernestogimeno commented 4 years ago

The 35 items (ADRF datasets) with the Unknown label are all unique ADRF datasets. Of those, I labeled 17 as probable matches with RC datasets, plus a few more that I think are unlikely to match but could. In total there are 23 ADRF datasets with a proposed link, which Clayton is now reviewing here: https://docs.google.com/spreadsheets/d/1AEdnI-HjeVTWYNG13s92GDLpIct8GqYQPdyrQjzhBLg/edit#gid=623525736
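For illustration only, here is one way candidate links could be ranked by title similarity before a manual review. This is not a description of how the links above were produced; the function name, the labels, and both thresholds are assumptions chosen for the sketch.

```python
from difflib import SequenceMatcher

def best_match(adrf_title, rc_titles, linked_at=0.9, probable_at=0.6):
    """Return (label, rc_title, score) for the RC title closest to adrf_title.

    The thresholds are arbitrary illustrative values, not tuned ones.
    """
    if not rc_titles:
        return ("unknown", None, 0.0)

    def score(title):
        # Case-insensitive similarity ratio in [0.0, 1.0]
        return SequenceMatcher(None, adrf_title.lower(), title.lower()).ratio()

    best = max(rc_titles, key=score)
    s = score(best)
    if s >= linked_at:
        label = "linked"
    elif s >= probable_at:
        label = "probable"
    else:
        label = "unknown"
    return (label, best, s)

# Hypothetical titles, used only to exercise the function.
print(best_match("Example Survey", ["example survey", "Other Dataset"]))
```

Anything labeled "probable" would still go to a human reviewer, as with the proposed links in the sheet.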

I also asked him to identify which of those unlinked ADRF datasets should not be added to RC. Clayton is documenting some edge cases using comments in that sheet.

Once I have the list of ADRF datasets that should be included in datasets.json, can I use the metadata from ADRF or should I try to find it online?

ceteri commented 4 years ago

The datasets described in ADRF should have much of what's needed, although it's probably best to double-check by searching online for the new ones being added.

ernestogimeno commented 4 years ago

The crosslink between ADRF datasets and RC datasets (using UUID) has been uploaded to the Dataset Audit List that the ADRF team is using to clean up the dataset catalog.