NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

Import metadata from PMC #37

Open flaneuse opened 2 years ago

flaneuse commented 2 years ago

Pull all publication metadata from PMC, either via the OAI-PMH service or via the bulk open access data or other APIs. Example OAI-PMH request for a single record:

https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=GetRecord&identifier=oai:pubmedcentral.nih.gov:8313480&metadataPrefix=pmc
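A minimal sketch of what the harvesting step could look like, assuming the endpoint and identifier scheme shown in the URL above. The function names (`build_get_record_url`, `fetch_record`, `extract_summary`) and the shape of the returned dict are hypothetical, not part of nde-crawlers. The element names `article-title`, `funding-source` and `award-id` are standard JATS tags, but which of them appear in a given PMC record is an assumption to check against real responses.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint exactly as given in this issue; it may change over time.
OAI_BASE = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"


def build_get_record_url(pmcid: str, metadata_prefix: str = "pmc") -> str:
    """Build an OAI-PMH GetRecord URL for a numeric PMC ID."""
    params = {
        "verb": "GetRecord",
        "identifier": f"oai:pubmedcentral.nih.gov:{pmcid}",
        "metadataPrefix": metadata_prefix,
    }
    # Keep ':' unescaped so the URL matches the form used in the issue.
    return f"{OAI_BASE}?{urlencode(params, safe=':')}"


def fetch_record(pmcid: str) -> str:
    """Download the raw XML for one record (performs a network call)."""
    with urlopen(build_get_record_url(pmcid)) as response:
        return response.read().decode("utf-8")


def _local(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_summary(xml_text: str) -> dict:
    """Collect the first article title plus funder names and award IDs."""
    summary = {"title": None, "funders": [], "award_ids": []}
    for element in ET.fromstring(xml_text).iter():
        name = _local(element.tag)
        text = "".join(element.itertext()).strip()
        if name == "article-title" and summary["title"] is None:
            # First occurrence in document order; later ones may be cited works.
            summary["title"] = text
        elif name == "funding-source":
            summary["funders"].append(text)
        elif name == "award-id":
            summary["award_ids"].append(text)
    return summary
```

Matching on local tag names sidesteps the question of which JATS namespace version a record declares, at the cost of being less strict about where in the document an element appears.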

Using this data, there are a number of things we can do:

  1. Create new datasets from the supplementary materials files

  2. Create Dataset -> Publication -> Grant linkages for the existing datasets in the NDE. If a dataset has a citation listed, augment its existing metadata by attaching the funding information that PMC provides for that publication.

  3. Parse data availability statements to determine where the data is deposited, and link to existing datasets.

  4. Use regex parsing to mine the text of the document for similar linkages between publications and a small subset of repositories with consistent identifiers. Ideally, this would disambiguate between primary citations (data generation) and secondary citations (publications which reuse the data).

  5. Add further Dataset -> Publication -> Grant linkages via PMC "related information" structured metadata.

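As a rough illustration of items 3 and 4, here is a sketch of regex-based identifier mining. The repositories chosen (GEO series, BioProject, Zenodo DOIs) are only examples of sources with consistent identifiers, not a list agreed on in this issue. `ACCESSION_PATTERNS`, `find_accessions` and `guess_roles` are hypothetical names, and treating a mention inside a "data availability" section as a likely primary citation is an untested assumption rather than a validated rule.

```python
import re

# Illustrative patterns only; the real set of repositories is still to be decided.
ACCESSION_PATTERNS = {
    "GEO": re.compile(r"\bGSE\d+\b"),
    "BioProject": re.compile(r"\bPRJ(?:NA|EB|DB)\d+\b"),
    "Zenodo": re.compile(r"\b10\.5281/zenodo\.\d+\b"),
}


def find_accessions(text: str) -> list:
    """Return sorted, de-duplicated (repository, identifier) pairs found in text."""
    hits = set()
    for repo, pattern in ACCESSION_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.add((repo, match.group(0)))
    return sorted(hits)


def guess_roles(sections: dict) -> list:
    """Tag each mention with a crude role guess based on its section title.

    `sections` maps a section title to that section's plain text.
    Assumption: identifiers in a data availability statement are more
    likely to be primary citations; everything else is left as 'unknown'.
    """
    results = []
    for title, text in sections.items():
        in_das = "data availability" in title.lower()
        for repo, identifier in find_accessions(text):
            results.append({
                "repository": repo,
                "identifier": identifier,
                "section": title,
                "role_guess": "primary_candidate" if in_das else "unknown",
            })
    return results
```

Separating secondary citations from primary ones would likely need more context than the section title, for example the surrounding sentence or whether the dataset's own metadata already cites the publication.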