NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

[Parser Fix]: Adjust DDE's includedInDataCatalog field #117

Open gtsueng opened 8 months ago

gtsueng commented 8 months ago

Background: In the future, there will be two portals through which metadata records will be ingested into the DDE. The crawler should be updated to reflect this change. We need to be able to distinguish records coming from the NIAID SysBio portal from records coming from the NDE portal.

Rationale: Multiple portals may be added to the DDE in the future, and their records will be ingested into the NDE. We need to be able to filter for datasets that specifically came from the NIAID SysBio portal and distinguish them from datasets coming from a different DDE portal. At the same time, we need to be able to identify all datasets coming from the NDE (of which the NIAID SysBio and other portals may be a part).

Potential approach

gtsueng commented 5 months ago

@jal347 The following records in the DDE were added to the newly created NDE portal on the DDE and can be used to check the parser changes:

The following properties are in the ResourceCatalog Schema but are NOT in the Dataset schema for the NDE, so the ES mapping may need to be revisited:

gtsueng commented 5 months ago

Everything is functioning well. Note that we are waiting to finalize the formatting of DefinedTerms from the DDE entry point. This may or may not affect how the data is parsed, which is why this issue has yet to be closed.

gtsueng commented 4 months ago

@jal347 The ResourceCatalog schema has been updated to have 2 additional recommended properties that will need to be mapped:

One thing to note is that all the DefinedTerm fields ('infectiousAgent', 'species', 'healthCondition', 'variableMeasured', 'measurementTechnique') are being treated as though PubTator is the curator. This should only be the case if any of the string values are NOT URIs. If they are URIs, the term was curated by whoever submitted the data to the DDE, so set the curator to be the DDE.
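A minimal sketch of the curator rule described above, assuming each DefinedTerm field arrives as a string or a list of strings. The helper names (`is_uri`, `curator_for`) and the curator labels `"PubTator"` and `"DDE"` are hypothetical and are not taken from the actual parser:

```python
from urllib.parse import urlparse

# Fields named in this comment; the tuple name itself is hypothetical.
DEFINED_TERM_FIELDS = (
    "infectiousAgent",
    "species",
    "healthCondition",
    "variableMeasured",
    "measurementTechnique",
)


def is_uri(value):
    """Return True if value looks like an http(s) URI."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def curator_for(values):
    """Pick a curator label for one DefinedTerm field.

    If any string value is not a URI, the terms are assumed to have been
    normalized by PubTator; if every string value is a URI, the submitter
    curated them, so the curator is the DDE.
    """
    if not isinstance(values, list):
        values = [values]
    strings = [v for v in values if isinstance(v, str)]
    if any(not is_uri(v) for v in strings):
        return "PubTator"
    return "DDE"
```

How fields that contain no string values at all should be handled is not specified in this thread; the sketch defaults to `"DDE"` in that case.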

Related issue: https://github.com/NIAID-Data-Ecosystem/nde-portal/issues/222

gtsueng commented 4 months ago

Please see https://github.com/NIAID-Data-Ecosystem/nde-crawlers/issues/124 for a detailed explanation and example of the issue mentioned in the previous comment.

gtsueng commented 4 months ago

The includedInDataCatalog.name appears to be correct, but the includedInDataCatalog.url appears to be swapped.

This is causing a mismatch between the name displayed and the logo.

gtsueng commented 4 months ago

The improvements are now on staging. They will not be seen in production, as NDE/DDE-ingested resource catalogs are not yet approved for production.

gtsueng commented 1 month ago

The incorrect metadata does not appear to have been fixed. @jal347 please address and do a fresh pull/build.

For example, the top result for this search: https://data.niaid.nih.gov/search?q=BV-BRC appears to be correct, but the second result (the Dataset) has:

"includedInDataCatalog": [{"name": "Data Discovery Engine", "url": "https://discovery.biothings.io/dataset/55508b2aa74742a8"}, {"name": "NDE Systems Biology", "url": "https://discovery.biothings.io/dataset/55508b2aa74742a8"}]

This is incorrect, as this Dataset was submitted through the NIAID Systems Biology site.

Rather than what it has currently, it should be:

"includedInDataCatalog": [{"name": "Data Discovery Engine", "url": "https://discovery.biothings.io/dataset/55508b2aa74742a8"}, {"name": "Data Discovery Engine, NIAID Systems Biology", "url": "https://discovery.biothings.io/dataset/55508b2aa74742a8"}, {"name": "Data Discovery Engine, NIAID Data Ecosystem", "url": "https://discovery.biothings.io/dataset/55508b2aa74742a8"}]

This way, if we filter by "includedInDataCatalog.name" = "Data Discovery Engine, NIAID Data Ecosystem", we get everything submitted via the NDE and NIAID SysBio portals, but if we filter by "includedInDataCatalog.name" = "Data Discovery Engine, NIAID Systems Biology", we only get the records submitted via the NIAID SysBio portal.
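The filtering behavior described above can be illustrated with a small sketch. The sample records, their `identifier` values, and the `filter_by_catalog` helper are hypothetical; only the catalog names come from this comment:

```python
NDE = "Data Discovery Engine, NIAID Data Ecosystem"
SYSBIO = "Data Discovery Engine, NIAID Systems Biology"

# Hypothetical records: one submitted via the NIAID SysBio portal,
# one submitted via the NDE portal directly.
records = [
    {
        "identifier": "sysbio-example",
        "includedInDataCatalog": [
            {"name": "Data Discovery Engine"},
            {"name": SYSBIO},
            {"name": NDE},
        ],
    },
    {
        "identifier": "nde-example",
        "includedInDataCatalog": [
            {"name": "Data Discovery Engine"},
            {"name": NDE},
        ],
    },
]


def filter_by_catalog(records, catalog_name):
    """Keep records whose includedInDataCatalog lists catalog_name."""
    matches = []
    for record in records:
        catalogs = record.get("includedInDataCatalog", [])
        if isinstance(catalogs, dict):  # tolerate a single catalog object
            catalogs = [catalogs]
        if any(c.get("name") == catalog_name for c in catalogs):
            matches.append(record)
    return matches
```

Filtering on `NDE` returns both sample records, while filtering on `SYSBIO` returns only the one submitted via the SysBio portal.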

gtsueng commented 1 month ago

It looks like what's on staging is behaving correctly. Please push the fixes from staging to production.

gtsueng commented 4 weeks ago

Note that a record has finally been submitted via the CREID portal in the DDE.

@jal347, please update the DDE parser such that if the '@context' for a DDE-ingested Dataset includes "creid": "https://discovery.biothings.io/view/creid/",

the includedInDataCatalog field is updated to include:

[
  {
    "@type": "DataCatalog",
    "name": "NIAID CREID Network",
    "url": <url>
  },
  {
    "@type": "DataCatalog",
    "name": "NIAID Data Ecosystem",
    "url": <url>
  }
]
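
One way the requested check could look, as a sketch only. The function names are hypothetical, and the `url` argument stands in for the `<url>` placeholder above, which this thread leaves unspecified:

```python
CREID_KEY = "creid"
CREID_CONTEXT_URL = "https://discovery.biothings.io/view/creid/"


def creid_catalogs(record, url):
    """Return the CREID catalog entries if the record's @context matches."""
    context = record.get("@context")
    if not isinstance(context, dict):
        return []
    if context.get(CREID_KEY) != CREID_CONTEXT_URL:
        return []
    return [
        {"@type": "DataCatalog", "name": "NIAID CREID Network", "url": url},
        {"@type": "DataCatalog", "name": "NIAID Data Ecosystem", "url": url},
    ]


def add_creid_catalogs(record, url):
    """Return a copy of record with CREID catalogs appended, skipping duplicates."""
    existing = record.get("includedInDataCatalog") or []
    if isinstance(existing, dict):  # tolerate a single catalog object
        existing = [existing]
    names = {c.get("name") for c in existing}
    extra = [c for c in creid_catalogs(record, url) if c["name"] not in names]
    return {**record, "includedInDataCatalog": list(existing) + extra}
```

Records without the CREID entry in '@context' pass through with their existing includedInDataCatalog unchanged.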

gtsueng commented 4 weeks ago

An example dataset ingested via CREID portal: https://discovery.biothings.io/api/dataset/e2730ebc22a0f38a