Open gtsueng opened 8 months ago
@jal347 The following records in the DDE were added to the newly created NDE portal on the DDE and can be used to check the parser changes:
The following properties are in the ResourceCatalog Schema but are NOT in the Dataset schema for the NDE, so the ES mapping may need to be revisited:
Everything is functioning well. Note that we are waiting to finalize the formatting of DefinedTerms from the DDE entry point. This may or may not affect how the data is parsed, which is why this issue has yet to be closed.
@jal347 The ResourceCatalog schema has been updated to have 2 additional recommended properties that will need to be mapped:
hasDownload: A text field property
hasAPI: A boolean property
One thing to note is that all the DefinedTerm fields ('infectiousAgent', 'species', 'healthCondition', 'variableMeasured', 'measurementTechnique') are being treated as though PubTator is the curator. This should only be the case if any of the string values are NOT URIs. If they are URIs, the term was curated by whoever submitted the data to the DDE, so set the curator to be the DDE.
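A minimal sketch of the curator rule described above, assuming each DefinedTerm field arrives as a list of plain string values. The function names and the use of bare strings ("DDE", "PubTator") for the curator are hypothetical and are not taken from the actual parser.

```python
from urllib.parse import urlparse

# DefinedTerm fields named in this issue.
DEFINED_TERM_FIELDS = (
    "infectiousAgent",
    "species",
    "healthCondition",
    "variableMeasured",
    "measurementTechnique",
)


def looks_like_uri(value):
    """Return True if value parses as an absolute http(s) URI."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def curator_for(values):
    """Pick a curator label for the string values of one DefinedTerm field.

    Assumption: "DDE" when every value is a URI (the submitter curated the
    term), "PubTator" when any value is free text, None when there is
    nothing to curate.
    """
    if not values:
        return None
    if all(looks_like_uri(v) for v in values):
        return "DDE"
    return "PubTator"
```

The URI check here is deliberately narrow (http/https with a host); if the DDE also emits CURIEs or other URI schemes, the check would need to be widened.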
Related issue: https://github.com/NIAID-Data-Ecosystem/nde-portal/issues/222
Please see https://github.com/NIAID-Data-Ecosystem/nde-crawlers/issues/124 for a detailed explanation and example of the issue mentioned in the previous comment.
The includedInDataCatalog.name appears to be correct, but the includedInDataCatalog.url appears to be swapped.
This is causing a mismatch between the name displayed and the logo.
The improvements are now on staging. They will not be seen in production, as NDE/DDE-ingested resource catalogs are not yet approved for production.
The incorrect metadata does not appear to have been fixed. @jal347 please address and do a fresh pull/build.
For example, the top result for this search: https://data.niaid.nih.gov/search?q=BV-BRC appears to be correct, but the second result (the Dataset) has:
"includedInDataCatalog" : [{"name": "Data Discovery Engine", "url": "https://discovery.biothings.io/dataset/55508b2aa74742a8"},{"name": "NDE Systems Biology", "url": "https://discovery.biothings.io/dataset/55508b2aa74742a8"}]
This is incorrect, as this Dataset was submitted through the NIAID Systems Biology site.
Rather than what it has currently, it should be:
"includedInDataCatalog" : [{"name": "Data Discovery Engine", "url": "https://discovery.biothings.io/dataset/55508b2aa74742a8"},{"name": "Data Discovery Engine, NIAID Systems Biology", "url": "https://discovery.biothings.io/dataset/55508b2aa74742a8"}, {"name": "Data Discovery Engine, NIAID Data Ecosystem", "url": "https://discovery.biothings.io/dataset/55508b2aa74742a8"}]
This way, if we filter by "includedInDataCatalog.name" = "Data Discovery Engine, NIAID Data Ecosystem", we get everything submitted via NDE and NIAID SysBio portals, but if we filter by "includedInDataCatalog.name" = "Data Discovery Engine, NIAID Systems Biology", we only get the records submitted via the NIAID SysBio portal
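To illustrate the filtering behavior described above, here is a small sketch over in-memory records. The helper names and the `_id` values are hypothetical, and the assumption that NDE-portal records also carry a plain "Data Discovery Engine" entry is mine; the real filtering would happen in the portal or Elasticsearch query rather than in Python.

```python
def catalog_names(record):
    """Collect the includedInDataCatalog names of a record as a set."""
    catalogs = record.get("includedInDataCatalog") or []
    if isinstance(catalogs, dict):  # tolerate a single object instead of an array
        catalogs = [catalogs]
    return {c.get("name") for c in catalogs}


def filter_by_catalog(records, name):
    """Keep only records whose includedInDataCatalog.name contains name."""
    return [r for r in records if name in catalog_names(r)]


URL = "https://discovery.biothings.io/dataset/55508b2aa74742a8"

# Hypothetical records: one submitted via the NIAID SysBio portal, one via the NDE portal.
sysbio_record = {
    "_id": "sysbio-example",
    "includedInDataCatalog": [
        {"name": "Data Discovery Engine", "url": URL},
        {"name": "Data Discovery Engine, NIAID Systems Biology", "url": URL},
        {"name": "Data Discovery Engine, NIAID Data Ecosystem", "url": URL},
    ],
}
nde_record = {
    "_id": "nde-example",
    "includedInDataCatalog": [
        {"name": "Data Discovery Engine", "url": URL},
        {"name": "Data Discovery Engine, NIAID Data Ecosystem", "url": URL},
    ],
}
records = [sysbio_record, nde_record]

all_nde = filter_by_catalog(records, "Data Discovery Engine, NIAID Data Ecosystem")
only_sysbio = filter_by_catalog(records, "Data Discovery Engine, NIAID Systems Biology")
```

With these two records, `all_nde` contains both and `only_sysbio` contains only the SysBio-submitted one.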
It looks like what's on staging is behaving correctly. Please apply the fixes from staging to production.
Note that a record has finally been submitted via the CREID portal in the DDE
@jal347, please update the DDE parser such that if the '@context' for a DDE-ingested Dataset includes "creid": "https://discovery.biothings.io/view/creid/", the includedInDataCatalog field is updated to include:
[
{
"@type": "DataCatalog",
"name": "NIAID CREID Network",
"url": <url>
},
{
"@type": "DataCatalog",
"name": "NIAID Data Ecosystem",
"url": <url>
}
]
An example dataset ingested via CREID portal: https://discovery.biothings.io/api/dataset/e2730ebc22a0f38a
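A rough sketch of the requested CREID rule, assuming the record's '@context' is a dict. The function name is hypothetical, and the two catalog URLs are left as parameters because the comment above only gives them as `<url>` placeholders.

```python
import copy

CREID_CONTEXT_KEY = "creid"
CREID_CONTEXT_URL = "https://discovery.biothings.io/view/creid/"


def add_creid_catalogs(record, creid_url, nde_url):
    """Return a copy of record with the CREID and NDE catalogs added.

    Records whose '@context' lacks the creid entry are returned unchanged.
    creid_url and nde_url are caller-supplied; the real values are not
    stated in this issue.
    """
    context = record.get("@context")
    if not isinstance(context, dict):
        return record
    if context.get(CREID_CONTEXT_KEY) != CREID_CONTEXT_URL:
        return record

    updated = copy.deepcopy(record)
    existing = updated.get("includedInDataCatalog") or []
    if isinstance(existing, dict):  # normalize a single object to a list
        existing = [existing]

    new_catalogs = [
        {"@type": "DataCatalog", "name": "NIAID CREID Network", "url": creid_url},
        {"@type": "DataCatalog", "name": "NIAID Data Ecosystem", "url": nde_url},
    ]
    present = {c.get("name") for c in existing}
    for catalog in new_catalogs:
        if catalog["name"] not in present:  # avoid duplicates on re-runs
            existing.append(catalog)

    updated["includedInDataCatalog"] = existing
    return updated
```

Working on a copy keeps the function safe to call more than once on the same input, which may matter if the parser is re-run on a fresh pull.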
Background: In the future, there will be two portals through which metadata records will be ingested into the DDE. The crawler should be updated to reflect this change. We need to be able to distinguish records coming from the NIAID SysBio portal vs records coming from NDE portal.
Rationale: There may be multiple portals added to the DDE in the future for which the records will be ingested into the NDE. We need to be able to filter for datasets that specifically came from the NIAID SysBio portal and distinguish them from datasets coming from a different DDE portal. At the same time, we need to be able to identify all datasets coming from the NDE (of which the NIAID SysBio and other portals may be a part).
Potential approach
includedInDataCatalog is added by the DDE parser to the records ingested into the NDE, and since the field can be an array, the parser should be edited to include an array for NIAID SysBio portal-ingested datasets.
@context value that has the niaid schema, but not the nde schema, as seen below:
@context value with the nde schema, as seen below:
includedInDataCatalog should have a value of: