Open sierra-moxon opened 1 year ago
A couple of additional notes:
COHD and ICEES KG are currently using the "supporting data source" slot to point to the corresponding wiki pages. Those pages capture the underlying source of EHR data (and other underlying data sources), which seems like a more appropriate approach than having separate infores id's / wiki pages for each underlying data source.
Note that MolePro is also using the wiki page to capture information on the underlying data sources.
I think we can keep it (for EHR risk relying on Providence) just like we have the Multiomics Wellness KP relying on the ISB wellness data set. Both have infores and both are useful and valuable. It makes clear what data sources have been incorporated, hopefully preventing redundancy. Same for BigGIM and multiple resources like TCGA, CCLE, etc. I guess I'm not sure I see what the problem is. :)
infores:providence-st-joseph-ehr: This should not be deprecated. It points to a description of the data source with contact information. It happens to be the same URL we use for our KP for the non-science audience.
infores:cpt-codes-umls: This can be deprecated. We will no longer use this.
@karafecho
@sierra-moxon Can infores:providence-st-joseph-ehr be undeprecated?
yep; absolutely.
Update: You can deprecate that URL, this is the new URL with info on Providence data: https://github.com/NCATSTranslator/Translator-All/wiki/EHR-KP-Data
I have updated the infores spreadsheet.
Hi all. The original intent of the infores and retrieval provenance specifications was that infores identifiers would be created for supporting data sources, such as those mentioned above. The RetrievalSource-based modeling pattern in our refactored retrieval provenance model allows for capture of these infores ids as supporting data sources in the TRAPI message, per the specification here (see Scenario 2 in the Data Examples). Doing so provides more consistent and comprehensive provenance regarding where knowledge came from.
It sounds like Multiomics is doing this already (creating infores ids for their supporting data sources, and using these in their TRAPI data). IMO ICEES and COHD should consider updating their representation to do the same. I don't think there are an overwhelming number of supporting data sources being used, so adding inforeses for them should be doable. But correct me if I am mistaken, or if folks have other concerns. @karafecho @CaseyTa @gglusman does this sound reasonable (not for September necessarily, but in the near future)?
As for what happens downstream of capturing infores URIs for supporting data sources in the TRAPI data: the UI team has discussed how they might eventually show these in the interface, but I believe that for now it is relying on the Wiki pages for the primary knowledge source to describe and/or link to pages for these data sources. If the data sources all have inforeses, Wiki pages can be created for them and referenced from knowledge sources like ICEES or COHD that use them. But this can be a longer term evolution.
Thanks for your input, @mbrush.
I think we should discuss your proposal after the September release. While I understand the intent and appreciate the elegance of the proposed solution, I have practical concerns about creating too many infores id's and corresponding wiki pages. For instance, ICEES draws data from more than a dozen supporting data sources. I think it will be challenging enough to maintain an up-to-date wiki page for infores:icees-kg, let alone for all of those supporting data sources. Moreover, I don't think this will be helpful to users and likely will introduce confusion. The current approach of using the supporting data source slot to point to a wiki page with a user-friendly description of the primary knowledge source and supporting data sources seems like a more realistic and user-friendly solution. But, alas, perhaps I can be convinced otherwise. :-)
Although we've changed ours to have a separate URL for now as requested, I agree with Kara. We've found that it's not helpful for people to read about the data without understanding what specific analyses we conducted. It can also be unsettling for people to think that EHR data is used directly in Translator, without full details of what we did, and how. In the future we'll be able to point to a publication with full methods details. We'll end up with the same information on both the data and KP pages. @ARalevski @karafecho
@karafecho - can you clarify what you meant by:
COHD and ICEES KG are currently using the "supporting data source" slot to point to the corresponding wiki pages.
Are you saying that ICEES is using the URL of a wiki page as the value of a supporting data source in TRAPI messages? Can you provide an example of what this looks like in the current ICEES data?
I ask to understand concretely what your data looks like, but also to see if you guys are using the refactored retrieval provenance model, which uses a RetrievalSource object rather than a supporting_data_source slot. My understanding is that all KPs were required to move to this new model with the TRAPI 1.4 release a while back.
Here's what COHD has on its edges:
on each edge:
"sources": [
  {
    "resource_id": "infores:cohd",
    "resource_role": "primary_knowledge_source"
  },
  {
    "resource_id": "infores:columbia-cdw-ehr-data",
    "resource_role": "supporting_data_source"
  }
],
And on each biolink:has_supporting_study_result attribute on each edge (multiple per edge):
"value_url": "https://github.com/NCATSTranslator/Translator-All/wiki/COHD-KP"
Comparing to the data example scenario 2, looks like we're missing the upstream_resource_ids on the infores:cohd source. Is there anything else we need to update for compliance? The example includes a type property, but I don't think that's in TRAPI.
In the TRAPI spec for RetrievalSource, I also see source_record_urls, but I think I previously avoided using this since it sounds like it's intended to point to a page describing a specific edge and not just to a page describing the resource in general. Is that the right interpretation?
COHD's information is not complicated, and we can easily adopt the consensus model.
Update: I think we also had the order of the RetrievalSource list in reverse. I think it should be updated to the below. Please let me know if this looks right.
"sources": [
  {
    "resource_id": "infores:columbia-cdw-ehr-data",
    "resource_role": "supporting_data_source"
  },
  {
    "resource_id": "infores:cohd",
    "resource_role": "primary_knowledge_source",
    "upstream_resource_ids": ["infores:columbia-cdw-ehr-data"]
  }
],
Thanks for this Casey. Your examples look great. This is exactly how I expected 'data-derived' edges from KPs like COHD, ICEES, Multiomics to look.
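A minimal sketch, in Python, of one way to sanity-check a sources list like the one above: it reports any id named in upstream_resource_ids that does not also appear as a resource_id in the same list. The function name dangling_upstream_ids is hypothetical, and treating a dangling id as a problem is an assumption of this sketch rather than a stated TRAPI requirement.

```python
def dangling_upstream_ids(sources):
    """Return upstream ids that have no matching entry in the same sources list."""
    present = {s["resource_id"] for s in sources}
    missing = []
    for s in sources:
        # upstream_resource_ids is optional and may be absent or null
        for upstream in s.get("upstream_resource_ids") or []:
            if upstream not in present:
                missing.append(upstream)
    return missing


# The reordered COHD sources list from this thread
cohd_sources = [
    {
        "resource_id": "infores:columbia-cdw-ehr-data",
        "resource_role": "supporting_data_source",
    },
    {
        "resource_id": "infores:cohd",
        "resource_role": "primary_knowledge_source",
        "upstream_resource_ids": ["infores:columbia-cdw-ehr-data"],
    },
]

print(dangling_upstream_ids(cohd_sources))
```

With the COHD list as written, nothing is reported, since the only upstream id is present as its own entry.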
As I noted above - creating an infores for the supporting data sources does not require you to create/maintain a wiki page for it. But there should be a wiki page for every primary source of an edge (e.g. ICEES-KP, COHD-KP, Multiomics KPs) - which describes and/or links out to info about each supporting data source it draws upon to create its association edges.
Finally, note that the upstream_resource_ids property is nice to have but not required by the spec or TRAPI schema. So you are in compliance without it. The source_record_urls property is likewise not required, and your interpretation of it is correct.
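To make the required-versus-optional distinction concrete, here is a rough Python sketch. It assumes that resource_id and resource_role are the only mandatory fields on a RetrievalSource, and it accepts only the three role values that appear in this thread. The helper name check_retrieval_source and the ROLES_SEEN_IN_THREAD set are illustrative; the TRAPI schema remains the authority.

```python
# Role values that appear in the examples in this thread (an assumption,
# not a copy of the schema's enumeration)
ROLES_SEEN_IN_THREAD = {
    "primary_knowledge_source",
    "aggregator_knowledge_source",
    "supporting_data_source",
}


def check_retrieval_source(source):
    """Return a list of problems; an empty list means the object passes this sketch."""
    problems = []
    for field in ("resource_id", "resource_role"):
        if not source.get(field):
            problems.append(f"missing {field}")
    role = source.get("resource_role")
    if role and role not in ROLES_SEEN_IN_THREAD:
        problems.append(f"unexpected role: {role}")
    # upstream_resource_ids and source_record_urls are deliberately not checked:
    # they are optional, so their absence is not reported
    return problems


print(check_retrieval_source(
    {"resource_id": "infores:cohd", "resource_role": "primary_knowledge_source"}
))
```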
So, apparently, icees-kg diverged from COHD when we moved to Automat.
Here's an example of what icees-kg is providing on edges:
"sources": [
  {
    "resource_id": "infores:icees-kg",
    "resource_role": "primary_knowledge_source",
    "upstream_resource_ids": null,
    "source_record_urls": null
  },
  {
    "resource_id": "infores:automat-icees-kg",
    "resource_role": "aggregator_knowledge_source",
    "upstream_resource_ids": [
      "infores:icees-kg"
    ],
    "source_record_urls": null
  }
{
  "attribute_type_id": "biolink:supporting_data_source",
  "value": "https://github.com/NCATSTranslator/Translator-All/wiki/Exposures-Provider-ICEES",
  "value_type_id": "EDAM:data_0006",
  "original_attribute_name": "biolink:supporting_data_source",
  "value_url": null,
  "attribute_source": null,
  "description": null,
  "attributes": null
},
Note that the wiki page provides a list of all supporting data sources with hyperlinks to corresponding web pages.
We had created infores id's for all of the supporting data sources, but we deprecated them in favor of the ICEES wiki page, which is likely to be more informative (and less alarming / deceptive) to users than, say, an infores id for UNC Health EHR data or US Census Bureau TIGER/line roadway data. Moreover, in some cases, it's not clear what the supporting data source would be, e.g., clinical study datasets.
We have a TCDC meeting on Wednesday at 2 pm ET. I can add this topic to the agenda, if that would be helpful. Just let me know.
Here's a list of the infores supporting data sources for icees-kg:
- id: infores:unc-cdwh-ehr-data
  status: released
  name: UNC Carolina Data Warehouse for Health Patient EHR Data
  xref:
    - https://tracs.unc.edu/index.php/services/informatics-and-data-science/cdw-h
  knowledge level: curated
  agent type: not_provided
- id: infores:niehs-epr-study-datae
  status: deprecated
  name: NIEHS Environmental Polymorphisms Registry
  synonym:
    - NIEHS EPR
  knowledge level: curated
  agent type: not_provided
- id: infores:dili-network-study-data
  status: released
  name: Drug-Induced Liver Injury Network (DILIN) Participant Data
  knowledge level: correlated
  agent type: not_provided
- id: infores:us-epa-airborne-pollutant-exposures-data
  status: deprecated
  name: United States Environmental Protection Agency Airborne Pollutant Exposures Data
  knowledge level: curated
  agent type: not_provided
- id: infores:ncdeq-cafo-exposures-data
  status: deprecated
  name: North Carolina Department of Environmental Quality Concentrated Animal Feeding Operations Exposures Data
  knowledge level: curated
  agent type: not_provided
- id: infores:ncdeq-landfill-exposures-data
  status: deprecated
  name: North Carolina Department of Environmental Quality Landfill Exposures Data
  knowledge level: curated
  agent type: not_provided
- id: infores:nces-schools-exposure-data
  status: released
  name: NCES public school exposures data
  xref:
    - https://nces.ed.gov/
  synonym:
    - NCES Data
  knowledge level: curated
  agent type: not_provided
- id: infores:us-census-acs-data
  status: released
  name: United States Census Bureau American Community Survey Data
  xref:
    - https://www.census.gov/programs-surveys/acs/data.html
  knowledge level: curated
  agent type: not_provided
- id: infores:us-census-tiger-roadway-exposures-data
  status: released
  name: United States Census Bureau TIGER/line Roadway Data
  xref:
    - http://www.census.gov/geo/maps-data/data/tiger-line.html
  knowledge level: curated
  agent type: not_provided
- id: infores:us-dot-roadway-exposures-data
  status: released
  name: United States Department of Transportation Roadway Exposures Data
  xref:
    - https://highways.dot.gov/
  knowledge level: curated
  agent type: not_provided
A few notes: (1) A few supporting data sources are missing infores id's, but the main sources are represented. (2) A few infores id's are missing URLs. (3) The infores id for the NIEHS EPR dataset contains a typo. (4) I thought I marked them all as "to be deprecated", but it looks like only a subset was actually deprecated.
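Notes (2) and (4) could be checked mechanically. The Python sketch below is only illustrative: it hand-transcribes the first three entries from the list above into dicts (keeping just the id, status, and xref fields) and reports the status counts and the ids with no xref. The audit helper is hypothetical and does not read the real infores catalog.

```python
from collections import Counter

# First three entries from the list above, transcribed by hand
entries = [
    {
        "id": "infores:unc-cdwh-ehr-data",
        "status": "released",
        "xref": ["https://tracs.unc.edu/index.php/services/informatics-and-data-science/cdw-h"],
    },
    {"id": "infores:niehs-epr-study-datae", "status": "deprecated"},
    {"id": "infores:dili-network-study-data", "status": "released"},
]


def audit(entries):
    """Count entries per status and collect ids that have no xref URL."""
    status_counts = Counter(e["status"] for e in entries)
    missing_xref = [e["id"] for e in entries if not e.get("xref")]
    return status_counts, missing_xref


status_counts, missing_xref = audit(entries)
print(dict(status_counts))
print(missing_xref)
```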
Hi @karafecho. The following is out of compliance with the latest spec for capturing 'supporting_data_source' metadata.
{
  "attribute_type_id": "biolink:supporting_data_source",
  "value": "https://github.com/NCATSTranslator/Translator-All/wiki/Exposures-Provider-ICEES",
  "value_type_id": "EDAM:data_0006",
  "original_attribute_name": "biolink:supporting_data_source",
  "value_url": null,
  "attribute_source": null,
  "description": null,
  "attributes": null
},
RetrievalSource objects should be used for this info, as in Casey's COHD examples.
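As a rough illustration of how such legacy attributes could be spotted, the Python sketch below scans an edge's attributes for the biolink:supporting_data_source type shown above. The edge dict is a trimmed, hand-built example, and legacy_source_attributes is a hypothetical helper, not part of any Translator tooling.

```python
LEGACY_TYPE = "biolink:supporting_data_source"


def legacy_source_attributes(edge):
    """Return attributes that carry supporting data source info the old way."""
    attributes = edge.get("attributes") or []  # attributes may be absent or null
    return [a for a in attributes if a.get("attribute_type_id") == LEGACY_TYPE]


# Trimmed example edge based on the icees-kg snippet in this thread
edge = {
    "sources": [
        {"resource_id": "infores:icees-kg", "resource_role": "primary_knowledge_source"},
    ],
    "attributes": [
        {
            "attribute_type_id": "biolink:supporting_data_source",
            "value": "https://github.com/NCATSTranslator/Translator-All/wiki/Exposures-Provider-ICEES",
        },
    ],
}

flagged = legacy_source_attributes(edge)
print(len(flagged))
```

The sketch only flags the attribute. It does not try to convert the wiki URL into an infores id, since no such mapping is given in this thread.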
To clarify, is your concern with doing this that you would have to create 10 supporting data source objects on every ICEES edge (one for each of the data sources you list above) . . . because there is no way to tell which subset of these 10 actually provided the data supporting the calculations reported in a given edge?
If this is the case, it may be ok to just leave supporting data sources out of the ICEES edge metadata for now - and rely on the wiki page for the primary source to describe these data sources for users. I don't think the UI is using the Attribute object above for anything anyway.
Alternatively, we could make the 'RetrievalSource.resource_id' field multivalued in the TRAPI spec, so you could just create a list of all supporting data inforeses in a single object.
Let me know what you think - happy to hop on the CDWG call Wednesday if you'd like to discuss, or raise this concern on the TRAPI call this week.
Thanks, @mbrush. I'll work with the Automat folks to get icees-kg back in compliance with the latest TRAPI spec.
WRT my concerns regarding supporting data sources, I actually have several.
The first is the one you point out, which is that we would need to create 10+ supporting data source objects on every icees-kg edge. We could tailor these to specific edges, but that would entail a lot of work, and I'm not sure it would simplify things. In terms of a solution, I think that your proposal to make the 'RetrievalSource.resource_id' field multivalued seems like the optimal long-term solution, but I think that removing the supporting data sources is probably the best short-term solution.
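For a sense of what the one-object-per-source option involves under the current single-valued resource_id, here is a small Python sketch. It builds one supporting_data_source entry per infores id and lists them all as upstream of the primary source. The build_sources helper is hypothetical, and the two supporting ids are just a sample from the list above, not a claim about which sources support any given edge.

```python
def build_sources(primary_id, supporting_ids):
    """Build a sources list with one RetrievalSource per supporting data source."""
    sources = [
        {"resource_id": sid, "resource_role": "supporting_data_source"}
        for sid in supporting_ids
    ]
    sources.append(
        {
            "resource_id": primary_id,
            "resource_role": "primary_knowledge_source",
            "upstream_resource_ids": list(supporting_ids),
        }
    )
    return sources


icees_sources = build_sources(
    "infores:icees-kg",
    ["infores:unc-cdwh-ehr-data", "infores:us-census-acs-data"],
)
print(len(icees_sources))
```

Every additional supporting source adds one more object to every edge, which is the overhead described above.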
My second concern relates to the comments that Jenn and I both made, which is that it may be confusing / alarming / deceiving to provide users with supporting data information for sources such as EHR data, or roadway data, or socioeconomic data, etc. without also providing context.
My third concern is that it is sometimes unclear what the supporting data source actually is. For instance, how would we reference a clinical study dataset?
Thanks for offering to join Wednesday's TCDC call. It sounds like this is primarily an Exposures Provider issue, so I'm not sure it makes sense to add this to the agenda, but I'll defer to @jh111 and @CaseyTa.
@karafecho I'm okay with adding to the agenda.
Will do! Thanks, Casey.
@mbrush : I've asked Tursynay to send you an invite to tomorrow's 2 pm ET TCDC meeting. Hope this time slot still works for you.
Decision per TCDC meeting, August 16, 2023:
@mbrush @sierra-moxon : Does this sound right?
Yes, thanks for summarizing @karafecho. I added this issue to the TRAPI repo here with specific items to discuss on their end post-September.
Multiomics has infores id's for their underlying source of EHR data, which are not directly exposed in Translator. COHD and ICEES do not, but rather capture this information on the wiki pages. Which is the correct convention here?
related: we need an attribute for the specific page at an infores that shows evidence for an edge. we currently have: