How can we harmonize supporting data sources?

sierra-moxon commented 1 year ago

Multiomics has infores id's for their underlying source of EHR data, which are not directly exposed in Translator. COHD and ICEES do not, but rather capture this information on the wiki pages. Which is the correct convention here?

related: we need an attribute for the specific page at an infores that shows evidence for an edge. we currently have:

xref
supporting data set
web page

and it's not clear which to use.

karafecho commented 1 year ago

A couple of additional notes:

COHD and ICEES KG are currently using the "supporting data source" slot to point to the corresponding wiki pages. Those pages capture the underlying source of EHR data (and other underlying data sources), which seems like a more appropriate approach than having separate infores id's / wiki pages for each underlying data source.

Note that Mole Pro is also using the wiki page to capture information on the underlying data sources.

gglusman commented 1 year ago

I think we can keep it (for EHR risk relying on Providence) just like we have the Multiomics Wellness KP relying on the ISB wellness data set. Both have infores and both are useful and valuable. It makes clear what data sources have been incorporated, hopefully preventing redundancy. Same for BigGIM and multiple resources like TCGA, CCLE, etc. I guess I'm not sure I see what the problem is. :)

jh111 commented 1 year ago

infores:providence-st-joseph-ehr: This should not be deprecated. It points to a description of the data source with contact information. It happens to be the same URL we use for our KP for the non-science audience.

infores:cpt-codes-umls: This can be deprecated. We will no longer use this.

@karafecho

jh111 commented 1 year ago

@sierra-moxon Can infores:providence-st-joseph-ehr be undeprecated?

sierra-moxon commented 1 year ago

yep; absolutely.

ARalevski commented 1 year ago

Update: You can deprecate that URL, this is the new URL with info on Providence data: https://github.com/NCATSTranslator/Translator-All/wiki/EHR-KP-Data

I have updated the infores spreadsheet.

mbrush commented 1 year ago

Hi all. The original intent of the infores and retrieval provenance specifications was that infores identifiers would be created for supporting data sources, such as those mentioned above. The RetrievalSource-based modeling pattern in our refactored retrieval provenance model allows for capture of these infores ids as supporting data sources in the TRAPI message, per the specification here (see Scenario 2 in the Data Examples). Doing so provides more consistent and comprehensive provenance regarding where knowledge came from.

It sounds like Multiomics is doing this already (creating infores ids for their supporting data sources, and using these in their TRAPI data). IMO ICEES and COHD should consider updating their representation to do the same. I don't think there are an overwhelming number of supporting data sources being used - so adding inforeses for them should be doable. But correct me if I am mistaken, or if folks have other concerns. @karafecho @CaseyTa @gglusman does this sound reasonable (not for September necessarily, but in the near future)?

As for what happens downstream of capturing infores URIs for supporting data sources in the TRAPI data, the UI team has discussed how they might eventually show these in the interface, but I believe for now it is relying on the Wiki pages for the primary knowledge source to describe and/or link to pages for these data sources. If the data sources all have inforeses, Wiki pages can be created for them and referenced from knowledge sources like ICEES or COHD that use them. But this can be a longer term evolution.

karafecho commented 1 year ago

Thanks for your input, @mbrush.

I think we should discuss your proposal after the September release. While I understand the intent and appreciate the elegance of the proposed solution, I have practical concerns about creating too many infores id's and corresponding wiki pages. For instance, ICEES draws data from more than a dozen supporting data sources. I think it will be challenging enough to maintain an up-to-date wiki page for infores:icees-kg, let alone for all of those supporting data sources. Moreover, I don't think this will be helpful to users and likely will introduce confusion. The current approach of using the supporting data source slot to point to a wiki page with a user-friendly description of the primary knowledge source and supporting data sources seems like a more realistic and user-friendly solution. But, alas, perhaps I can be convinced otherwise. :-)

jh111 commented 1 year ago

Although we've changed ours to have a separate URL for now as requested, I agree with Kara. We've found that it's not helpful for people to read about the data without understanding what specific analyses we conducted. It can also be unsettling for people to think that EHR data is used directly in Translator, without full details of what we did, and how. In the future we'll be able to point to a publication with full methods details. We'll end up with the same information on both the data and KP pages. @ARalevski @karafecho

mbrush commented 1 year ago

@karafecho - can you clarify what you meant by:

COHD and ICEES KG are currently using the "supporting data source" slot to point to the corresponding wiki pages.

Are you saying that ICEES is using the URL of a wiki page as the value of a supporting data source in TRAPI messages? Can you provide an example of what this looks like in the current ICEES data?

I ask to understand concretely what your data looks like, but also to see if you guys are using the refactored retrieval provenance model which uses a RetrievalSource object rather than a supporting_data_source slot. My understanding is that all KPs were required to move to this new model with the TRAPI 1.4 release a while back.

CaseyTa commented 1 year ago

Here's what COHD has on its edges:

on each edge:

                    "sources": [
                        {
                            "resource_id": "infores:cohd",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:columbia-cdw-ehr-data",
                            "resource_role": "supporting_data_source"
                        }
                    ],

And on each biolink:has_supporting_study_result attribute on each edge (multiple per edge): "value_url": "https://github.com/NCATSTranslator/Translator-All/wiki/COHD-KP"

Comparing to the data example scenario 2, looks like we're missing the upstream_resource_ids on infores:cohd source. Is there anything else we need to update for compliance? The example includes a type property, but I don't think that's in TRAPI.

In the TRAPI spec for RetrievalSource, I also see source_record_urls, but I think I previously avoided using this since it sounds like it's intended to point to a page describing a specific edge and not just to a page describing the resource in general. Is that the right interpretation?

COHD's information is not complicated, and we can easily adopt the consensus model.

Update: I think we also had the order of the RetrievalSource list in reverse. I think it should be updated to the below. Please let me know if this looks right.

                    "sources": [
                        {
                            "resource_id": "infores:columbia-cdw-ehr-data",
                            "resource_role": "supporting_data_source"
                        },
                        {
                            "resource_id": "infores:cohd",
                            "resource_role": "primary_knowledge_source",
                            "upstream_resource_ids": ["infores:columbia-cdw-ehr-data"]
                        }
                    ],

mbrush commented 1 year ago

Thanks for this Casey. Your examples look great. This is exactly how I expected 'data-derived' edges from KPs like COHD, ICEES, Multiomics to look.

As I noted above - creating an infores for the supporting data sources does not require you to create/maintain a wiki page for it. But there should be a wiki page for every primary source of an edge (e.g. ICEES-KP, COHD-KP, Multiomics KPs) - which describes and/or links out to info about each supporting data source it draws upon to create its association edges.

Finally, note that the upstream_resource_ids property is nice to have but not required by the spec or TRAPI schema. So you are in compliance without it. The source_record_urls property is likewise not required, and your interpretation of it is correct.
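For illustration only, here is a minimal Python sketch of the kind of sanity check discussed above, run against Casey's updated COHD `sources` list. The function name `check_sources` and the two rules it applies (exactly one primary knowledge source, and every upstream id resolving to another entry in the same list) are assumptions made for this example; this is not the official TRAPI validator, and a missing `upstream_resource_ids` is deliberately accepted.

```python
def check_sources(sources):
    """Return a list of problems found in an edge's `sources` list.

    Hypothetical helper: the rules below are assumptions for illustration,
    not the official TRAPI validation logic.
    """
    problems = []
    known_ids = {s.get("resource_id") for s in sources}

    primaries = [
        s for s in sources
        if s.get("resource_role") == "primary_knowledge_source"
    ]
    if len(primaries) != 1:
        problems.append(
            f"expected exactly one primary_knowledge_source, found {len(primaries)}"
        )

    for s in sources:
        # upstream_resource_ids is optional: a missing or null value is accepted.
        for upstream in s.get("upstream_resource_ids") or []:
            if upstream not in known_ids:
                problems.append(
                    f"{s.get('resource_id')} lists unknown upstream {upstream}"
                )
    return problems


# The updated COHD example from this thread.
cohd_sources = [
    {
        "resource_id": "infores:columbia-cdw-ehr-data",
        "resource_role": "supporting_data_source",
    },
    {
        "resource_id": "infores:cohd",
        "resource_role": "primary_knowledge_source",
        "upstream_resource_ids": ["infores:columbia-cdw-ehr-data"],
    },
]

print(check_sources(cohd_sources))
```

Dropping `upstream_resource_ids` from the `infores:cohd` entry would still pass this check, matching the point that the property is optional.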

karafecho commented 1 year ago

So, apparently, icees-kg diverged from COHD when we moved to Automat.

Here's an example of what icees-kg is providing on edges:

      "sources": [
        {
          "resource_id": "infores:icees-kg",
          "resource_role": "primary_knowledge_source",
          "upstream_resource_ids": null,
          "source_record_urls": null
        },
        {
          "resource_id": "infores:automat-icees-kg",
          "resource_role": "aggregator_knowledge_source",
          "upstream_resource_ids": [
            "infores:icees-kg"
          ],
          "source_record_urls": null
        }
      ],
      "attributes": [
        {
          "attribute_type_id": "biolink:supporting_data_source",
          "value": "https://github.com/NCATSTranslator/Translator-All/wiki/Exposures-Provider-ICEES",
          "value_type_id": "EDAM:data_0006",
          "original_attribute_name": "biolink:supporting_data_source",
          "value_url": null,
          "attribute_source": null,
          "description": null,
          "attributes": null
        },

Note that the wiki page provides a list of all supporting data sources with hyperlinks to corresponding web pages.

We had created infores id's for all of the supporting data sources, but we deprecated them in favor of the ICEES wiki page, which is likely to be more informative (and less alarming / deceptive) to users than, say, an infores id for UNC Health EHR data or US Census Bureau TIGER/line roadway data. Moreover, in some cases, it's not clear what the supporting data source would be, e.g., clinical study datasets.

We have a TCDC meeting on Wednesday at 2 pm ET. I can add this topic to the agenda, if that would be helpful. Just let me know.

karafecho commented 1 year ago

Here's a list of the infores supporting data sources for icees-kg:

  - id: infores:unc-cdwh-ehr-data
    status: released
    name: UNC Carolina Data Warehouse for Health Patient EHR Data
    xref:
      - https://tracs.unc.edu/index.php/services/informatics-and-data-science/cdw-h
    knowledge level: curated
    agent type: not_provided

  - id: infores:niehs-epr-study-datae
    status: deprecated
    name: NIEHS Environmental Polymorphisms Registry
    synonym:
      - NIEHS EPR
    knowledge level: curated
    agent type: not_provided

  - id: infores:dili-network-study-data
    status: released
    name: Drug-Induced Liver Injury Network (DILIN) Participant Data
    knowledge level: correlated
    agent type: not_provided

  - id: infores:us-epa-airborne-pollutant-exposures-data
    status: deprecated
    name: United States Environmental Protection Agency Airborne Pollutant Exposures Data
    knowledge level: curated
    agent type: not_provided

  - id: infores:ncdeq-cafo-exposures-data
    status: deprecated
    name: North Carolina Department of Environmental Quality Concentrated Animal Feeding Operations Exposures Data
    knowledge level: curated
    agent type: not_provided

  - id: infores:ncdeq-landfill-exposures-data
    status: deprecated
    name: North Carolina Department of Environmental Quality Landfill Exposures Data
    knowledge level: curated
    agent type: not_provided

  - id: infores:nces-schools-exposure-data
    status: released
    name: NCES public school exposures data
    xref:
      - https://nces.ed.gov/
    synonym:
      - NCES Data
    knowledge level: curated
    agent type: not_provided

  - id: infores:us-census-acs-data
    status: released
    name: United States Census Bureau American Community Survey Data
    xref:
      - https://www.census.gov/programs-surveys/acs/data.html
    knowledge level: curated
    agent type: not_provided

  - id: infores:us-census-tiger-roadway-exposures-data
    status: released
    name: United States Census Bureau TIGER/line Roadway Data
    xref:
      - http://www.census.gov/geo/maps-data/data/tiger-line.html
    knowledge level: curated
    agent type: not_provided

  - id: infores:us-dot-roadway-exposures-data
    status: released
    name: United States Department of Transportation Roadway Exposures Data
    xref:
      - https://highways.dot.gov/
    knowledge level: curated
    agent type: not_provided

A few notes: (1) A few supporting data sources are missing infores id's, but the main sources are represented. (2) A few infores id's are missing URLs. (3) The infores id for the NIEHS EPR dataset contains a typo. (4) I thought I marked them all as "to be deprecated", but it looks like only a subset was actually deprecated.
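Notes (2) and (4) can be checked mechanically. The sketch below is only an illustration: it copies the `id`, `status`, and `xref` fields from the list above into plain Python dicts (rather than reading the real infores catalog file), and the helper names `deprecated_ids` and `missing_xref_ids` are made up for this example.

```python
# Entries transcribed from the list above; only id, status and xref are kept.
catalog = [
    {"id": "infores:unc-cdwh-ehr-data", "status": "released",
     "xref": ["https://tracs.unc.edu/index.php/services/informatics-and-data-science/cdw-h"]},
    {"id": "infores:niehs-epr-study-datae", "status": "deprecated"},
    {"id": "infores:dili-network-study-data", "status": "released"},
    {"id": "infores:us-epa-airborne-pollutant-exposures-data", "status": "deprecated"},
    {"id": "infores:ncdeq-cafo-exposures-data", "status": "deprecated"},
    {"id": "infores:ncdeq-landfill-exposures-data", "status": "deprecated"},
    {"id": "infores:nces-schools-exposure-data", "status": "released",
     "xref": ["https://nces.ed.gov/"]},
    {"id": "infores:us-census-acs-data", "status": "released",
     "xref": ["https://www.census.gov/programs-surveys/acs/data.html"]},
    {"id": "infores:us-census-tiger-roadway-exposures-data", "status": "released",
     "xref": ["http://www.census.gov/geo/maps-data/data/tiger-line.html"]},
    {"id": "infores:us-dot-roadway-exposures-data", "status": "released",
     "xref": ["https://highways.dot.gov/"]},
]


def deprecated_ids(entries):
    """Ids of entries whose status is 'deprecated'."""
    return [e["id"] for e in entries if e.get("status") == "deprecated"]


def missing_xref_ids(entries):
    """Ids of entries with no xref URL at all."""
    return [e["id"] for e in entries if not e.get("xref")]


print("deprecated:", deprecated_ids(catalog))
print("missing xref:", missing_xref_ids(catalog))
```

The id with the typo is kept exactly as it appears in the list so that the output matches the catalog entries shown here.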

mbrush commented 1 year ago

Hi @karafecho . The following is out of compliance with the latest spec for capturing 'supporting_data_source' metadata.

        {
          "attribute_type_id": "biolink:supporting_data_source",
          "value": "https://github.com/NCATSTranslator/Translator-All/wiki/Exposures-Provider-ICEES",
          "value_type_id": "EDAM:data_0006",
          "original_attribute_name": "biolink:supporting_data_source",
          "value_url": null,
          "attribute_source": null,
          "description": null,
          "attributes": null
        },

RetrievalSource objects should be used for this info - as in Casey's COHD examples.

To clarify, is your concern with doing this that you would have to create 10 supporting data source objects on every ICEES edge (one for each of the data sources you list above) . . . because there is no way to tell which subset of these 10 actually provided the data supporting the calculations reported in a given edge?

If this is the case, it may be ok to just leave supporting data sources out of the ICEES edge metadata for now - and rely on the wiki page for the primary source to describe these data sources for users. I don't think the UI is using the Attribute object above for anything anyway.
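As a rough sketch of that short-term option, the snippet below drops any Attribute whose `attribute_type_id` is `biolink:supporting_data_source` from an edge and leaves `sources` untouched. The function name, the in-memory edge dict, and the second attribute (`biolink:p_value` with a made-up value) are hypothetical and only there to show that other attributes survive; how Automat would actually apply such a change is not covered here.

```python
def drop_supporting_data_source_attributes(edge):
    """Return a shallow copy of `edge` without supporting_data_source attributes.

    Hypothetical helper for illustration; the input edge is not modified.
    """
    kept = [
        attr for attr in edge.get("attributes") or []
        if attr.get("attribute_type_id") != "biolink:supporting_data_source"
    ]
    return {**edge, "attributes": kept}


edge = {
    "sources": [
        {"resource_id": "infores:icees-kg",
         "resource_role": "primary_knowledge_source"},
        {"resource_id": "infores:automat-icees-kg",
         "resource_role": "aggregator_knowledge_source",
         "upstream_resource_ids": ["infores:icees-kg"]},
    ],
    "attributes": [
        {
            "attribute_type_id": "biolink:supporting_data_source",
            "value": "https://github.com/NCATSTranslator/Translator-All/wiki/Exposures-Provider-ICEES",
            "value_type_id": "EDAM:data_0006",
        },
        # Hypothetical extra attribute, included only to show it is preserved.
        {"attribute_type_id": "biolink:p_value", "value": 0.01},
    ],
}

cleaned = drop_supporting_data_source_attributes(edge)
print([a["attribute_type_id"] for a in cleaned["attributes"]])
```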

Alternatively, we could make the 'RetrievalSource.resource_id' field multivalued in the TRAPI spec - so you could just create a list of all supporting data inforeses in a single object.

Let me know what you think - happy to hop on the CDWG call Wednesday if you'd like to discuss, or raise this concern on the TRAPI call this week.

karafecho commented 1 year ago

Thanks, @mbrush. I'll work with the Automat folks to get icees-kg back in compliance with the latest TRAPI spec.

WRT my concerns regarding supporting data sources, I actually have several.

The first is the one you point out, which is that we would need to create 10+ supporting data source objects on every icees-kg edge. We could tailor these to specific edges, but that would entail a lot of work, and I'm not sure it would simplify things. In terms of a solution, I think that your proposal to make the 'RetrievalSource.resource_id' field multivalued seems like the optimal long-term solution, but I think that removing the supporting data sources is probably the best short-term solution.
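To make the first concern concrete, here is a small sketch of what attaching every supporting data source to an edge would involve under the current single-valued `resource_id`, following the COHD pattern shown earlier in this thread. The helper `build_sources` is hypothetical, and only three of the icees-kg infores ids are used to keep the example short; with the full list, each edge would carry one object per data source plus the primary source.

```python
def build_sources(primary_id, supporting_ids):
    """Build a `sources` list with one RetrievalSource-style dict per supporting source.

    Hypothetical helper for illustration, mirroring the COHD example:
    supporting sources first, then the primary source pointing upstream to them.
    """
    sources = [
        {"resource_id": sid, "resource_role": "supporting_data_source"}
        for sid in supporting_ids
    ]
    sources.append(
        {
            "resource_id": primary_id,
            "resource_role": "primary_knowledge_source",
            "upstream_resource_ids": list(supporting_ids),
        }
    )
    return sources


# A subset of the icees-kg supporting data sources listed earlier.
supporting = [
    "infores:unc-cdwh-ehr-data",
    "infores:us-census-acs-data",
    "infores:us-dot-roadway-exposures-data",
]

icees_sources = build_sources("infores:icees-kg", supporting)
print(len(icees_sources), "source objects on this edge")
```

The per-edge object count grows linearly with the number of supporting data sources, which is the overhead a multivalued `resource_id` would avoid.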

My second concern relates to the comments that Jenn and I both made, which is that it may be confusing / alarming / deceiving to provide users with supporting data information for sources such as EHR data, or roadway data, or socioeconomic data, etc. without also providing context.

My third concern is that it is sometimes unclear what the supporting data source actually is. For instance, how would we reference a clinical study dataset?

Thanks for offering to join Wednesday's TCDC call. It sounds like this is primarily an Exposures Provider issue, so I'm not sure it makes sense to add this to the agenda, but I'll defer to @jh111 and @CaseyTa.

CaseyTa commented 1 year ago

@karafecho I'm okay with adding to the agenda.

karafecho commented 1 year ago

Will do! Thanks, Casey.

karafecho commented 1 year ago

@mbrush : I've asked Tursynay to send you an invite to tomorrow's 2 pm ET TCDC meeting. Hope this time slot still works for you.

karafecho commented 1 year ago

Decision per TCDC meeting, August 16, 2023:

Pre September release, Kara and Max will remove the non-compliant attribute block from icees-kg.
Pre September release, Kara will confirm that all icees-kg supporting data sources in the infores catalog are complete, and she will undeprecate sources that have been deprecated.
Pre September release, Kara will confirm that each of the supporting data sources linked out from the icees-kg wiki page has a concise but informative description.
Post September release, Matt will add two related topics to the agenda for an upcoming TRAPI WG meeting: (a) discuss proposal from icees-kg team to make the 'RetrievalSource.resource_id' field multivalued; and (b) consider whether supporting data sources should ever be exposed to users and, if so, define approaches for ensuring that users have sufficient information to properly interpret the sources.

@mbrush @sierra-moxon : Does this sound right?

mbrush commented 1 year ago

Yes, thanks for summarizing @karafecho. I added this issue to the TRAPI repo here with specific items to discuss on their end post-September.

biolink / biolink-model

How can we harmonize supporting data sources? #1352