Bioconductor / GenomicDataCommons

Provide R access to the NCI Genomic Data Commons portal.
http://bioconductor.github.io/GenomicDataCommons/

Slide file query shows file incorrectly associated with multiple slides when GDC portal shows it's associated with only one #95

Closed hermidalc closed 2 years ago

hermidalc commented 2 years ago

For example:

library(GenomicDataCommons)

gdc_query <-
    files() %>%
    GenomicDataCommons::filter(
        cases.project.project_id == "TCGA-BRCA"
        & cases.samples.sample_type == "Primary Tumor"
        & data_type == "Slide Image"
        & file_id == "a9d0c0a1-6cf2-4e1a-bcce-48d705d50809"
    ) %>%
    GenomicDataCommons::select(c(
        "file_name",
        "cases.samples.portions.portion_id",
        "cases.samples.portions.submitter_id",
        "cases.samples.portions.slides.slide_id",
        "cases.samples.portions.slides.submitter_id"
    ))
gdc_results <- results_all(gdc_query)
gdc_results$cases[[1]]$samples[[1]]$portions[[1]]$slides

Shows:

[[1]]
             submitter_id                             slide_id
1 TCGA-A8-A06X-01A-02-MS2 0554a423-cfcb-4daa-9c57-dd3960aa2614
2 TCGA-A8-A06X-01A-02-BS2 eaf95778-4d99-4a85-bff2-57e2c66bebac

[[2]]
             submitter_id                             slide_id
1 TCGA-A8-A06X-01A-02-BS2 eaf95778-4d99-4a85-bff2-57e2c66bebac
2 TCGA-A8-A06X-01A-02-MS2 0554a423-cfcb-4daa-9c57-dd3960aa2614

But the GDC data portal https://portal.gdc.cancer.gov/files/a9d0c0a1-6cf2-4e1a-bcce-48d705d50809 shows that the slide image file is only associated with TCGA-A8-A06X-01A-02-MS2

Is there some insider information about the GDC API that I'm missing here? Unless there is some additional field to filter on that reproduces the relationship shown in the data portal, this is kind of worrisome. I checked the state of the slides and portions in question, and both show as released.

LiNk-NY commented 2 years ago

Hi Leandro (@hermidalc), this is not an error. It is a feature of the API. When querying the cases side of the API, you have to restrict the query somehow so that you don't get all of the possible files under a particular case. That's what you're seeing here. I'm not sure what you're looking for, but you can run this query to get information associated only with the file:

gdc_query <- files() %>% 
    GenomicDataCommons::filter(
        file_name == "TCGA-A8-A06X-01A-02-MS2.0554a423-cfcb-4daa-9c57-dd3960aa2614.svs"
    )
res <- results_all(gdc_query)

hermidalc commented 2 years ago

Thank you very much, though that was just an example for the issue post. My real query is for all the slide image files for a cancer and their parent portions. It's weird that the same slide can be associated with two different portions. That doesn't make sense.

LiNk-NY commented 2 years ago

I think when you query against the cases side, you will get any slides corresponding to the case UUID.

Btw @LiNk-NY, may I ask how to do file API queries using file name regex patterns, or, even easier, substrings ""?

I am not sure that is possible. You'd have to ask the GDC directly.

Best, Marcel
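
If the API turns out not to support pattern matching on file names, one fallback is to fetch the file names and filter them locally. The following is only a minimal sketch in base R, not a GDC API feature. It uses mock file names: the first appears in this thread, and the second is made up to follow the same pattern.

```r
# Mock file names; in practice these would come from the query results
# (assumed here to be available as a character vector).
file_names <- c(
    "TCGA-A8-A06X-01A-02-MS2.0554a423-cfcb-4daa-9c57-dd3960aa2614.svs",
    "TCGA-A8-A06X-01A-02-BS2.eaf95778-4d99-4a85-bff2-57e2c66bebac.svs"
)

# Substring match: fixed = TRUE treats the pattern as literal text.
has_ms2 <- grepl("-MS2.", file_names, fixed = TRUE)

# Regex match: slide codes starting with BS or TS, followed by digits and a dot.
is_bs_or_ts <- grepl("-(BS|TS)[0-9]+\\.", file_names)

ms2_files <- file_names[has_ms2]
```

This filters after download, so it only helps when the unfiltered result set is small enough to fetch in full.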

hermidalc commented 2 years ago

Actually @LiNk-NY, I didn't realize you wrote that I was querying from the cases side. As you can see in the OP, I've always been querying from the files side. As I showed above, you get representations that don't make sense in the GenomicDataCommons results, e.g.:

gdc_query <-
    files() %>%
    GenomicDataCommons::filter(
        cases.project.project_id == "TCGA-BRCA"
        & cases.samples.sample_type == "Primary Tumor"
        & data_type == "Slide Image"
        & experimental_strategy == "Diagnostic Slide"
    ) %>%
    GenomicDataCommons::select(c(
        "file_name",
        "cases.samples.portions.slides.slide_id",
        "cases.samples.portions.slides.submitter_id"
    ))
gdc_results <- results_all(gdc_query)
sapply(sapply(sapply(gdc_results$cases, `[[`, "samples"), `[[`, "portions"), `[[`, "slides")

Here's an example from the output. Why would slide image file 17ec20fd-267d-4fb4-b236-822a7934c9f1 be shown as associated with two slides when it's really only associated with the first one?

$`17ec20fd-267d-4fb4-b236-822a7934c9f1`
[[1]]
             submitter_id                             slide_id
1 TCGA-D8-A27V-01Z-00-DX1 9c2d8cab-1120-4e1b-9715-0d2dbfd7cab0

[[2]]
             submitter_id                             slide_id
1 TCGA-D8-A27V-01Z-00-DX2 444aa9cd-113d-451b-a747-3c85cd89a36d

seandavi commented 2 years ago

Hi, @hermidalc. You'll need to check with the GDC folks about how to do the query you are asking for. I'm not sure that it is supported by the API. I don't see an obvious analog in the Portal UI, so they may not have that capability in the REST API.

More specifically, when you use the path cases-->samples-->portions-->slides-->slide_id, you get all the slides for the case, not the slides for the file that you started with. I may not have had enough coffee, but I don't see the path to get from the file back to the single analogous slide using the API.
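
To make the path above concrete, here is a rough sketch that walks samples, portions, and slides for one file's case entry and collects every slide it finds. The helper name `slides_for_case` is hypothetical, and the nesting (data frames with `samples`, `portions`, and `slides` list-columns) is an assumption inferred from the output in the original post. The mock data reuses the two slides shown there; the placeholder columns are made up.

```r
# Hypothetical helper: gather all slides reachable from one case entry.
# Assumes: case_df$samples is a list of data frames, each with a `portions`
# list-column of data frames, each with a `slides` list-column of data frames.
slides_for_case <- function(case_df) {
    out <- list()
    for (samples in case_df$samples) {
        for (portions in samples$portions) {
            for (slides in portions$slides) {
                out[[length(out) + 1]] <- slides
            }
        }
    }
    # Stack the per-portion tables and drop repeated rows.
    unique(do.call(rbind, out))
}

# Mock structure mirroring the output in the original post.
slide_rows <- data.frame(
    submitter_id = c("TCGA-A8-A06X-01A-02-MS2", "TCGA-A8-A06X-01A-02-BS2"),
    slide_id = c(
        "0554a423-cfcb-4daa-9c57-dd3960aa2614",
        "eaf95778-4d99-4a85-bff2-57e2c66bebac"
    )
)
portions <- data.frame(portion = c(1, 2))   # placeholder column
portions$slides <- list(slide_rows, slide_rows[2:1, ])
samples <- data.frame(sample = 1)           # placeholder column
samples$portions <- list(portions)
case_df <- data.frame(case = 1)             # placeholder column
case_df$samples <- list(samples)

all_slides <- slides_for_case(case_df)
```

Under these assumptions, `all_slides` holds both slides of the case, which is why the file appears to be linked to more than one slide.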

hermidalc commented 2 years ago

Thank you @seandavi, no worries. I understand better now that there are limitations in the GDC graph DB structure with respect to the API queries I'm asking about. I can interrogate the results and post-process them on my end to get what I want.

hermidalc commented 2 years ago

Hi @seandavi and @LiNk-NY, I asked the GDC how they are able to show only the correctly associated slide on their data portal web page (and not all the slides for that case). You can do it by adding associated_entities.entity_submitter_id to the API query, which contains only the associated slide's submitter ID:

gdc_query <-
    files() %>%
    GenomicDataCommons::filter(
        cases.project.project_id == "TCGA-BRCA"
        & cases.samples.sample_type == "Primary Tumor"
        & data_type == "Slide Image"
        & experimental_strategy == "Diagnostic Slide"
    ) %>%
    GenomicDataCommons::select(c(
        "file_name",
        "associated_entities.entity_submitter_id",
        "cases.samples.portions.slides.slide_id",
        "cases.samples.portions.slides.submitter_id"
    ))
gdc_results <- results_all(gdc_query)

Then post-process the data structure, filtering out any cases.samples.portions.slides.submitter_id that does not equal associated_entities.entity_submitter_id.
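
The post-processing step could look roughly like the sketch below. The helper name `keep_associated_slides` is hypothetical, and the input shape (a data frame of slides, or a list of them, plus a character vector of entity submitter IDs) is an assumption about how the results are unpacked. The mock rows are copied from the output earlier in this thread.

```r
# Hypothetical helper: keep only slide rows whose submitter_id matches the
# file's associated entity submitter ID(s).
keep_associated_slides <- function(slides, entity_submitter_ids) {
    # Accept a single data frame or a list of data frames (one per portion).
    if (is.data.frame(slides)) slides <- list(slides)
    slides <- do.call(rbind, slides)
    keep <- slides$submitter_id %in% entity_submitter_ids
    unique(slides[keep, , drop = FALSE])
}

# Mock slides for file 17ec20fd-267d-4fb4-b236-822a7934c9f1.
slides <- list(
    data.frame(
        submitter_id = "TCGA-D8-A27V-01Z-00-DX1",
        slide_id = "9c2d8cab-1120-4e1b-9715-0d2dbfd7cab0"
    ),
    data.frame(
        submitter_id = "TCGA-D8-A27V-01Z-00-DX2",
        slide_id = "444aa9cd-113d-451b-a747-3c85cd89a36d"
    )
)

# Assumed value of associated_entities.entity_submitter_id for this file.
matched <- keep_associated_slides(slides, "TCGA-D8-A27V-01Z-00-DX1")
```

Matching on submitter ID rather than position avoids depending on the order in which the API returns portions or slides.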

LiNk-NY commented 2 years ago

Thanks for checking back in, @hermidalc! I've updated my convenience function in TCGAutils::filenameToBarcode to work with slides and this query. Best, Marcel