hermidalc closed this issue 2 years ago
Hi Leandro (@hermidalc), this is not an error. It is a feature of the API. When querying the cases side of the API, you have to restrict the query somehow so that you don't get all of the possible files under a particular case. That's what you're seeing here. I'm not sure what you're looking for, but you can run this query to get information associated only with the file:
gdc_query <- files() %>%
    GenomicDataCommons::filter(
        file_name == "TCGA-A8-A06X-01A-02-MS2.0554a423-cfcb-4daa-9c57-dd3960aa2614.svs"
    )
res <- results_all(gdc_query)
Thank you very much, though that was just an example for the issue post. My real query is for all the slide image files for a cancer and their parent portions. It's weird that the same slide can be associated with two different portions; that doesn't make sense.
I think when you query against the cases side, you will get any slides corresponding to the case UUID.
Btw @LiNk-NY, may I ask how I can do file API queries using file name regex patterns or, even easier, substrings like ""?
I am not sure that is possible. You'd have to ask the GDC directly.
Best, Marcel
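If the API turns out not to support pattern matching on file names, one possible fallback is to retrieve the file names first and filter them client-side in R. Below is a minimal sketch using base R's grepl(). The vector file_names is a hypothetical stand-in for the file_name field of a results_all() result, and its second entry is made up for illustration.

```r
# Hypothetical stand-in for the file names returned by a files() query.
file_names <- c(
    "TCGA-A8-A06X-01A-02-MS2.0554a423-cfcb-4daa-9c57-dd3960aa2614.svs",
    "example-not-a-slide.txt"  # made-up entry, for illustration only
)

# Substring match: fixed = TRUE treats the pattern as a literal string.
substring_hits <- file_names[grepl("MS2", file_names, fixed = TRUE)]

# Regex match: keep only names ending in ".svs".
regex_hits <- file_names[grepl("\\.svs$", file_names)]
```

This filters after download rather than in the query itself, so it only helps when the unfiltered result set is small enough to fetch.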
Actually @LiNk-NY, I didn't realize you wrote that I was querying from the cases side. As you can see in the OP, I've always been querying from the files side. As I showed above, you get representations that don't make sense in the GenomicDataCommons results, e.g.:
gdc_query <-
    files() %>%
    GenomicDataCommons::filter(
        cases.project.project_id == "TCGA-BRCA"
        & cases.samples.sample_type == "Primary Tumor"
        & data_type == "Slide Image"
        & experimental_strategy == "Diagnostic Slide"
    ) %>%
    GenomicDataCommons::select(c(
        "file_name",
        "cases.samples.portions.slides.slide_id",
        "cases.samples.portions.slides.submitter_id"
    ))
gdc_results <- results_all(gdc_query)
sapply(sapply(sapply(gdc_results$cases, `[[`, "samples"), `[[`, "portions"), `[[`, "slides")
Here's an example from the output. Why would slide image file 17ec20fd-267d-4fb4-b236-822a7934c9f1 appear to be associated with two slides, when really it's only associated with the first one?
$`17ec20fd-267d-4fb4-b236-822a7934c9f1`
[[1]]
             submitter_id                             slide_id
1 TCGA-D8-A27V-01Z-00-DX1 9c2d8cab-1120-4e1b-9715-0d2dbfd7cab0

[[2]]
             submitter_id                             slide_id
1 TCGA-D8-A27V-01Z-00-DX2 444aa9cd-113d-451b-a747-3c85cd89a36d
Hi, @hermidalc. You'll need to check with the GDC folks about how to do the query you are asking for. I'm not sure that it is supported by the API. I don't see an obvious analog in the Portal UI, so they may not have that capability in the REST API.
More specifically, when you use the path cases-->samples-->portions-->slides-->slide_id, you get all the slides for the case, not the slides for the file that you started with. I may not have had enough coffee, but I don't see the path to get from the file back to the single analogous slide using the API.
Thank you @seandavi, no worries. I understand better now that there are limitations in the GDC graph DB structure with respect to the query API questions I'm having. I can interrogate the results and post-process them on my end to get the results I want.
Hi @seandavi and @LiNk-NY, I asked the GDC how they are able to show on their data portal web page only the correctly associated slide (and not all the slides for that case). You can do it by adding associated_entities.entity_submitter_id to the API query, which contains only the associated slide's submitter ID:
gdc_query <-
    files() %>%
    GenomicDataCommons::filter(
        cases.project.project_id == "TCGA-BRCA"
        & cases.samples.sample_type == "Primary Tumor"
        & data_type == "Slide Image"
        & experimental_strategy == "Diagnostic Slide"
    ) %>%
    GenomicDataCommons::select(c(
        "file_name",
        "associated_entities.entity_submitter_id",
        "cases.samples.portions.slides.slide_id",
        "cases.samples.portions.slides.submitter_id"
    ))
gdc_results <- results_all(gdc_query)
And then post-process the data structure, filtering out any cases.samples.portions.slides.submitter_id values that do not equal associated_entities.entity_submitter_id.
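The post-processing step can be sketched roughly as follows. This is only an illustration under stated assumptions: the real results_all() output is a nested list, so the sketch assumes it has already been flattened into a hypothetical data frame (candidates) with one row per file and candidate slide. The column names and the helper keep_associated_slides() are made up for this example; the IDs are the ones from the output shown earlier in the thread, with DX1 taken as the associated slide.

```r
# Hypothetical flattened stand-in for the nested query results:
# one row per (file, candidate slide) pair.
candidates <- data.frame(
    file_id = rep("17ec20fd-267d-4fb4-b236-822a7934c9f1", 2),
    # assumed to come from associated_entities.entity_submitter_id
    entity_submitter_id = rep("TCGA-D8-A27V-01Z-00-DX1", 2),
    # assumed to come from cases.samples.portions.slides.submitter_id
    slide_submitter_id = c(
        "TCGA-D8-A27V-01Z-00-DX1",
        "TCGA-D8-A27V-01Z-00-DX2"
    ),
    slide_id = c(
        "9c2d8cab-1120-4e1b-9715-0d2dbfd7cab0",
        "444aa9cd-113d-451b-a747-3c85cd89a36d"
    ),
    stringsAsFactors = FALSE
)

# Keep only rows whose slide submitter ID matches the file's associated entity.
keep_associated_slides <- function(df) {
    df[df$slide_submitter_id == df$entity_submitter_id, , drop = FALSE]
}

associated <- keep_associated_slides(candidates)
```

With the mock data above, associated keeps only the DX1 row; how to flatten the actual nested result into this shape is left open here.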
Thanks for checking back in @hermidalc !
I've updated my convenience function in TCGAutils::filenameToBarcode to work with slides and this query.
Best,
Marcel
For example:
Shows:
But the GDC data portal https://portal.gdc.cancer.gov/files/a9d0c0a1-6cf2-4e1a-bcce-48d705d50809 shows that the slide image file is only associated with TCGA-A8-A06X-01A-02-MS2.
Is there some insider information about the GDC API that I'm missing here? Unless there is some additional information to filter on that would give me the same relationship as the data portal, it's kind of worrisome. I checked the state of the slides and portions in question and they both show released.