Closed hermidalc closed 1 year ago
In a somewhat related issue...
library(GenomicDataCommons)  # provides files(), filter(), select(), results_all()

# Query TCGA-BRCA primary-tumor "HTSeq - Counts" files and request nested case metadata
gdc_query <-
  files() %>%
  GenomicDataCommons::filter(
    cases.project.project_id == "TCGA-BRCA"
    & cases.samples.sample_type == "Primary Tumor"
    & analysis.workflow_type == "HTSeq - Counts"
  ) %>%
  GenomicDataCommons::select(c(
    "file_name",
    "analysis.workflow_type",
    "cases.project.project_id",
    "cases.case_id",
    "cases.submitter_id",
    "cases.samples.sample_id",
    "cases.samples.submitter_id",
    "cases.samples.sample_type",
    "cases.samples.is_ffpe",
    "cases.samples.portions.portion_id",
    "cases.samples.portions.submitter_id",
    "cases.samples.portions.is_ffpe",
    "cases.samples.portions.analytes.aliquots.aliquot_id",
    "cases.samples.portions.analytes.aliquots.submitter_id"
  ))
# Fetch every page of results for the query
gdc_results <- results_all(gdc_query)
With this query I would expect to get only the aliquots, analytes, portions, samples, and cases that have HTSeq - Counts files. But when I drill down into the portions:
sapply(sapply(sapply(gdc_results$cases, `[[`, "samples"), `[[`, "portions"), `[[`, "portion_id")
For example the last few output elements:
$`b0cf3eb1-f9aa-4062-a61d-a0719155ecad`
[1] "3a0e4ae9-4d68-4af3-938b-c12654f29591"
$`58d983ee-8d34-4276-95d2-a660a87f1c2e`
[1] "78b36e1c-89ca-40dd-b0f7-ccb3de92d60d" "3cd08852-4369-5963-b564-fab2959b6691" "bbd3b2b8-afe1-5739-9a57-0b095e2ab66c"
$`2154c9df-56e8-4281-b917-9618ea1224dc`
[1] "b549dda5-678d-49cd-b464-609717682062" "1dd8e696-aecd-54f2-88d6-ea2c9376b219"
$`ed6d9b3b-bab4-466b-9238-6febe47cd076`
[1] "3f72044d-c463-4a11-a68b-ee4b7ea25c68"
$`f7c0dcbd-6704-41c7-8516-a83a7909d027`
[1] "20cea80d-c36d-4f79-872a-d3302280a70e"
When I examined the files with multiple portions, I saw that only one portion in each group has an analyte that isn't NULL and can be traced back to the file. The other portions are garbage that don't seem to be attached to anything at the GDC.
$`58d983ee-8d34-4276-95d2-a660a87f1c2e`
  portion_id                           analytes                                                           submitter_id        state    is_ffpe
1 78b36e1c-89ca-40dd-b0f7-ccb3de92d60d de01d9b6-e29d-495e-815a-c7d2a7c15af3, TCGA-PL-A8LZ-01A-31R-A36F-07 TCGA-PL-A8LZ-01A-31 released FALSE
2 3cd08852-4369-5963-b564-fab2959b6691 NULL                                                               <NA>                <NA>     NA
3 bbd3b2b8-afe1-5739-9a57-0b095e2ab66c NULL                                                               <NA>                <NA>     NA
$`2154c9df-56e8-4281-b917-9618ea1224dc`
  portion_id                           analytes                                                           submitter_id        state    is_ffpe
1 b549dda5-678d-49cd-b464-609717682062 2ca5942e-dcbf-46f1-8427-1ea7216dc559, TCGA-PL-A8LY-01A-11R-A41B-07 TCGA-PL-A8LY-01A-11 released FALSE
2 1dd8e696-aecd-54f2-88d6-ea2c9376b219 NULL                                                               <NA>                <NA>     NA
How do I get rid of these? Is adding cases.samples.portions.state == "released" to the query filter the generally preferred way for all similar situations? Or, as in the OP, is it better to use the appropriate filter syntax for "not missing"?
I added cases.samples.portions.state == "released" to the query filter, but that had no effect: the portions with NULL metadata rows, NULL linked analytes, and NA state (i.e. not really attached to anything) are still included in the results.
Unfortunately, the "lower" parts of the cases data structures are not tied to the original file query through a direct relationship. In other words, the relationship is between file and case ONLY. None of the additional case data has a key relationship back to the file, so you'll get lots of "case" stuff that is unrelated to, in this case, HTSeq - Counts. Any portions attached to the case are just that, and the data model for a portion does not require it to have any associated analyses, etc. Hope that helps, but feel free to reopen if I didn't get to everything here.
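Since the query itself cannot exclude these portions, one workaround is to drop them client-side after the results come back. The following is only a minimal sketch in base R on mock data that mirrors the printed output earlier in the thread. It assumes `analytes` is a list column whose entries are NULL for unlinked portions; `drop_unlinked_portions` is a hypothetical helper name, not part of GenomicDataCommons.

```r
# Mock of one file's `portions` table, mirroring the output shown above.
# Assumption: `analytes` is a list column that is NULL for unlinked portions.
portions <- data.frame(
  portion_id = c(
    "78b36e1c-89ca-40dd-b0f7-ccb3de92d60d",
    "3cd08852-4369-5963-b564-fab2959b6691",
    "bbd3b2b8-afe1-5739-9a57-0b095e2ab66c"
  ),
  submitter_id = c("TCGA-PL-A8LZ-01A-31", NA, NA),
  state = c("released", NA, NA),
  is_ffpe = c(FALSE, NA, NA),
  stringsAsFactors = FALSE
)
# The contents of each non-NULL entry are illustrative only.
portions$analytes <- list(
  c("de01d9b6-e29d-495e-815a-c7d2a7c15af3", "TCGA-PL-A8LZ-01A-31R-A36F-07"),
  NULL,
  NULL
)

# Hypothetical helper: keep only the portions that have a linked analyte.
drop_unlinked_portions <- function(portions) {
  has_analyte <- !vapply(portions$analytes, is.null, logical(1))
  portions[has_analyte, , drop = FALSE]
}

kept <- drop_unlinked_portions(portions)
print(kept$portion_id)
```

If the real results hold one such table per file, the helper could be applied to each with `lapply()`. Filtering on `!is.na(portions$state)` instead would presumably give the same rows here, since the unlinked portions in the output above all have an NA state.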
Thanks @seandavi, though for some reason I cannot reopen the issue. Does the OP feature request still hold? What is the filter syntax for the IS MISSING and NOT MISSING operators in the GDC query language? https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Advanced_Search/#is-missing-operator
If my previous response is a valid feature request please re-open the issue, though if not and I'm "missing" something (pun intended!) pls tell me
Hi Leandro, @hermidalc Thanks for the request. I've created PR #96 for review. Feel free to test out the "missing" branch. Best, Marcel
This should be implemented in #96
Sorry if I missed it in the docs or somewhere, but what is the filter syntax for GQL "missing" and "not missing"?