Bioconductor / GenomicDataCommons

Provide R access to the NCI Genomic Data Commons portal.
http://bioconductor.github.io/GenomicDataCommons/

filter syntax for GDC query API "IS MISSING" and "NOT MISSING" operators #93

Closed hermidalc closed 1 year ago

hermidalc commented 2 years ago

Sorry if I missed it in the docs or somewhere, but what is the filter syntax for GQL "missing" and "not missing"?

hermidalc commented 2 years ago

In a somewhat related issue...

library(GenomicDataCommons)
library(magrittr)  # provides %>%

gdc_query <-
    files() %>%
    GenomicDataCommons::filter(
        cases.project.project_id == "TCGA-BRCA"
        & cases.samples.sample_type == "Primary Tumor"
        & analysis.workflow_type == "HTSeq - Counts"
    ) %>%
    GenomicDataCommons::select(c(
        "file_name",
        "analysis.workflow_type",
        "cases.project.project_id",
        "cases.case_id",
        "cases.submitter_id",
        "cases.samples.sample_id",
        "cases.samples.submitter_id",
        "cases.samples.sample_type",
        "cases.samples.is_ffpe",
        "cases.samples.portions.portion_id",
        "cases.samples.portions.submitter_id",
        "cases.samples.portions.is_ffpe",
        "cases.samples.portions.analytes.aliquots.aliquot_id",
        "cases.samples.portions.analytes.aliquots.submitter_id"
    ))
gdc_results <- results_all(gdc_query)

With this query I would expect to get only the aliquots, analytes, portions, samples, and cases that have HTSeq - Counts files. But when I drill down into the portions:

sapply(sapply(sapply(gdc_results$cases, `[[`, "samples"), `[[`, "portions"), `[[`, "portion_id")

For example the last few output elements:

$`b0cf3eb1-f9aa-4062-a61d-a0719155ecad`
[1] "3a0e4ae9-4d68-4af3-938b-c12654f29591"

$`58d983ee-8d34-4276-95d2-a660a87f1c2e`
[1] "78b36e1c-89ca-40dd-b0f7-ccb3de92d60d" "3cd08852-4369-5963-b564-fab2959b6691" "bbd3b2b8-afe1-5739-9a57-0b095e2ab66c"

$`2154c9df-56e8-4281-b917-9618ea1224dc`
[1] "b549dda5-678d-49cd-b464-609717682062" "1dd8e696-aecd-54f2-88d6-ea2c9376b219"

$`ed6d9b3b-bab4-466b-9238-6febe47cd076`
[1] "3f72044d-c463-4a11-a68b-ee4b7ea25c68"

$`f7c0dcbd-6704-41c7-8516-a83a7909d027`
[1] "20cea80d-c36d-4f79-872a-d3302280a70e"

When I examine the files with multiple portions, I see that only one portion in each group has an analyte that isn't NULL and can be traced back to the file; the other portions are garbage that don't seem to be attached to anything at the GDC.

$`58d983ee-8d34-4276-95d2-a660a87f1c2e`
                            portion_id                                                           analytes        submitter_id    state is_ffpe
1 78b36e1c-89ca-40dd-b0f7-ccb3de92d60d de01d9b6-e29d-495e-815a-c7d2a7c15af3, TCGA-PL-A8LZ-01A-31R-A36F-07 TCGA-PL-A8LZ-01A-31 released   FALSE
2 3cd08852-4369-5963-b564-fab2959b6691                                                               NULL                <NA>     <NA>      NA
3 bbd3b2b8-afe1-5739-9a57-0b095e2ab66c                                                               NULL                <NA>     <NA>      NA

$`2154c9df-56e8-4281-b917-9618ea1224dc`
                            portion_id                                                           analytes        submitter_id    state is_ffpe
1 b549dda5-678d-49cd-b464-609717682062 2ca5942e-dcbf-46f1-8427-1ea7216dc559, TCGA-PL-A8LY-01A-11R-A41B-07 TCGA-PL-A8LY-01A-11 released   FALSE
2 1dd8e696-aecd-54f2-88d6-ea2c9376b219                                                               NULL                <NA>     <NA>      NA

How do I get rid of these? Is adding cases.samples.portions.state == "released" to the query filter the generally preferred way to handle all similar situations? Or should I use the appropriate "not missing" filter syntax that I asked about in the OP?

hermidalc commented 2 years ago

I added cases.samples.portions.state == "released" to the query filter, but that had no effect: the portions with NULL metadata rows, NULL linked analytes, and NA state (i.e. not really attached to anything) are still included in the results.

seandavi commented 2 years ago

Unfortunately, the "lower" parts of the cases data structures are not tied to the original file query through a direct relationship. In other words, the relationship is between file and case ONLY. All the additional case data does not have any key relationship back to file, so you'll get lots of "case" stuff that is unrelated to, in this case, HTSeq-counts. Any portions attached to the case are just that and the data model for a portion does not require it to have any associated analyses, etc. Hope that helps, but feel free to reopen if I didn't get to everything here.
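Since the query itself cannot restrict portions to those tied to the file, one option is to drop the empty portions client-side after retrieval. The following is a minimal base-R sketch under stated assumptions: the toy tables stand in for the per-sample portion tables printed earlier in the thread, the list-column layout of `analytes` is assumed, and `make_portions()` and `drop_empty_portions()` are hypothetical helpers, not part of GenomicDataCommons.

```r
# Build a toy portion table; `analytes` is a list column in which NULL marks
# a portion with no linked analytes (assumed layout, for illustration only).
make_portions <- function(portion_id, submitter_id, analytes) {
  df <- data.frame(portion_id = portion_id,
                   submitter_id = submitter_id,
                   stringsAsFactors = FALSE)
  df$analytes <- analytes
  df
}

portions_by_sample <- list(
  `58d983ee-8d34-4276-95d2-a660a87f1c2e` = make_portions(
    c("78b36e1c-89ca-40dd-b0f7-ccb3de92d60d",
      "3cd08852-4369-5963-b564-fab2959b6691",
      "bbd3b2b8-afe1-5739-9a57-0b095e2ab66c"),
    c("TCGA-PL-A8LZ-01A-31", NA, NA),
    list(c("de01d9b6-e29d-495e-815a-c7d2a7c15af3",
           "TCGA-PL-A8LZ-01A-31R-A36F-07"), NULL, NULL)
  ),
  `2154c9df-56e8-4281-b917-9618ea1224dc` = make_portions(
    c("b549dda5-678d-49cd-b464-609717682062",
      "1dd8e696-aecd-54f2-88d6-ea2c9376b219"),
    c("TCGA-PL-A8LY-01A-11", NA),
    list(c("2ca5942e-dcbf-46f1-8427-1ea7216dc559",
           "TCGA-PL-A8LY-01A-11R-A41B-07"), NULL)
  )
)

# Keep only the rows whose `analytes` entry is not NULL.
drop_empty_portions <- function(df) {
  has_analytes <- !vapply(df$analytes, is.null, logical(1))
  df[has_analytes, , drop = FALSE]
}

cleaned <- lapply(portions_by_sample, drop_empty_portions)

# Number of portions left per sample
vapply(cleaned, nrow, integer(1))
```

With real results the same function would have to be applied at whatever nesting level the portion tables actually sit, which depends on how `results_all()` shapes the response.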

hermidalc commented 2 years ago

Thanks @seandavi, though for some reason I cannot reopen the issue. Does the feature request in the OP still hold? What is the filter syntax for the IS MISSING and NOT MISSING operators in the GDC query language? https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Advanced_Search/#is-missing-operator
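For context, here is a rough sketch of what such a filter might look like at the raw GDC API level, built as nested R lists. It assumes the API's `is` and `not` operators take the literal value "MISSING", which should be checked against the GDC API documentation. The helper names (`is_missing`, `not_missing`, `equals`, `all_of`) are hypothetical and are not the GenomicDataCommons filter syntax this issue asks for.

```r
# Hypothetical helpers that build GDC API filter fragments as nested lists.
# Assumption: "is"/"not" with value "MISSING" express IS MISSING / NOT MISSING.
is_missing <- function(field) {
  list(op = "is", content = list(field = field, value = "MISSING"))
}
not_missing <- function(field) {
  list(op = "not", content = list(field = field, value = "MISSING"))
}
equals <- function(field, value) {
  list(op = "=", content = list(field = field, value = value))
}
all_of <- function(...) {
  list(op = "and", content = list(...))
}

# Combine an equality test with a NOT MISSING test on a field from the query above.
flt <- all_of(
  equals("cases.project.project_id", "TCGA-BRCA"),
  not_missing("cases.samples.portions.submitter_id")
)

# Serialize to JSON only if jsonlite happens to be installed.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  cat(jsonlite::toJSON(flt, auto_unbox = TRUE, pretty = TRUE), "\n")
}
```

Given the file-to-case relationship described earlier in the thread, a filter like this would presumably select files rather than prune individual portions from the returned case data.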

hermidalc commented 2 years ago

If my previous response is a valid feature request, please re-open the issue; if not and I'm "missing" something (pun intended!), please tell me.

LiNk-NY commented 2 years ago

Hi Leandro, @hermidalc Thanks for the request. I've created PR #96 for review. Feel free to test out the "missing" branch. Best, Marcel

LiNk-NY commented 1 year ago

This should be implemented in #96