Bioconductor / GenomicDataCommons

Provide R access to the NCI Genomic Data Commons portal.
http://bioconductor.github.io/GenomicDataCommons/

Making GenomicDataCommons less daunting to use for the average user #94

Open hermidalc opened 2 years ago

hermidalc commented 2 years ago

GenomicDataCommons is a very powerful library that can query pretty much anything at the GDC. Still, many users prefer other libraries such as TCGAbiolinks, because the GenomicDataCommons query result is a deeply nested list of recursive lists of data frames of lists (an R representation of JSON), which can be quite daunting for the average user to work with.

Users generally want to get and look at data frames. Not every query result can easily be transformed into a single data frame, but many can. It would help users a lot to show how to do this.

I use rrapply, a nice library made for dealing with recursive data structures like the ones GenomicDataCommons returns, to recursively alter anything I need inside the query results, and then build a data frame, e.g.:

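library(GenomicDataCommons)
library(magrittr)  # provides %>%
library(rrapply)

# project_ids, sample_types, and workflow_type must be defined before this point
stopifnot(GenomicDataCommons::status()$status == "OK")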
gdc_query <-
    files() %>%
    GenomicDataCommons::filter(
        cases.project.project_id %in% project_ids
        & cases.samples.sample_type %in% sample_types
        & analysis.workflow_type == workflow_type
    ) %>%
    GenomicDataCommons::select(c(
        "file_name",
        "analysis.workflow_type",
        "cases.project.project_id",
        "cases.case_id",
        "cases.submitter_id",
        "cases.samples.sample_id",
        "cases.samples.submitter_id",
        "cases.samples.sample_type",
        "cases.samples.is_ffpe",
        "cases.samples.portions.is_ffpe",
        "cases.samples.portions.analytes.aliquots.aliquot_id",
        "cases.samples.portions.analytes.aliquots.submitter_id"
    ))
gdc_results <- results_all(gdc_query)

gdc_results <- rrapply(
    gdc_results, f=function(x) NA, condition=is.null, how="replace"
)

gdc_df <- data.frame(
    file_uuid=gdc_results$file_id,
    file_name=gdc_results$file_name,
    workflow_type=gdc_results$analysis$workflow_type,
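    # vapply's third positional argument is FUN.VALUE (the expected result
    # type), not an index, so name it explicitly; behavior is unchanged
    project_id=vapply(
        sapply(gdc_results$cases, `[[`, "project"), `[`,
        FUN.VALUE=character(1)
    ),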
    case_uuid=sapply(gdc_results$cases, `[[`, "case_id"),
    case_submitter_id=sapply(gdc_results$cases, `[[`, "submitter_id"),
    sample_uuid=sapply(
        sapply(gdc_results$cases, `[[`, "samples"), `[[`, "sample_id"
    ),
    sample_submitter_id=sapply(
        sapply(gdc_results$cases, `[[`, "samples"), `[[`, "submitter_id"
    ),
    sample_type=sapply(
        sapply(gdc_results$cases, `[[`, "samples"), `[[`, "sample_type"
    ),
    sample_is_ffpe=sapply(
        sapply(gdc_results$cases, `[[`, "samples"), `[[`, "is_ffpe"
    ),
    portion_is_ffpe=sapply(
        sapply(
            sapply(
                gdc_results$cases, `[[`, "samples"
            ), `[[`, "portions"
        ), `[[`, "is_ffpe"
    ),
    aliquot_uuid=sapply(
        sapply(
            sapply(
                sapply(
                    sapply(
                        gdc_results$cases, `[[`, "samples"
                    ), `[[`, "portions"
                ), `[[`, "analytes"
            ), `[[`, "aliquots"
        ), `[[`, "aliquot_id"
    ),
    aliquot_submitter_id=sapply(
        sapply(
            sapply(
                sapply(
                    sapply(
                        gdc_results$cases, `[[`, "samples"
                    ), `[[`, "portions"
                ), `[[`, "analytes"
            ), `[[`, "aliquots"
        ), `[[`, "submitter_id"
    ),
    row.names=gdc_results$file_id,
    stringsAsFactors=FALSE
)
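
The two steps above (replace NULL with NA, then pull one column per nested field) can be tried out without a GDC connection. The following is a minimal base R sketch on toy data, not the rrapply or GenomicDataCommons implementation. The `records` layout is an assumption and is much simpler than a real query result, and `null_to_na`, `pluck_path`, and `column_from` are hypothetical helper names introduced only for illustration.

```r
# Toy stand-in for a nested query result. The field layout is an assumption
# for illustration and is simpler than what the GDC actually returns.
records <- list(
    list(
        file_id = "f1",
        cases = list(
            case_id = "c1",
            samples = list(sample_id = "s1", sample_type = "Primary Tumor")
        )
    ),
    list(
        file_id = "f2",
        cases = list(
            case_id = "c2",
            samples = list(sample_id = "s2", sample_type = NULL)
        )
    )
)

# Recursively replace NULL leaves with NA (base R stand-in for the rrapply
# call). NA_character_ is used because every toy field is text.
null_to_na <- function(x) {
    if (is.null(x)) return(NA_character_)
    if (is.list(x) && !is.data.frame(x)) return(lapply(x, null_to_na))
    x
}

# Follow a vector of keys down into one record (hypothetical helper).
pluck_path <- function(x, path) {
    for (key in path) x <- x[[key]]
    x
}

# Build one character column by plucking the same path from every record.
column_from <- function(records, path) {
    vapply(records, pluck_path, character(1), path = path)
}

records <- null_to_na(records)

toy_df <- data.frame(
    file_uuid = column_from(records, "file_id"),
    case_uuid = column_from(records, c("cases", "case_id")),
    sample_uuid = column_from(records, c("cases", "samples", "sample_id")),
    sample_type = column_from(records, c("cases", "samples", "sample_type")),
    row.names = column_from(records, "file_id"),
    stringsAsFactors = FALSE
)
print(toy_df)
```

Naming the path once per column keeps each line readable as the nesting gets deeper. The sketch assumes exactly one value per record at every path, which is also what the one-row-per-file data frame above relies on.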

For query results that cannot easily be transformed into a single data frame with the above method (because of the GDC DB key relationship structure), I make multiple queries, transform each into its own data frame as above, and then join the data frames into the single one I need.
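
As a rough illustration of the joining step, here is a minimal sketch using base R `merge()` on two toy data frames. The column names and values are made up for the example and do not come from a real query; which key to join on depends on the fields selected in each query.

```r
# Toy per-query data frames; all IDs and values are invented for illustration.
files_df <- data.frame(
    file_uuid = c("f1", "f2", "f3"),
    case_uuid = c("c1", "c2", "c1"),
    stringsAsFactors = FALSE
)
cases_df <- data.frame(
    case_uuid = c("c1", "c2"),
    project_id = c("PROJ-A", "PROJ-B"),
    stringsAsFactors = FALSE
)

# Left join: keep every file row and attach the matching case-level columns.
joined_df <- merge(files_df, cases_df, by = "case_uuid", all.x = TRUE)
print(joined_df)
```

With `all.x = TRUE`, files whose case has no match in the second data frame are kept with NA in the added columns instead of being dropped.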

Anyway, this is probably a bit challenging for the average user. Maybe there is an easier way to work with GenomicDataCommons that I've totally missed. But if not, and what I'm doing is generally a good approach, maybe it's worth adding some examples to the vignette showing how to make data frames from query results.