Bioconductor / GenomicDataCommons

Provide R access to the NCI Genomic Data Commons portal.
http://bioconductor.github.io/GenomicDataCommons/

How to get the type of file normal or tumor in the manifest file #77

Closed ranijames closed 2 years ago

ranijames commented 4 years ago

Hello all, and thanks for the GDC package. I have a fairly simple question: is there a way to get the sample_type field into the manifest file for each sample? That is, for each file_id or filename, I would like to know whether it is normal or tumor. Here is the code I use to download normal and tumor breast cancer samples; in the manifest file I would like an indication of which files are tumor and which are normal.

ge_manifest = files() %>%
    filter( cases.project.project_id == 'TCGA-BRCA' & cases.samples.sample_type=='Solid Tissue Normal') %>% 
    filter( type == 'gene_expression' ) %>%
    filter( analysis.workflow_type == 'HTSeq - Counts')  %>%
    manifest()

Currently, the manifest has the following headers:

id                 filename                    md5                size state

Thanks for your help and support!

seandavi commented 4 years ago

Unfortunately, the manifest file is specific to the GDC download tool, so it contains only the five columns above. Instead, try something like:

ge_manifest = files() %>%
    filter( cases.project.project_id == 'TCGA-BRCA' & cases.samples.sample_type=='Solid Tissue Normal') %>% 
    filter( type == 'gene_expression' ) %>%
    filter( analysis.workflow_type == 'HTSeq - Counts') %>% 
    expand('cases.samples') %>% results_all() %>% 
    as_tibble()

Then, we need to do two levels of unnesting due to the nested nature of the data coming back from the API.

full_metadata = ge_manifest %>% tidyr::unnest(cases) %>% 
    tidyr::unnest(samples, names_sep="_")

Result:

# A tibble: 113 x 42
   state experimental_st… md5sum type  data_type samples_state samples_days_to… samples_initial…
   <chr> <chr>            <chr>  <chr> <chr>     <chr>                    <int>            <dbl>
 1 rele… RNA-Seq          e6771… gene… Gene Exp… released                    42              330
 2 rele… RNA-Seq          18025… gene… Gene Exp… released                  2669              530
 3 rele… RNA-Seq          69f47… gene… Gene Exp… released                  3879              180
 4 rele… RNA-Seq          57daa… gene… Gene Exp… released                  5893              920
 5 rele… RNA-Seq          67e69… gene… Gene Exp… released                   287              160
 6 rele… RNA-Seq          92526… gene… Gene Exp… released                   121              390
 7 rele… RNA-Seq          7950e… gene… Gene Exp… released                   275              260
 8 rele… RNA-Seq          97341… gene… Gene Exp… released                  2639              520
 9 rele… RNA-Seq          3d36f… gene… Gene Exp… released                  1391              930
10 rele… RNA-Seq          24dc8… gene… Gene Exp… released                  3629              310
# … with 103 more rows, and 34 more variables: samples_current_weight <lgl>,
#   samples_time_between_excision_and_freezing <lgl>, samples_pathology_report_uuid <lgl>,
#   samples_tumor_code_id <lgl>, samples_shortest_dimension <lgl>,
#   samples_freezing_method <lgl>, samples_updated_datetime <chr>, samples_oct_embedded <chr>,
#   samples_days_to_sample_procurement <lgl>, samples_intermediate_dimension <lgl>,
#   samples_sample_type_id <chr>, samples_tissue_type <chr>, samples_tumor_descriptor <lgl>,
#   samples_preservation_method <lgl>, samples_tumor_code <lgl>,
#   samples_longest_dimension <lgl>, samples_time_between_clamping_and_freezing <lgl>,
#   samples_sample_id <chr>, samples_composition <lgl>, samples_created_datetime <lgl>,
#   samples_sample_type <chr>, samples_is_ffpe <lgl>, samples_submitter_id <chr>,
#   file_size <int>, file_id <chr>, updated_datetime <chr>, data_format <chr>, file_name <chr>,
#   data_category <chr>, acl <named list>, id <chr>, created_datetime <chr>, access <chr>,
#   submitter_id <chr>
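
In the flattened table, the normal/tumor label sits in the samples_sample_type column listed among the extra variables above. As a minimal sketch of why that column appears after two rounds of unnesting, the toy table below mimics the nested shape described here. Its file IDs, file names, and values are made up for illustration, and real API results carry many more columns.

```r
library(magrittr)  # supplies the %>% pipe without attaching tidyr or dplyr

# Hypothetical stand-in for the API result: one row per file, a nested
# `cases` list-column, and inside each case a nested `samples` list-column.
toy <- tibble::tibble(
  file_id   = c("file-1", "file-2"),
  file_name = c("normal.counts.gz", "tumor.counts.gz"),
  cases = list(
    tibble::tibble(samples = list(tibble::tibble(sample_type = "Solid Tissue Normal"))),
    tibble::tibble(samples = list(tibble::tibble(sample_type = "Primary Tumor")))
  )
)

# The first unnest exposes `samples`; the second prefixes its fields with "samples_".
flat <- toy %>%
  tidyr::unnest(cases) %>%
  tidyr::unnest(samples, names_sep = "_")

# Keep just the columns needed to label each file as normal or tumor.
sample_key <- flat[, c("file_id", "file_name", "samples_sample_type")]
print(sample_key)
```

The same column selection should apply to full_metadata, since file_id, file_name, and samples_sample_type all appear in its variable list.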

ranijames commented 4 years ago

Thanks for the reply. It is still not clear to me which column of the result indicates whether a sample is normal or tumor. Also, the snippet currently throws this error:

Error in UseMethod("expand_") :
  no applicable method for 'expand_' applied to an object of class "c('gdc_files', 'GDCQuery', 'list')"
In addition: Warning message:
`expand_()` is deprecated as of tidyr 1.0.0.
Please use `expand()` instead

hermidalc commented 2 years ago

@ranijames the issue you were having is due to namespace clobbering between GenomicDataCommons and tidyr when both libraries are attached. Honestly, this was an R issue until 4.0, where you can now import only what you want:

library(dplyr, include.only = c("select", "mutate"))

Separately from that, @seandavi's query works for me without issue in R 3.6 if you attach only GenomicDataCommons and then call tidyr::as_tibble():

library(GenomicDataCommons)

ge_manifest = files() %>%
   filter( cases.project.project_id == 'TCGA-BRCA' & cases.samples.sample_type=='Solid Tissue Normal') %>%
   filter( type == 'gene_expression' ) %>%
   filter( analysis.workflow_type == 'HTSeq - Counts') %>%
   expand('cases.samples') %>% results_all() %>%
   tidyr::as_tibble()
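
To check whether masking is in play in a given session, one option is to ask R which attached packages define a name; the first entry is the one an unqualified call resolves to. The helper name providers_of below is made up for this sketch, which relies only on utils::find().

```r
# Hypothetical helper: list the attached environments that define a function
# called `fn_name`, in search-path order (the first one wins).
providers_of <- function(fn_name) {
  utils::find(fn_name, mode = "function")
}

# With only the default packages attached, filter is provided by stats;
# attaching GenomicDataCommons, dplyr, or tidyr adds further entries.
print(providers_of("filter"))
print(providers_of("expand"))
```

If tidyr shows up ahead of GenomicDataCommons for expand, an unqualified expand() call will dispatch to tidyr, which would be consistent with the expand_ error reported earlier in the thread.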

seandavi commented 2 years ago

Thanks, @hermidalc, for pointing out the namespace issues. The best practice is to use GenomicDataCommons::filter rather than just filter so as to avoid the multiple other filters that are out there.

hermidalc commented 2 years ago

The same goes for expand, because tidyr masks it.
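
Putting the advice in this thread together, a namespace-qualified version of the query might look like the sketch below. The function name fetch_brca_normal_samples is made up here, the call needs network access to the GDC API, and the workflow name is copied from the thread, so it may not match what the current GDC data release offers.

```r
library(magrittr)  # supplies %>% only; nothing here masks filter() or expand()

# Hypothetical wrapper around the query from this thread, with every
# GenomicDataCommons and tidyr call qualified by its package name.
fetch_brca_normal_samples <- function() {
  GenomicDataCommons::files() %>%
    GenomicDataCommons::filter(
      cases.project.project_id == "TCGA-BRCA" &
        cases.samples.sample_type == "Solid Tissue Normal"
    ) %>%
    GenomicDataCommons::filter(type == "gene_expression") %>%
    GenomicDataCommons::filter(analysis.workflow_type == "HTSeq - Counts") %>%
    GenomicDataCommons::expand("cases.samples") %>%
    GenomicDataCommons::results_all() %>%
    tidyr::as_tibble() %>%
    tidyr::unnest(cases) %>%
    tidyr::unnest(samples, names_sep = "_")
}

# Usage (requires network access to the GDC API):
# full_metadata <- fetch_brca_normal_samples()
# full_metadata[, c("file_id", "file_name", "samples_sample_type")]
```

Because every call is qualified, the order in which other packages are attached should no longer matter for this query.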