Closed ranijames closed 2 years ago
Unfortunately, the manifest file is specific to the GDC download tool, so it contains only the five columns above. Instead, try something like:
ge_manifest = files() %>%
filter( cases.project.project_id == 'TCGA-BRCA' & cases.samples.sample_type=='Solid Tissue Normal') %>%
filter( type == 'gene_expression' ) %>%
filter( analysis.workflow_type == 'HTSeq - Counts') %>%
expand('cases.samples') %>% results_all() %>%
as_tibble()
Then, we need to do two levels of unnesting due to the nested nature of the data coming back from the API.
full_metadata = ge_manifest %>% tidyr::unnest(cases) %>%
tidyr::unnest(samples, names_sep="_")
Result:
# A tibble: 113 x 42
state experimental_st… md5sum type data_type samples_state samples_days_to… samples_initial…
<chr> <chr> <chr> <chr> <chr> <chr> <int> <dbl>
1 rele… RNA-Seq e6771… gene… Gene Exp… released 42 330
2 rele… RNA-Seq 18025… gene… Gene Exp… released 2669 530
3 rele… RNA-Seq 69f47… gene… Gene Exp… released 3879 180
4 rele… RNA-Seq 57daa… gene… Gene Exp… released 5893 920
5 rele… RNA-Seq 67e69… gene… Gene Exp… released 287 160
6 rele… RNA-Seq 92526… gene… Gene Exp… released 121 390
7 rele… RNA-Seq 7950e… gene… Gene Exp… released 275 260
8 rele… RNA-Seq 97341… gene… Gene Exp… released 2639 520
9 rele… RNA-Seq 3d36f… gene… Gene Exp… released 1391 930
10 rele… RNA-Seq 24dc8… gene… Gene Exp… released 3629 310
# … with 103 more rows, and 34 more variables: samples_current_weight <lgl>,
# samples_time_between_excision_and_freezing <lgl>, samples_pathology_report_uuid <lgl>,
# samples_tumor_code_id <lgl>, samples_shortest_dimension <lgl>,
# samples_freezing_method <lgl>, samples_updated_datetime <chr>, samples_oct_embedded <chr>,
# samples_days_to_sample_procurement <lgl>, samples_intermediate_dimension <lgl>,
# samples_sample_type_id <chr>, samples_tissue_type <chr>, samples_tumor_descriptor <lgl>,
# samples_preservation_method <lgl>, samples_tumor_code <lgl>,
# samples_longest_dimension <lgl>, samples_time_between_clamping_and_freezing <lgl>,
# samples_sample_id <chr>, samples_composition <lgl>, samples_created_datetime <lgl>,
# samples_sample_type <chr>, samples_is_ffpe <lgl>, samples_submitter_id <chr>,
# file_size <int>, file_id <chr>, updated_datetime <chr>, data_format <chr>, file_name <chr>,
# data_category <chr>, acl <named list>, id <chr>, created_datetime <chr>, access <chr>,
# submitter_id <chr>
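In the result above, the normal/tumor label appears to be the samples_sample_type column, sitting next to file_id and file_name. A minimal sketch of pulling out just those three columns, assuming the unnested table has the column names printed above; the toy data frame below is a placeholder stand-in for full_metadata, not real GDC records:

```r
# Placeholder stand-in for `full_metadata`; the ids and file names are made up
# for illustration and are NOT real GDC records.
full_metadata <- data.frame(
  file_id = c("file-a", "file-b", "file-c"),
  file_name = c("a.counts.gz", "b.counts.gz", "c.counts.gz"),
  samples_sample_type = c("Solid Tissue Normal", "Primary Tumor", "Primary Tumor"),
  stringsAsFactors = FALSE
)

# Keep only the file identifiers and the sample-type label.
sample_types <- full_metadata[, c("file_id", "file_name", "samples_sample_type")]

# Count how many files fall under each sample type.
type_counts <- table(sample_types$samples_sample_type)
print(type_counts)
```

With the real full_metadata from the query above, every row should read "Solid Tissue Normal", since the filter restricts the query to that sample type; dropping that condition from the filter would be one way to get both normal and tumor files in the same table.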
Thanks for the reply. It is still not clear to me which column in the result specifies whether a sample is normal or tumor. Also, the current script snippet throws this error:
Error in UseMethod("expand_") :
  no applicable method for 'expand_' applied to an object of class "c('gdc_files', 'GDCQuery', 'list')"
In addition: Warning message:
`expand_()` is deprecated as of tidyr 1.0.0.
Please use `expand()` instead
On Tue, May 26, 2020, at 9:08 PM, Sean Davis wrote the reply quoted above (https://github.com/Bioconductor/GenomicDataCommons/issues/77#issuecomment-634219610).
-- Kind Regards, Dr.Alva Rani James
@ranijames the issue you were having is due to library import namespace clobbering between GenomicDataCommons and tidyr. Honestly, this was an R issue until 4.0, where you can now import only what you want:
library(dplyr, include.only = c("select", "mutate"))
Separately from that, @seandavi's query works for me without issue in R 3.6 if you attach only GenomicDataCommons and then call tidyr::as_tibble():
library(GenomicDataCommons)
ge_manifest = files() %>%
filter( cases.project.project_id == 'TCGA-BRCA' & cases.samples.sample_type=='Solid Tissue Normal') %>%
filter( type == 'gene_expression' ) %>%
filter( analysis.workflow_type == 'HTSeq - Counts') %>%
expand('cases.samples') %>% results_all() %>%
tidyr::as_tibble()
Thanks, @hermidalc, for pointing out the namespace issues. The best practice is to use GenomicDataCommons::filter rather than just filter, so as to avoid the multiple other filters that are out there.
Same for expand too, because tidyr masks it.
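For anyone landing here later, a minimal sketch of the fully qualified form suggested in the replies above, assuming GenomicDataCommons, tibble, and tidyr are installed. The helper name fetch_normal_counts_metadata is hypothetical (it is not part of any package), and the filters are copied from the query earlier in this thread. Calling the function needs network access to the GDC; defining it does not.

```r
# Hypothetical helper, not part of GenomicDataCommons: the query from this
# thread with every verb namespace-qualified, so that attaching tidyr or dplyr
# cannot mask filter() or expand().
fetch_normal_counts_metadata <- function() {
  query <- GenomicDataCommons::files()
  query <- GenomicDataCommons::filter(
    query,
    cases.project.project_id == "TCGA-BRCA" &
      cases.samples.sample_type == "Solid Tissue Normal"
  )
  query <- GenomicDataCommons::filter(query, type == "gene_expression")
  query <- GenomicDataCommons::filter(
    query,
    analysis.workflow_type == "HTSeq - Counts"
  )
  query <- GenomicDataCommons::expand(query, "cases.samples")

  ge_manifest <- tibble::as_tibble(GenomicDataCommons::results_all(query))

  # Two levels of unnesting, as in the earlier reply.
  full_metadata <- tidyr::unnest(ge_manifest, cases)
  tidyr::unnest(full_metadata, samples, names_sep = "_")
}
```

Usage would be full_metadata <- fetch_normal_counts_metadata(). Because nothing is attached with library(), the order in which other packages are loaded should no longer matter.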
Hello all, thanks for the GDC package. I have a rather simple question: is there a way to get the sample_type field in the manifest file for each sample? That is, for each file_id or file name, I would like to know whether it is normal or tumor. Here is the piece of code for downloading both normal and tumor breast cancer samples; in the manifest file I would like an indication of which of those files are tumor and which are normal.
Currently, the headers are the following,
Thanks for your help and support!