Bioconductor / GenomicDataCommons

Provide R access to the NCI Genomic Data Commons portal.
http://bioconductor.github.io/GenomicDataCommons/

TCGA data ids not matching in cases() and files() #86

Closed cpguy101 closed 2 years ago

cpguy101 commented 3 years ago

Hi,

I filtered the data to build a manifest using the criteria ['gene_expression', 'HTSeq - Counts', 'Solid Tissue Normal'].

Data: TCGA-BRCA

stn_manifest = files() %>%
  filter( cases.project.project_id == 'TCGA-BRCA') %>% 
  filter( type == 'gene_expression' ) %>%
  filter( analysis.workflow_type == 'HTSeq - Counts')  %>%
  filter( cases.samples.sample_type=='Solid Tissue Normal') %>%
  manifest()

# A tibble: 6 x 5
  id                              filename                                     md5                          size state  
  <chr>                           <chr>                                        <chr>                       <dbl> <chr>  
1 7b49680e-d7a7-4b6c-8763-5d0d26~ 5fa89257-f0c1-43ed-99d6-991870b4e422.htseq.~ a8ef05be595d66a7310560ec9~ 256322 releas~
2 2f17f6de-2278-4d1e-ac84-524bf5~ a1a68fe9-9635-4b7f-b9a2-34474ef8c1dc.htseq.~ e71a3b244ae9e0bf3ac8290db~ 253788 releas~
3 cefaebf8-5419-4ddd-8dc1-e2867d~ 69878c97-02ab-4504-9506-aea3adbee455.htseq.~ ee1f455f63189b77c4cd9fc9a~ 254134 releas~
4 0c9bc998-17a7-42cf-b4d2-b8a5df~ 338b5431-af4d-41ce-855f-998a293c3680.htseq.~ 93698aff04fe34693181632e7~ 257815 releas~
5 078f7608-b8f7-4827-b752-adfed6~ e29ce54a-49a1-47a8-82fd-7687cec0d1bb.htseq.~ b7aa3478b5a7026b1d5a75970~ 256217 releas~
6 f821249b-0738-49a8-89e6-4756e8~ b41174a5-4db8-438c-a9fa-6da8c08a9c75.htseq.~ 3a3f7f6fb4085dcfac27509bd~ 253861 releas~

Why are these ids different from the ones I get with cases()? (I checked this earlier using the same filter() conditions.)

resp = cases() %>% filter(~ project.project_id=='TCGA-BRCA' &
                            type == 'gene_expression' &
                            samples.sample_type=='Solid Tissue Normal' &
                            analysis.workflow_type == 'HTSeq - Counts') %>%
  GenomicDataCommons::select(c(default_fields(cases()),'samples.sample_type')) %>%
  response_all()

> head(resp$results$id)
[1] "5cdae21d-eee5-478f-932a-0f51fcf5f031" "8c09f413-e938-4f2e-a414-84f0e7fcfe41" "d6f911b5-e895-43f8-8f86-0ac2f1bc6fae"
[4] "fa176764-a76f-44c7-b97a-cd6d21e052be" "5d1d00c6-fcae-479e-ae1e-de76efd41d98" "cc074b7f-d3b2-4880-902e-bf10e667b665"

I expected the ids to be the same, since I intended to filter and then merge the two data frames on id. Are the ids returned by cases() different from those returned by files()?

LiNk-NY commented 2 years ago

Sorry for the late response. Please tag me or seandavi for a quicker response.

Case UUIDs are different from file UUIDs. You'd have to translate file UUIDs to case UUIDs. Perhaps TCGAutils::UUIDtoUUID can help with this.

Best,
Marcel
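
A minimal sketch of how the translated ids could be joined back onto the manifest. `attach_case_ids` is a hypothetical helper, not part of GenomicDataCommons or TCGAutils. The commented usage assumes that `TCGAutils::UUIDtoUUID()` accepts `to_type = "case_id"` and returns a data frame with columns named `file_id` and `cases.case_id`; those column names are an assumption, so check `names(id_map)` first.

```r
# Hypothetical helper: add a case_id column to a manifest, given a lookup
# table that maps file UUIDs to case UUIDs.
attach_case_ids <- function(manifest, id_map, file_col, case_col) {
  stopifnot(file_col %in% names(id_map), case_col %in% names(id_map))
  lookup <- data.frame(
    id = id_map[[file_col]],
    case_id = id_map[[case_col]],
    stringsAsFactors = FALSE
  )
  # Left join on the file UUID so every manifest row is kept;
  # files without a match get NA in case_id.
  merge(manifest, lookup, by = "id", all.x = TRUE)
}

# Intended use with the manifest from the question (needs network access):
#   id_map <- TCGAutils::UUIDtoUUID(stn_manifest$id, to_type = "case_id")
#   names(id_map)  # confirm the column names before joining
#   stn_cases <- attach_case_ids(stn_manifest, id_map,
#                                file_col = "file_id",        # assumed name
#                                case_col = "cases.case_id")  # assumed name
```

The resulting `case_id` column can then be compared with, or merged against, the `id` values returned by `cases()`. Note that `merge()` returns a plain data frame sorted by `id`, not a tibble in the original row order.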