BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

TARGET-NBL sample/file duplication error when downloading star - counts #517

Open FerrenaAlexander opened 2 years ago

FerrenaAlexander commented 2 years ago

Hello, thanks for all your work on this fantastic package!

I encountered an issue downloading the STAR - Counts data for TARGET-NBL. It seems to be caused by some cases in that cohort having more than one file.

Version: packageVersion('TCGAbiolinks') #‘2.25.0’

  1. A warning in GDCquery:
    
    project <- 'TARGET-NBL'
    # Assign the result so it can be passed to getResults() and GDCprepare() later
    query <- GDCquery(project = project,
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "STAR - Counts",
                      sample.type = "Primary Tumor")

o GDCquery: Searching in GDC database

Genome of reference: hg38

oo Accessing GDC. This might take a while...

ooo Project: TARGET-NBL

oo Filtering results

ooo By data.type

ooo By workflow.type

ooo By sample.type

oo Checking data

ooo Check if there are duplicated cases

Warning: There are more than one file for the same case. Please verify query results. You can use the command View(getResults(query)) in rstudio

ooo Check if there results for the query

o Preparing output

         results    project           data.category                      data.type legacy access
    1 c("046fc.... TARGET-NBL Transcriptome Profiling Gene Expression Quantification  FALSE     NA
      experimental.strategy file.type platform sample.type barcode workflow.type
    1                    NA        NA       NA Primary ....      NA STAR - Counts


2. Using the query above, I get an error in GDCprepare:

rse <- GDCprepare(query = query)

cases experimental_strategy analysis_workflow_type
74 TARGET-30-PANKFE-01A-01R RNA-Seq STAR - Counts
75 TARGET-30-PANKFE-01A-01R RNA-Seq STAR - Counts
37 TARGET-30-PAPTFZ-01A-01R RNA-Seq STAR - Counts
129 TARGET-30-PAPTFZ-01A-01R RNA-Seq STAR - Counts
21 TARGET-30-PASYPX-01A-01R RNA-Seq STAR - Counts
98 TARGET-30-PASYPX-01A-01R RNA-Seq STAR - Counts

Error in GDCprepare(query = query) : There are samples duplicated. We will not be able to prepare it



I understand this is caused by duplicated case barcodes: each of these cases has more than one file. I can find the duplicated rows, but I don't know why they are duplicated. I inspected them as follows:

res <- getResults(query)
dups <- res$cases[which(duplicated(res$cases))]
res[res$cases %in% dups,]
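
The inspection above can be taken one step further to list the barcodes that have exactly one file. The following is only a minimal sketch on a toy data frame standing in for `getResults(query)`: the `id` values and the last barcode are made up for illustration, and passing the surviving barcodes back through the `barcode` argument of `GDCquery` is an untested assumption, not the fix adopted later in this thread.

```r
# Toy stand-in for getResults(query); the real table has many more columns.
res <- data.frame(
  id = c("f1", "f2", "f3", "f4", "f5"),  # hypothetical file ids
  cases = c("TARGET-30-PANKFE-01A-01R", "TARGET-30-PANKFE-01A-01R",
            "TARGET-30-PAPTFZ-01A-01R", "TARGET-30-PAPTFZ-01A-01R",
            "TARGET-30-AAAAAA-01A-01R"),  # last barcode is made up
  stringsAsFactors = FALSE
)

# duplicated() alone flags only the second and later copies, so combine it
# with fromLast = TRUE to flag every row whose barcode occurs more than once.
is_dup <- duplicated(res$cases) | duplicated(res$cases, fromLast = TRUE)

dup_rows <- res[is_dup, ]              # rows to inspect
unique_barcodes <- res$cases[!is_dup]  # barcodes with exactly one file

# Assumption (untested): with a real query, unique_barcodes could be supplied
# to a new GDCquery() call via its `barcode` argument to leave out the
# duplicate-file cases.
```

This drops both files of a duplicated case rather than picking one, since nothing in the results table shown here says which file is the right one to keep.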

Do you have any suggestions, or can you suggest how to proceed without these duplicate-file cases?

Thank you!

tiagochst commented 2 years ago

I need to check this. Some of the samples have mixed samples.

[Screenshot attached]

tiagochst commented 2 years ago

I need to check why the sample.submitter_id field has only one sample for the mixed samples.

[Screenshot attached]

callisto111 commented 2 years ago

I'm having the same issue. Have you found a way to solve it, or a way to access the query data and manually remove the duplicated samples?

Thank you!

tiagochst commented 2 years ago

@callisto111 Do you have the latest version installed from GitHub? This issue was solved: https://rpubs.com/tiagochst/TCGAbiolinks_517
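
For readers unsure how to get the GitHub version, here is a minimal sketch of one way to do it. It assumes the fix is present from version 2.25.2 onward (the version number mentioned later in this thread) and that `BiocManager::install()` is used with an "owner/repo" GitHub source; the helper names `needs_update` and `update_from_github` are hypothetical.

```r
# Version number mentioned in this thread as the one without the error
# (assumption: the fix is present from this version onward).
fixed_version <- "2.25.2"

# TRUE when the given version string is older than the fixed one.
needs_update <- function(installed, required = fixed_version) {
  package_version(installed) < package_version(required)
}

# Install the development version from GitHub. Defined only, not called here.
update_from_github <- function() {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # BiocManager accepts "owner/repo" GitHub sources when the remotes
  # package is available.
  BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
}

# Usage in an interactive session:
# if (needs_update(as.character(packageVersion("TCGAbiolinks")))) {
#   update_from_github()
# }
```

After installing, restarting the R session before re-running the query helps make sure the new version is the one actually loaded.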

callisto111 commented 2 years ago

@tiagochst I am a beginner. My TCGAbiolinks version is 2.25.0, but yours is 2.25.2. I tried many ways to update it but failed, and I still get the error about duplicated samples. I really don't know what I should do. Thank you in advance.

callisto111 commented 2 years ago

@tiagochst I have solved this problem. Thanks for your advice.