BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

error in GDCprepare #470

Open ch8316f5eyu opened 3 years ago

ch8316f5eyu commented 3 years ago

I encountered an error in GDCprepare. Here is the code:

query.exp <- GDCquery(
    project = 'CPTAC-3',
    legacy = F,
    data.category = "Transcriptome Profiling",
    data.type = 'Gene Expression Quantification',
    workflow.type = 'HTSeq - Counts',
    experimental.strategy = "RNA-Seq"
)
GDCdownload(query.exp)
x = GDCprepare(
    query = query.exp,
    save = T,
    save.filename = paste0('~/project/cancer/TCGA/exp_CPTAC-3.rda')
)

The error occurs after GDCprepare:

|==================================================================================================================================|100% Completed after 1 m
Starting to add information to samples
=> Add clinical information to samples
Error in xj[i] : invalid subscript type 'list'

Thanks.

tiagochst commented 3 years ago

GDCquery breaks for CPTAC-3 due to mixed samples.

query.exp <- GDCquery(
    project = 'CPTAC-3', 
    legacy = F,
    data.category = "Transcriptome Profiling", 
    data.type = 'Gene Expression Quantification',
    workflow.type = 'HTSeq - Counts', 
    experimental.strategy = "RNA-Seq"
)

query.exp$results[[1]] <- query.exp$results[[1]][1:100,]
GDCdownload(query.exp,files.per.chunk = 100) 
x <- GDCprepare(query = query.exp, save = F)
[Screenshot attached: 2021-10-24]

huiyijiangling commented 2 years ago

Same bug in CPTAC-3 when using GDCprepare. Please help!

query.exp = GDCquery(
    project = "CPTAC-3",
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    data.format = "TSV",
    workflow.type = "STAR - Counts"
)
GDCdownload(query.exp, method = "api", files.per.chunk = 10)

Downloading data for project CPTAC-3
Of the 1883 files for download 1883 already exist.
All samples have been already downloaded

pre.exp = GDCprepare(query = query.exp)

|===============================================================================================================================|100% Completed after 1 m
Error in `levels<-`(`*tmp*`, value = as.character(levels)) : factor level [81] is duplicated

tiagochst commented 2 years ago

@huiyijiangling Thank you for reporting this bug. It seems the CPTAC-3 barcode does not distinguish replicates the way other projects' barcodes do, but I need to double-check this. For example, C3N-02765-02 has 4 files with counts.
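As a small illustration of the R-level error quoted earlier in the thread (not of the TCGAbiolinks internals, which are not shown here), base R raises a message of the same form when non-unique identifiers are used as factor levels. This is only a sketch on toy data: apart from C3N-02765-02, the sample ID below is made up.

```r
# Toy vector of sample IDs; C3N-02765-02 appears twice, as a replicated sample would.
# "C3N-00001-01" is a made-up ID used only for illustration.
sample_ids <- c("C3N-00001-01", "C3N-02765-02", "C3N-02765-02")

# Count entries per sample ID to spot the replicated ones.
files_per_sample <- table(sample_ids)
replicated <- names(files_per_sample)[files_per_sample > 1]

# Using the non-unique IDs as factor levels triggers a "duplicated level" error;
# tryCatch captures the message instead of stopping the script.
err <- tryCatch(
    factor(sample_ids, levels = sample_ids),
    error = function(e) conditionMessage(e)
)

print(replicated)
print(err)
```

Counting IDs with `table()` before calling GDCprepare is one way to see how many samples are affected.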

[Two screenshots attached: 2022-08-01]

tiagochst commented 2 years ago

@huiyijiangling There are 28 duplicated samples. For the moment, you can remove those samples and the code should work. I need to think more about how to deal with this case without breaking the other projects and other parts of the code. I will probably need to concatenate the sample and analyte IDs for CPTAC-3 (e.g., C3N-02765-02_CPT0184450060 instead of C3N-02765-02).

query.exp <- GDCquery(
        project = 'CPTAC-3',
        legacy = F,
        data.category = "Transcriptome Profiling",
        data.type = 'Gene Expression Quantification',
        workflow.type = "STAR - Counts"
    )
# remove duplicated
query.exp$results[[1]] <- query.exp$results[[1]][!duplicated(query.exp$results[[1]]$sample.submitter_id),]

GDCdownload(query.exp,files.per.chunk = 40)
se <- GDCprepare(
    query = query.exp,
    save = F
)
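
The two ideas in this comment (dropping duplicates now, concatenating sample and analyte IDs later) can be sketched on a toy data frame. This is only an illustration under assumptions: `analyte_id` is a hypothetical column name, the data frame merely stands in for `query.exp$results[[1]]`, and every identifier except C3N-02765-02 and CPT0184450060 is made up.

```r
# Toy stand-in for query.exp$results[[1]]; `analyte_id` is a hypothetical column name.
results <- data.frame(
    sample.submitter_id = c("C3N-02765-02", "C3N-02765-02", "C3N-00001-01"),
    analyte_id = c("CPT0184450060", "CPT0000000001", "CPT0000000002"),
    stringsAsFactors = FALSE
)

# Workaround from the comment: keep only the first file for each sample ID.
deduplicated <- results[!duplicated(results$sample.submitter_id), ]

# Proposed alternative: build a unique ID from the sample and analyte IDs,
# so that replicates are kept instead of dropped.
results$unique_id <- paste(results$sample.submitter_id, results$analyte_id, sep = "_")

print(nrow(deduplicated))
print(results$unique_id)
print(anyDuplicated(results$unique_id))
```

The first approach loses the replicate files; the second keeps them but changes the identifier that downstream code sees.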
huiyijiangling commented 2 years ago

Thank you for your solution for removing duplicated samples. CPTAC-3 often uses mixed samples in RNA-seq and protein expression quantification, either for QC or to increase the amount of tissue. These have different filenames, but their barcode/submitted_case_id/submitted_sample_id values are not unique. I will use your solution for removing duplicated samples, and I look forward to seeing the problem fixed in the next version. Thank you again!

yiyisun682 commented 2 years ago

Hello! When I use the same code to download the TARGET-AML dataset, which also has duplicated samples, I get the same error.

query.exp <- GDCquery(
    project = 'TARGET-AML',
    legacy = F,
    data.category = "Transcriptome Profiling",
    data.type = 'Gene Expression Quantification',
    workflow.type = "STAR - Counts"
)
# remove duplicated
query.exp$results[[1]] <- query.exp$results[[1]][!duplicated(query.exp$results[[1]]$sample.submitter_id),]

GDCdownload(query.exp,files.per.chunk = 40)
se <- GDCprepare(
    query = query.exp,
    save = F
)

yiyisun682 commented 2 years ago

The error messages are as follows:

yiyisun682 commented 2 years ago

=> Add clinical information to samples
Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length

itscarolnunes commented 1 year ago

The error messages are as follows:

I have the same error with the same project. Did you find a solution?

jaygamma commented 1 year ago

Apologies in advance for my speculating - I don't have the most experience with code!

Also having this issue, here's a traceback:

Starting to add information to samples
Adding description to TARGET samples
Warning: Expected 5 pieces. Additional pieces discarded in 187 rows [57, 74, 77, 90, 95, 215, 240, 244, 279, 296, 313, 406, 411, 445, 453, 492, 498, 505, 507, 529, ...].
=> Add clinical information to samples
Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
> traceback()
9: stop("invalid 'row.names' length")
8: `.rowNamesDF<-`(x, value = value)
7: `row.names<-.data.frame`(`*tmp*`, value = value)
6: `row.names<-`(`*tmp*`, value = value)
5: `rownames<-`(`*tmp*`, value = barcode)
4: colDataPrepare(cases)
3: makeSEfromTranscriptomeProfilingSTAR(data = df, cases = cases)
2: readTranscriptomeProfiling(files = files, data.type = ifelse(!is.na(query$data.type), 
       as.character(query$data.type), unique(query$results[[1]]$data_type)), 
       workflow.type = unique(query$results[[1]]$analysis_workflow_type), 
       cases = cases, summarizedExperiment)
1: GDCprepare(query)

Trying to follow this back through the GDCprepare source code: the colDataPrepare function correctly identifies the samples as TARGET samples and calls colDataPrepareTARGET, as evidenced by the "Adding description to TARGET samples" output from within that function. Somewhere within that function the code expects 5 pieces and drops the extras (as seen in the warning), then proceeds through the remainder of colDataPrepare (as evidenced by the "Add clinical information to samples" output).

Running debug(colDataPrepare), I can see that the DFrame 'ret' returned by colDataPrepareTARGET has rows of NAs where the warning indicates data was dropped. Execution then reaches the last line of colDataPrepare, where the data frame's row.names are set to the sample barcodes. The issue is that you have the original X barcodes passed to colDataPrepareTARGET, which returned (X-187) samples, so the code tries to set the row.names of an (X-187)-row data frame with a vector of length X, which throws the invalid 'row.names' length error.
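
The length mismatch described above can be reproduced in base R without TCGAbiolinks. This is a minimal sketch on made-up data, not the package's actual code path: three barcodes go in, only two rows come back, and assigning the barcodes as row names fails.

```r
# Three barcodes were passed in (made-up names for illustration)...
barcode <- c("sample-1", "sample-2", "sample-3")

# ...but the returned table has only two rows, as if one sample had been dropped.
ret <- data.frame(value = c(10, 20))

# Assigning 3 row names to a 2-row data frame fails; capture the message.
msg <- tryCatch(
    {
        rownames(ret) <- barcode
        "no error"
    },
    error = function(e) conditionMessage(e)
)

print(msg)
```

Comparing `length(barcode)` with `nrow(ret)` before the assignment would surface the mismatch earlier.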

I believe this is happening because of the following code in colDataPrepareTARGET:

regex <- paste0("[:alnum:]{5}-[:alnum:]{2}-[:alnum:]{6}", "-[:alnum:]{3}-[:alnum:]{3}")
samples <- str_match(barcode,regex)[,1]

Here the sample IDs are screened for the TARGET format: 5 alphanumeric characters followed by 2, then 6, then 3, then 3. This is indeed the format of TARGET IDs such as "TARGET-20-PARNFZ-03A-01R". However, the TARGET-AML database also includes some samples formatted like "TARGET-20-PAYHMK-Sorted-leukemic-09A-01R", which do not match the regex. There happen to be 187 of these in my query, matching the 187 discarded rows in the warning.
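
To check the pattern against the two barcodes quoted above, here is a rough base R sketch. It assumes that the POSIX class `[[:alnum:]]` in base R behaves like the stringr-style `[:alnum:]` in the original code for these inputs, and it uses `grepl()` rather than `str_match()`.

```r
# Base R translation of the quoted pattern (assumption: [[:alnum:]] is an
# adequate stand-in for the stringr-style [:alnum:] used in the package).
regex <- paste0(
    "[[:alnum:]]{5}-[[:alnum:]]{2}-[[:alnum:]]{6}",
    "-[[:alnum:]]{3}-[[:alnum:]]{3}"
)

# The two barcode formats mentioned in this comment.
barcodes <- c(
    "TARGET-20-PARNFZ-03A-01R",
    "TARGET-20-PAYHMK-Sorted-leukemic-09A-01R"
)

# TRUE where the pattern is found somewhere in the barcode.
matches <- grepl(regex, barcodes)

print(data.frame(barcode = barcodes, matches = matches, nchar = nchar(barcodes)))
```

The standard barcode matches and is 24 characters long, while the "Sorted-leukemic" barcode does not match and is longer.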

@tiagochst I assume these functions were written when all TARGET-AML samples matched that regex, so this error didn't exist. I'm not sure how to circumvent this issue short of dropping the 187 samples (which I'd rather not do!), and in all honesty I'm not 100% sure how to drop those 187 specific samples from the query before attempting to prepare it. Any hope for a solution?

jaygamma commented 1 year ago

@yiyisun682 @itscarolnunes You can bypass the error you're experiencing by running the following code immediately before GDCprepare():

library(dplyr)  # provides %>% and filter()
query$results[[1]] <- query$results[[1]] %>% filter(nchar(cases)==24)

This will pull your query results out, filter them for case IDs exactly 24 characters long (and therefore correctly formatted to pass the regex check) and set the query list to the filtered IDs.

Doing this reduced my query size from 3064 cases to 2809, but GDCprepare() then completes without error.
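
Since the filter silently discards rows, it may help to keep a record of what is removed. The sketch below does the same length-based filtering in base R on a toy data frame that stands in for `query$results[[1]]`; only the `cases` column is modeled, and the two barcodes are the ones quoted earlier in the thread.

```r
# Toy stand-in for query$results[[1]]; only the `cases` column matters here.
results <- data.frame(
    cases = c(
        "TARGET-20-PARNFZ-03A-01R",
        "TARGET-20-PAYHMK-Sorted-leukemic-09A-01R"
    ),
    stringsAsFactors = FALSE
)

# Same rule as the filter above: keep barcodes exactly 24 characters long.
keep <- nchar(results$cases) == 24

# Record the rows that will be removed, then apply the filter.
dropped <- results[!keep, , drop = FALSE]
results <- results[keep, , drop = FALSE]

print(dropped$cases)
print(nrow(results))
```

Inspecting `dropped` shows which samples are being excluded, for example the sorted leukemic fractions.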