BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

GDCprepare unable to integrate CPTAC-3 methylation data #491

Open shadihames opened 2 years ago

shadihames commented 2 years ago

I've been trying to use TCGAbiolinks to download the CPTAC data, but I've run into a problem with the methylation data. The download itself works, but I can't combine the downloaded files using GDCprepare(). I think the issue is the number of columns in the downloaded files. I've previously used TCGAbiolinks to download the TCGA methylation data and it worked fine.

CPTAC_3 <- GDCquery(project = "CPTAC-3",
                    data.category = "DNA Methylation",
                    platform = "Illumina Methylation Epic",
                    data.type = "Methylation Beta Value")
GDCdownload(CPTAC_3, method = "api")
GDCprepare(CPTAC_3)

Error in fread(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE,  : 
  colClasses= is an unnamed vector of types, length 5, but there are 2 columns in the input. To specify types for a subset of columns, you can use a named vector, list format, or specify types using select= instead of colClasses=. Please see examples in ?fread.
Error in if (value == n) { : argument is of length zero

If I look at one of the downloaded CPTAC-3 files, there are only 2 columns and no column names, e.g.:

cg18478105 0.0116882851338117

However, if I look at one of the files downloaded for the TCGA data, there are 13 columns with column names, e.g.:

Composite Element REF, Beta_value, Chromosome, Start, End, Gene_Symbol, Gene_Type, Transcript_ID, Position_to_TSS, CGI_Coordinate, Feature_Type
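For anyone who wants to check their own download, here is a minimal sketch of how the layout of a single file could be inspected with base R. The temporary file only stands in for a real downloaded file: its first row is the example row quoted above and its second row is made up. The "probe IDs start with cg" check is a heuristic assumption, not something TCGAbiolinks does.

```r
# Stand-in for one downloaded CPTAC-3 file (a real path will differ).
# Row 1 is the example from this issue; row 2 is made up for illustration.
f <- tempfile(fileext = ".txt")
writeLines(c("cg18478105\t0.0116882851338117",
             "cg00000000\t0.5"), f)

# Count tab-separated fields on each line without parsing any types.
n_fields <- unique(count.fields(f, sep = "\t"))
n_fields

# Peek at the first line to guess whether the file has a header row.
# Heuristic (assumption): data rows start with a "cg" probe ID.
first_line <- readLines(f, n = 1)
has_header <- !grepl("^cg", first_line)
has_header
```

If `n_fields` is 2 and `has_header` is `FALSE`, the file has the two-column, headerless layout described here rather than the annotated TCGA layout.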

shadihames commented 2 years ago

I've been looking more into this issue and it seems like a relatively simple fix (if I've gotten the issue correct).

In prepare.R on line 741 it pulls out files from the platforms with OMA00 in the description, and I think the CPTAC-3 methylation data fits that format rather than the hg38 format it's being recognised as.

The only available platform when you GDCquery() for the CPTAC-3 methylation data is 'Illumina Methylation EPIC', but I think the format of the files matches the OMA00 platforms, where the colClasses argument is set to character and numeric. Those are the correct types for the two columns when you look at the files individually.
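To illustrate the point about colClasses, here is a rough sketch of reading one two-column file with `data.table::fread()` using character and numeric types. It is not the code from prepare.R. The helper name `read_beta` and the column names `probe_id` and `beta_value` are made up for this example, and the temporary file stands in for a real download.

```r
library(data.table)

# Hypothetical helper (not part of TCGAbiolinks): read a headerless,
# two-column beta value file as probe ID (character) + beta (numeric).
read_beta <- function(path) {
  fread(path, header = FALSE, sep = "\t",
        colClasses = c("character", "numeric"),
        col.names = c("probe_id", "beta_value"))
}

# Stand-in file containing the example row from this issue.
f <- tempfile(fileext = ".txt")
writeLines("cg18478105\t0.0116882851338117", f)

beta <- read_beta(f)
str(beta)
```

With two types supplied for two columns, the length mismatch reported in the error above does not arise.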

troysgit commented 1 year ago

A script of mine that previously worked (two months ago) is now also getting a similar error when accessing methylation data from TCGA:

Error in fread(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE,  : 
  colClasses= is an unnamed vector of types, length 5, but there are 2 columns in the input. To specify types for a subset of columns, you can use a named vector, list format, or specify types using select= instead of colClasses=. Please see examples in ?fread.
Error in if (value == n) { : argument is of length zero
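As a possible stopgap while GDCprepare() fails on these files, the two-column files could be combined by hand into a probes-by-samples matrix. The sketch below uses only base R and assumes every file is headerless with a probe ID and a beta value per row. The helpers `read_beta` and `combine_beta`, the sample labels, and the toy file contents are all made up for illustration; mapping real file names to sample barcodes is left out.

```r
# Hypothetical helper: read one headerless two-column beta file.
read_beta <- function(path) {
  read.delim(path, header = FALSE, sep = "\t",
             col.names = c("probe_id", "beta_value"),
             colClasses = c("character", "numeric"))
}

# Hypothetical helper: combine several files into a probes x samples
# matrix. Probes missing from a file are left as NA for that sample.
combine_beta <- function(paths, sample_ids) {
  stopifnot(length(paths) == length(sample_ids))
  tables <- lapply(paths, read_beta)
  probes <- Reduce(union, lapply(tables, `[[`, "probe_id"))
  mat <- matrix(NA_real_, nrow = length(probes), ncol = length(paths),
                dimnames = list(probes, sample_ids))
  for (i in seq_along(tables)) {
    mat[tables[[i]]$probe_id, i] <- tables[[i]]$beta_value
  }
  mat
}

# Toy files with made-up probe IDs and values, standing in for downloads.
d <- tempfile()
dir.create(d)
p1 <- file.path(d, "a.txt")
p2 <- file.path(d, "b.txt")
writeLines(c("cg1\t0.1", "cg2\t0.2"), p1)
writeLines(c("cg2\t0.9", "cg3\t0.3"), p2)

beta <- combine_beta(c(p1, p2), c("sample_A", "sample_B"))
beta
```

This only produces a plain numeric matrix. It does not attach the probe annotation or sample metadata that GDCprepare() normally provides.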