BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

TCGAbiolinks downloads duplicated clinical data, and the case count differs from getProjectSummary #346

Open wentgithub opened 4 years ago

wentgithub commented 4 years ago

rm(list = ls())
library(TCGAbiolinks)
library(dplyr)
library(DT)
query <- GDCquery(project = "TCGA-OV",
                  data.category = "Clinical",
                  file.type = "xml")
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
write.table(clinical, 'b.txt', sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)

You will see that it contains three duplicated bcr_patient_barcode values:

TCGA-3P-A9WA TCGA-59-A5PD TCGA-5X-AA5U

I am waiting for your help. Thanks a lot.
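
For readers who want to reproduce this check, the following is a minimal sketch of how repeated barcodes can be listed in base R. It runs on a made-up data frame (`clinical_toy` and its barcodes are hypothetical), assuming only that the real table has a `bcr_patient_barcode` column as described above.

```r
# Toy stand-in for the data frame returned by GDCprepare_clinic();
# the barcodes and values are invented for illustration.
clinical_toy <- data.frame(
  bcr_patient_barcode = c("TCGA-AA-0001", "TCGA-AA-0002",
                          "TCGA-AA-0001", "TCGA-AA-0003"),
  gender = c("FEMALE", "FEMALE", "FEMALE", "FEMALE"),
  stringsAsFactors = FALSE
)

# Barcodes that occur more than once
dup_barcodes <- unique(
  clinical_toy$bcr_patient_barcode[duplicated(clinical_toy$bcr_patient_barcode)]
)

# All rows belonging to those barcodes, for side-by-side inspection
dup_rows <- clinical_toy[clinical_toy$bcr_patient_barcode %in% dup_barcodes, ]
print(dup_rows)
```

With the real data, `clinical_toy` would be replaced by the `clinical` object created above.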

Here is the summary for TCGA-OV. The clinical case_count you report is 608, but the clinical data obtained above has only 590 rows, even including the three duplicated entries. So what is the real case count?

image

tiagochst commented 4 years ago

I'll check the count and the duplicated samples. I have not touched that function in a long time.

wentgithub commented 4 years ago

Thanks a lot. I look forward to your reply.

tiagochst commented 4 years ago

It seems TCGA-OV has 608 cases, but only 587 have clinical data (your 590 rows minus the 3 duplicates). The numbers match those on the GDC Data Portal, as shown below:

[Screenshot, 2019-09-03 10:41:05 AM] [Screenshot, 2019-09-03 10:47:48 AM]

Example of case missing clinical data:

[Screenshot, 2019-09-03 10:55:10 AM]

Code to check the samples missing: http://rpubs.com/tiagochst/TCGA-OV-cases
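
The comparison itself comes down to a set difference between all case identifiers and the identifiers that appear in the clinical table. The sketch below is not the linked code; it is a base R illustration on invented identifiers (`all_cases` and `clinical_barcodes` are hypothetical vectors).

```r
# Hypothetical identifiers: every case in the project ...
all_cases <- c("C1", "C2", "C3", "C4", "C5")
# ... and the barcodes found in the clinical table (with one repeat)
clinical_barcodes <- c("C1", "C2", "C2", "C4")

# Collapse repeats before counting cases with clinical data
with_clinical <- unique(clinical_barcodes)

# Cases that have no clinical record at all
missing_clinical <- setdiff(all_cases, with_clinical)

cat("Total cases:        ", length(all_cases), "\n")
cat("With clinical data: ", length(with_clinical), "\n")
cat("Missing:            ", paste(missing_clinical, collapse = ", "), "\n")
```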

wentgithub commented 4 years ago

Thanks a lot. So what is the right way to get the clinical data? Like this, and then deduplicate it myself?

query <- GDCquery(project = "TCGA-OV",
                  data.category = "Clinical",
                  file.type = "xml")
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
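
If deduplicating by hand turns out to be necessary, one possible approach is sketched below on a made-up table (`clinical_toy` is hypothetical). It first drops exact copies, then flags barcodes whose rows disagree, and only then keeps the first row per barcode. Which row to keep is a judgment call, not something the package prescribes.

```r
# Toy table: P1 is an exact duplicate, P3 has two conflicting records
clinical_toy <- data.frame(
  bcr_patient_barcode = c("P1", "P2", "P1", "P3", "P3"),
  vital_status = c("Alive", "Dead", "Alive", "Alive", "Dead"),
  stringsAsFactors = FALSE
)

# Step 1: drop rows that are exact copies of an earlier row
clinical_unique <- unique(clinical_toy)

# Step 2: barcodes that still repeat carry conflicting records
conflicting <- unique(
  clinical_unique$bcr_patient_barcode[duplicated(clinical_unique$bcr_patient_barcode)]
)

# Step 3 (a judgment call): keep the first row for each barcode
clinical_dedup <- clinical_unique[!duplicated(clinical_unique$bcr_patient_barcode), ]
print(clinical_dedup)
```

Reviewing the barcodes in `conflicting` before step 3 avoids silently discarding a record that differs from the one kept.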

tiagochst commented 4 years ago

I suggest using the indexed data:

clinical.indexed <- GDCquery_clinic(project = "TCGA-OV", type = "clinical")

wentgithub commented 4 years ago

Unfortunately, using the same code, I do not get the same data as you.

image

After running

write.table(clinical.indexed, "ov_clinical", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)

you can see that the result has doubled row names and that most of the content is NA.

image

tiagochst commented 4 years ago

Which version of TCGAbiolinks do you have installed? It seems to be an old one. Could you please update it from GitHub with:

withr::with_envvar(c(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true"), 
  remotes::install_github('BioinformaticsFMRP/TCGAbiolinks')
)
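
To answer the version question before and after updating, the installed version can be read with `utils::packageVersion()`. The helper below is a small sketch (`installed_version` is a hypothetical name, not part of TCGAbiolinks) that returns NA instead of failing when the package is absent.

```r
# Return the installed version of a package as a string,
# or NA if the package is not installed.
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

print(installed_version("TCGAbiolinks"))
```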

wentgithub commented 4 years ago

I had installed it from Bioconductor before. After installing it as you suggested, the counts are now right, but the content is still not what I wanted: at the very least, it lacks tumor stage information.

image

Here is my code:

rm(list = ls())
library(TCGAbiolinks)
library(dplyr)
library(DT)
clinical.indexed <- GDCquery_clinic(project = "TCGA-OV", type = "clinical")

Where am I going wrong? Thanks a lot.

tiagochst commented 4 years ago

The indexed data is parsed from the XML files. It seems there is a problem with the parsing. You can get that information from the BCR Biotab or XML files.

query <- GDCquery(project = "TCGA-OV",
                  data.category = "Clinical",
                  data.type = "Clinical Supplement",
                  data.format = "BCR Biotab")
GDCdownload(query)
# GDCprepare returns a list of tables; clinical_patient_ov holds the patient-level data
clinical.BCRtab.all <- GDCprepare(query)
clinical.BCRtab.all$clinical_patient_ov$tumor_grade
clinical.BCRtab.all$clinical_patient_ov$clinical_stage

[Screenshot, 2019-09-05 9:56:10 AM]

wentgithub commented 4 years ago

I am sorry, but it makes no difference whether I install TCGAbiolinks from Bioconductor or with the method you supplied:

withr::with_envvar(c(R_REMOTES_NO_ERRORS_FROM_WARNINGS = "true"),
  remotes::install_github('BioinformaticsFMRP/TCGAbiolinks')
)

Running this code:

rm(list = ls())
library(TCGAbiolinks)
library(dplyr)
library(DT)

query <- GDCquery(project = "TCGA-OV",
                  data.category = "Clinical",
                  data.type = "Clinical Supplement",
                  data.format = "BCR Biotab")
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)

Both installations report the same error:

image

So I checked the function, and it also shows no such argument:

image

Sorry for disturbing you so many times. I hope TCGAbiolinks will become an even more outstanding package for analysing TCGA data.
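
Whether the loaded version of a function accepts a given argument can also be checked programmatically with base R's `formals()`. The sketch below uses a hypothetical helper, `has_argument`, and demonstrates it on `write.table` so that it runs without TCGAbiolinks installed.

```r
# TRUE if the function `fn` declares a formal argument called `arg_name`
has_argument <- function(fn, arg_name) {
  arg_name %in% names(formals(fn))
}

# Demonstration on a base function
print(has_argument(write.table, "sep"))
print(has_argument(write.table, "data.format"))
```

With TCGAbiolinks loaded, the same call would be `has_argument(GDCquery, "data.format")`. A FALSE result would suggest that the version in the current session predates the argument, for example because R was not restarted after the update.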