TCGAbiolinks introduces duplicate rownames?

alsmnn commented 5 years ago

Hello, I recently ran into some problems with TCGAbiolinks. I used the TCGAbiolinks package in april 2018 to download the bladder cancer set and ran some analysis. Unfortunately I don't know which version I used at that time. Today I wanted to download the TCGA-BLCA set again, with TCGAbiolinks version 2.10.4 from BioConductor 3.8 and I have slightly different results regarding the number of genes. To clarify TCGA_BLCA is the cached data set from april 2018 and test_tcga_blca is the same data set from today.

> TCGA_BLCA
class: RangedSummarizedExperiment 
dim: 56830 433 
metadata(1): data_release
assays(1): HTSeq - Counts
rownames(56830): ENSG00000000003 ENSG00000000005 ... ENSG00000281912 ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name original_ensembl_gene_id
colnames(433): TCGA-XF-A8HD-01A-11R-A36F-07 TCGA-DK-AA6W-01A-12R-A39I-07 ... TCGA-GD-A76B-01A-11R-A32O-07 TCGA-ZF-AA4T-01A-11R-A38B-07
colData names(229): sample patient ... subtype_Fusion.in.TNFRSF21 subtype_Fusion.in.ASIP

> test_tcga_blca
class: RangedSummarizedExperiment 
dim: 56925 433 
metadata(1): data_release
assays(1): HTSeq - Counts
rownames(56925): ENSG00000000003 ENSG00000000005 ... ENSG00000281912 ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name original_ensembl_gene_id
colnames(433): TCGA-GD-A2C5-01A-12R-A180-07 TCGA-BT-A42F-01A-11R-A23W-07 ... TCGA-GU-A766-01A-11R-A32O-07 TCGA-GU-A42R-01A-11R-A23N-07
colData names(230): sample patient ... subtype_Fusion.in.TNFRSF21 subtype_Fusion.in.ASIP

You can see, that the first data set has 56830 genes, while the second data set has 56925. This are 95 genes more than in the old data set. The colData() is also in another order than in the old one. The next step would be to use the DESeqDataSet() function, for differentially expression analysis. Doing this with the second data set results in the following warning message:

> test_tcga_blca_dds <- DESeqDataSet(test_tcga_blca, 
+                                  design = ~ shortLetterCode)
renaming the first element in assays to 'counts'
converting counts to integer mode
Warning messages:
1: In DESeqDataSet(test_tcga_blca, design = ~shortLetterCode) :
  209 duplicate rownames were renamed by adding numbers
2: In DESeqDataSet(test_tcga_blca, design = ~shortLetterCode) :
  some variables in design formula are characters, converting to factors

I can easily remove these duplicate entries with the following code:

>  remove_duplicates <- function(dds1) {
+      duplicates <- grep("\\.", rownames(dds1))
+      dds <- dds1[-duplicates]
+      dds
+  }

> remove_duplicates(test_tcga_blca_dds)
class: DESeqDataSet 
dim: 56716 433 
metadata(2): data_release version
assays(1): counts
rownames(56716): ENSG00000000003 ENSG00000000005 ... ENSG00000281912 ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name original_ensembl_gene_id
colnames(433): TCGA-GD-A2C5-01A-12R-A180-07 TCGA-BT-A42F-01A-11R-A23W-07 ... TCGA-GU-A766-01A-11R-A32O-07 TCGA-GU-A42R-01A-11R-A23N-07
colData names(230): sample patient ... subtype_Fusion.in.TNFRSF21 subtype_Fusion.in.ASIP

Now we have 56716 genes in the new data set, which are exactly these 209 genes, which were duplicates, but now we have 114 genes less than in the old data set (56830 - 56716). I have absolutely no idea where thes duplicates come from and why we are missing 114 genes, after removing these duplicates, compared to the old data set.

These missing genes will have a huge impact on normalization, clustering, and so on. Actually I encountered the problem while making a heatmap with clustering of the samples with a subset of the genes and the result was totally different. Heatmap of old data set

old_data_set

and resulting survival plot of these 4 clusters

old_data_set_survival

Heatmap of new data set

new_data_set

and resulting survival plotof these 4 clusters

new_data_set_survival

Can you confirm these problems and are these TCGAbiolinks- or DESeq-related problems?

Thanks in advance and best regards! @AljoLe

Code to reproduce the new data set

# select wanted data with GDCquery
# 
query <- GDCquery(project = "TCGA-BLCA",  # project-code for Bladder Cancer
                  data.category = "Transcriptome Profiling",  # catogry for RNA-Seq Data
                  data.type = "Gene Expression Quantification", # raw reads, or , aligned data
                  workflow.type = "HTSeq - Counts") # FPKM / HTSeq-Count / etc.

# download selected data
# accesed TCGA data on 2018-04-27 14:27
# 
GDCdownload(query, # name of the filtered dataset assigned above
            method = "api", # needs to be set in order to download from the api
            files.per.chunk = 10, # this should minimise prob. of corruption
            directory = "GDCdata") # save files to a seperate directory

# create a SummarizedExperiment with counts and clinical data
# 
test_tcga_blca <- GDCprepare(query,
                             summarizedExperiment = TRUE) # Downloads also clinical data

# delete *.rda files, because they are only needed for mapping genes while creating summarizedExperiment
# 
unlink("*.rda")

# Create DSeqDataSet of summarizedExperiment
# 
test_tcga_blca_dds <- DESeqDataSet(test_tcga_blca, 
                                 design = ~ shortLetterCode)

renaming the first element in assays to 'counts'
converting counts to integer mode
Warning messages:
1: In DESeqDataSet(test_tcga_blca, design = ~shortLetterCode) :
  209 duplicate rownames were renamed by adding numbers
2: In DESeqDataSet(test_tcga_blca, design = ~shortLetterCode) :
  some variables in design formula are characters, converting to factors

tiagochst commented 5 years ago

@AljoLe Please, I am not sure if you have both rda objects so I can compare them.

From april 2018, the main difference that came to my mind was the the genome of reference was updated. Which actually happened (screen shot below). Probably some genes were removed. You can get the raw matrix with all genes by setting SummarizedExperiment = FALSE in GDCprepare. There was a discussing before if we would use the most updated version (the last patched one) or not. We decided to the last patch of hg38 or hg19.

Screenshot from 2019-03-20 11-13-47

However I am quite surprised those small number of genes would be able to produce that effect in the normalization/clustering. But indeed the heatmap looks different for some genes, impacting the clustering.

So, If you set SummarizedExperiment = FALSE and use all genes, what will be the result ? Closer to the first one or the second ? If it is closer to the second one I don't believe it is the number of genes.

alsmnn commented 5 years ago

@tiagochst I don't know if I used the latest version of TCGAbiolinks available back in april 2018, I think it was an older one. The strange thing is, that these 209 duplicates are in every set of the TCGA data, when I download them with the latest TCGAbiolinks package. I have .RData-files of both, the new and the old data set. I can upload them, but they are quite big (19.5 MB per data set). Should I upload them here, or provide them on another way?
I downloaded the data set again with SummarizedExperiment = FALSE as you suggested.
Now I get a data frame with 60483 obs. of 434 variables. Are there any summaries happening while creating the SummarizedExperiment or why do I have even more genes in the raw data? Maybe the problem has something to do with this summary? The gene ids are looking like "ENSG00000000003.13" instead of ENSG00000000003 like in the SummarizedExperiment. I can't subset the data, created with SummarizedExperiment = FALSE because of the difference in the identifier table.

I hope that helps.

Best regards

tiagochst commented 5 years ago

Box, dropbox, or google drive would be fine for the files. The summarizedExperiment only annotate the gene ENSEBML, but some are removed in the patched version we don't change the data. You can remove the .13 from the emsembl gene names as substr("ENSG00000000003.13",1,15).

alsmnn commented 5 years ago

Here is the link to the files. I have no idea, why the new data sets are so big... https://drive.google.com/open?id=1wOs7JQLRkRpU9-2DY52Xq6PWM1mI3bXG I made a new heatmap plot from the data with all genes. I subsetted them and transformed them via log(all_data+1,2)

all_genes_wo_normalization

I am not able to say, which plot looks more like this, but I would tend to the first one. Maybe you can help after having a look on the data files.

martinguerrero89 commented 5 years ago

Hi there! Also found duplicated genes in my analysis that weren't present in January 2019. I was wondering the reason for this. They are exact duplicates and do not show to bring any different information to the dataset.

library(TCGAbiolinks)

query <- GDCquery(project = "TCGA-LUAD", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", sample.type=c("Primary solid Tumor"), workflow.type = "HTSeq - Counts", legacy=FALSE)

GDCdownload(query)

data <- GDCprepare(query, save = TRUE, save.filename = "RNA_LUAD.rda", remove.files.prepared = F)

sum(duplicated(rownames(assays(data)$HTSeq)))

sessionInfo() R version 3.5.1 (2018-07-02) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.5 LTS

Matrix products: default BLAS: /usr/lib/libblas/libblas.so.3.6.0 LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
[9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8

attached base packages: [1] grid parallel stats4 stats graphics grDevices utils
[8] datasets methods base

other attached packages: [1] hgu133a.db_3.2.3 sva_3.30.0
[3] mgcv_1.8-25 nlme_3.1-137
[5] reshape2_1.4.3 dplyr_0.7.8
[7] gageData_2.20.0 gage_2.32.1
[9] RColorBrewer_1.1-2 bindrcpp_0.2.2
[11] consensusOV_1.4.1 MetaGxOvarian_1.2.0
[13] hgu133plus2.db_3.2.3 edgeR_3.24.3
[15] TCGAbiolinks_2.10.3 xlsx_0.6.1
[17] a4Base_1.30.0 a4Core_1.30.0
[19] a4Preproc_1.30.0 glmnet_2.0-16
[21] Matrix_1.2-15 multtest_2.38.0
[23] genefilter_1.64.0 mpm_1.0-22
[25] KernSmooth_2.23-15 annaffy_1.54.0
[27] KEGG.db_3.2.3 GO.db_3.7.0
[29] GEOquery_2.50.5 MetaGxBreast_1.2.0
[31] ExperimentHub_1.8.0 AnnotationHub_2.14.2
[33] impute_1.56.0 nsga2R_1.0
[35] mco_1.0-15.1 annotate_1.60.0
[37] XML_3.98-1.16 org.Hs.eg.db_3.7.0
[39] AnnotationDbi_1.44.0 WGCNA_1.66
[41] fastcluster_1.1.25 dynamicTreeCut_1.63-1
[43] biclust_2.0.1 colorspace_1.3-2
[45] MASS_7.3-51.1 shiny_1.2.0
[47] subspace_1.0.4 survminer_0.4.3
[49] ggpubr_0.2 magrittr_1.5
[51] genefu_2.14.0 AIMS_1.14.0
[53] e1071_1.7-0 iC10_1.4.2
[55] iC10TrainingData_1.3.1 pamr_1.55
[57] biomaRt_2.38.0 limma_3.38.3
[59] mclust_5.4.2 illuminaio_0.24.0
[61] caret_6.0-81 ggplot2_3.1.0
[63] lattice_0.20-38 matchingR_1.3.0
[65] Rcpp_1.0.0 gpuR_2.0.0
[67] cluster_2.0.7-1 cba_0.2-19
[69] proxy_0.4-22 doParallel_1.0.14
[71] iterators_1.0.10 foreach_1.4.4
[73] gplots_3.0.1 survcomp_1.32.0
[75] prodlim_2018.04.18 survival_2.43-1
[77] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[79] BiocParallel_1.16.2 matrixStats_0.54.0
[81] Biobase_2.42.0 GenomicRanges_1.34.0
[83] GenomeInfoDb_1.18.1 IRanges_2.16.0
[85] S4Vectors_0.20.1 BiocGenerics_0.28.0
[87] igraph_1.2.4

loaded via a namespace (and not attached): [1] Hmisc_4.1-1 class_7.3-14
[3] assertive.properties_0.0-4 Rsamtools_1.34.1
[5] crayon_1.3.4 backports_1.1.3
[7] rlang_0.3.0.1 XVector_0.22.0
[9] rjson_0.2.20 cmprsk_2.2-7
[11] bit64_0.9-7 glue_1.3.0
[13] tidyselect_0.2.5 km.ci_0.5-2
[15] tidyr_0.8.2 assertive.types_0.0-3
[17] zoo_1.8-4 SuppDists_1.1-9.4
[19] GenomicAlignments_1.18.1 xtable_1.8-3
[21] zlibbioc_1.28.0 hwriter_1.3.2
[23] rstudioapi_0.8 rpart_4.1-13
[25] GSVA_1.30.0 xfun_0.4
[27] caTools_1.17.1.1 KEGGREST_1.22.0
[29] tibble_1.4.2 interactiveDisplayBase_1.20.0 [31] flexclust_1.4-0 ggrepel_0.8.0
[33] base64_2.0 assertive.sets_0.0-3
[35] xlsxjars_0.6.1 Biostrings_2.50.1
[37] png_0.1-7 ipred_0.9-8
[39] withr_2.1.2 bitops_1.0-6
[41] plyr_1.8.4 assertive.base_0.0-7
[43] GSEABase_1.44.0 pcaPP_1.9-73
[45] assertive.models_0.0-2 ggvis_0.4.4
[47] pillar_1.3.1 GlobalOptions_0.1.0
[49] GenomicFeatures_1.34.3 assertive.matrices_0.0-2
[51] GetoptLong_0.1.7 assertive.reflection_0.0-4
[53] generics_0.0.2 lava_1.6.4
[55] tools_3.5.1 foreign_0.8-71
[57] munsell_0.5.0 fit.models_0.5-14
[59] compiler_3.5.1 httpuv_1.4.5.1
[61] rtracklayer_1.42.1 assertive.data.uk_0.0-2
[63] rJava_0.9-10 GenomeInfoDbData_1.2.0
[65] gridExtra_2.3 assertive.data.us_0.0-2
[67] later_0.7.5 recipes_0.1.4
[69] jsonlite_1.6 scales_1.0.0
[71] graph_1.60.0 lazyeval_0.2.1
[73] promises_1.0.1 latticeExtra_0.6-28
[75] R.utils_2.7.0 checkmate_1.8.5
[77] downloader_0.4 selectr_0.4-1
[79] yaml_2.2.0 survivalROC_1.0.3
[81] htmltools_0.3.6 memoise_1.1.0
[83] modeltools_0.2-22 locfit_1.5-9.1
[85] digest_0.6.18 rrcov_1.4-7
[87] assertthat_0.2.0 mime_0.6
[89] KMsurv_0.1-5 DESeq_1.34.1
[91] assertive.code_0.0-3 RSQLite_2.1.1
[93] amap_0.8-16 assertive.strings_0.0-3
[95] data.table_1.11.8 blob_1.1.1
[97] R.oo_1.22.0 survMisc_0.5.5
[99] preprocessCore_1.44.0 shinythemes_1.1.2
[101] splines_3.5.1 Formula_1.2-3
[103] RCurl_1.95-4.11 broom_0.5.1
[105] assertive.numbers_0.0-2 hms_0.4.2
[107] ConsensusClusterPlus_1.46.0 base64enc_0.1-3
[109] BiocManager_1.30.4 shape_1.4.4
[111] EDASeq_2.16.3 assertive.files_0.0-2
[113] nnet_7.3-12 matlab_1.0.2
[115] mvtnorm_1.0-8 circlize_0.4.5
[117] ModelMetrics_1.2.2 R6_2.3.0
[119] acepack_1.4.1 ShortRead_1.40.0
[121] gdata_2.18.0 robustbase_0.93-4
[123] assertive.data_0.0-3 stringr_1.3.1
[125] gower_0.1.2 htmlwidgets_1.3
[127] purrr_0.2.5 rvest_0.3.2
[129] ComplexHeatmap_1.20.0 openssl_1.1
[131] htmlTable_1.12 robust_0.4-18
[133] codetools_0.2-15 lubridate_1.7.4
[135] randomForest_4.6-14 gtools_3.8.1
[137] prettyunits_1.0.2 R.methodsS3_1.7.1
[139] gtable_0.2.0 DBI_1.0.0
[141] aroma.light_3.12.0 httr_1.4.0
[143] stringi_1.2.4 progress_1.2.0
[145] ggthemes_4.1.0 timeDate_3043.102
[147] xml2_1.2.0 assertive_0.3-5
[149] assertive.datetimes_0.0-2 rmeta_3.0
[151] additivityTests_1.1-4 readr_1.3.1
[153] geneplotter_1.60.0 DEoptimR_1.0-8
[155] bit_1.1-14 pkgconfig_2.0.2
[157] bootstrap_2017.2 bindr_0.1.1
[159] knitr_1.21

tiagochst commented 5 years ago

@martinguerrero89 I just found the problem. When we were annotating the genes using the ensembl database, for there were cases that one ENSEMBL ID was mapped to multiple entrezid. I still need to check all the cases but for HTSeq it was fixed. Those genes will have the same counts, just the entreiz metadata will be different.

BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks introduces duplicate rownames? #303