Problems with Normalization using geneLength, but not gcContent.

Hello. Thank you for developing this amazing resource. I am having some problems with normalizing RNAseq data from TCGA. In brief, I am doing a "sanity check" using breast cancer data subdivided into estrogen receptor positive and negative subgroups. Normalization using this method produces expected results: dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(exp.brca, geneInfo = TCGAbiolinks::geneInfoHT, method = "gcContent")

Here is the result of analyzing the expression of the estrogen receptor using the following:

ESR1_Norm<- dataNorm["ENSG00000091831", normalSamples] ESR1_Pos_Tumors<- dataNorm ["ENSG00000091831", erPosTumors] ESR1_Neg_Tumors<- dataNorm ["ENSG00000091831", erNegTumors] boxplot(ESR1_Norm, ESR1_Pos_Tumors, ESR1_Neg_Tumors, main = "ESR1", names = c("ESR1_Norm","ESR1_Pos_Tumors", "ESR1_Neg_Tumors"), las=2, cex.axis=.7)

However, using the following produces results that are not reasonable:

dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(exp.brca, geneInfo = TCGAbiolinks::geneInfoHT, method = "geneLength")

I have looked at the code for normalization, and the issue seems to be linked to these lines of code in TCGAnalysis_Normalization:

rownames(tabDF) <- GenesCutID(as.matrix(rownames(tabDF)))
tabDF <- tabDF[rownames(tabDF) != "?", ]

if I look at the values for tabDF just before these lines are executes, I see this:

TCGA-B6-A0X5-01A-21R-A109-07	62284
TCGA-BH-A1FB-11A-33R-A13Q-07	20930
TCGA-A8-A093-01A-11R-A00Z-07	83207
TCGA-AN-A0FL-01A-11R-A034-07	62
TCGA-A2-A0D4-01A-11R-A00Z-07	40213
TCGA-BH-A0DG-01A-21R-A12P-07	11280
TCGA-E2-A572-01A-13R-A31O-07	7204
TCGA-C8-A278-01A-11R-A169-07	557
TCGA-A8-A09V-01A-11R-A034-07	39921

. . .

Immediately after the execution of these lines, I see this:

TCGA-B6-A0X5-01A-21R-A109-07	1
TCGA-BH-A1FB-11A-33R-A13Q-07	2
TCGA-A8-A093-01A-11R-A00Z-07	0
TCGA-AN-A0FL-01A-11R-A034-07	0
TCGA-A2-A0D4-01A-11R-A00Z-07	0
TCGA-BH-A0DG-01A-21R-A12P-07	0
TCGA-E2-A572-01A-13R-A31O-07	0
TCGA-C8-A278-01A-11R-A169-07	0
TCGA-A8-A09V-01A-11R-A034-07	0
TCGA-AN-A041-01A-11R-A034-07	0
TCGA-E9-A1N8-01A-11R-A144-07	0
TCGA-E2-A158-01A-11R-A12D-07	0

. . .

Finally, I am creating the SummarizedExperiment using the following, which download HTSeq files from the GDC:

query <- TCGAbiolinks::GDCquery(project = c("TCGA-BRCA"), data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification",sample.type = c("Primary solid Tumor","Solid Tissue Normal"), workflow.type = "HTSeq - Counts") TCGAbiolinks::GDCdownload(query) exp.brca<-TCGAbiolinks::GDCprepare(query = query, save = TRUE, save.filename = "brcaExp.rda")

Thank you for any insights.

Eric Shelden

An update: I think the line of code causing the issue is: rownames(tabDF) <- GenesCutID(as.matrix(rownames(tabDF)))

To see what was going on, I wrote out tabDF before and after execution of this line. I found that the count data remains the same, but the gene IDs are changing location in the array. My gene of interest starts in row 1945, but ends up after this line on row 29455. However, the count data stays in place, so my gene of interest gets the expression of whatever was in row 29455. I am showing the first 8 columns of rows 1945-1954 below, and you can see that the count data for each sample is the same.

Before the line executes:

ENSG00000091831 | 62284 | 20930 | 83207 | 62 | 40213 | 11280 | 7204 ENSG00000091844 | 85 | 93 | 71 | 17 | 64 | 130 | 4 ENSG00000091879 | 139 | 485 | 469 | 2176 | 806 | 860 | 97 ENSG00000091947 | 1279 | 1702 | 2749 | 1292 | 8599 | 1068 | 8512 ENSG00000091972 | 244 | 1279 | 437 | 237 | 531 | 964 | 146 ENSG00000091986 | 2080 | 17629 | 13902 | 1077 | 4431 | 20445 | 1339 ENSG00000092009 | 1 | 126 | 12 | 1 | 91 | 84 | 6 ENSG00000092010 | 10292 | 5428 | 9833 | 9197 | 42511 | 13504 | 27552 ENSG00000092020 | 922 | 763 | 937 | 459 | 1150 | 1018 | 691 ENSG00000092036 | 1860 | 1167 | 1658 | 600 | 2701 | 1784 | 778

After it executes:

ENSG00000111341 | 62284 | 20930 | 83207 | 62 | 40213 | 11280 | 7204 ENSG00000111348 | 85 | 93 | 71 | 17 | 64 | 130 | 4 ENSG00000111361 | 139 | 485 | 469 | 2176 | 806 | 860 | 97 ENSG00000111371 | 1279 | 1702 | 2749 | 1292 | 8599 | 1068 | 8512 ENSG00000111405 | 244 | 1279 | 437 | 237 | 531 | 964 | 146 ENSG00000111424 | 2080 | 17629 | 13902 | 1077 | 4431 | 20445 | 1339 ENSG00000111445 | 1 | 126 | 12 | 1 | 91 | 84 | 6 ENSG00000111452 | 10292 | 5428 | 9833 | 9197 | 42511 | 13504 | 27552 ENSG00000111490 | 922 | 763 | 937 | 459 | 1150 | 1018 | 691 ENSG00000111536 | 1860 | 1167 | 1658 | 600 | 2701 | 1784 | 778

BioinformaticsFMRP / TCGAbiolinks

Problems with Normalization using geneLength, but not gcContent. #154