Issue with discrepancy in the number of counts for genes with TCGAanalyze_Normalization using geneLength versus gcContent

Dear people, based on an issue noticed in a previous post concerning survival analysis, i inspected the outputs of normalization of raw HTSeq counts using TCGAbiolinks and the TCGAanalyze_Normalization function, using both geneLength and gcContent options, in order to examine the relative counts for a specific vector of genes, selected for downstream analysis. The relative steps are illustrated in the above code chunk, in which only filtering was not perform, in order not to loose any of the genes.

load("hg38.coad.updated.htseqcounts.rda")
data
class: RangedSummarizedExperiment 
dim: 56963 521 
metadata(1): data_release
assays(1): HTSeq - Counts
rownames(56963): ENSG00000000003 ENSG00000000005 ...
  ENSG00000281912 ENSG00000281920
rowData names(3): ensembl_gene_id external_gene_name
  original_ensembl_gene_id
colnames(521): TCGA-3L-AA1B-01A-11R-A37K-07
  TCGA-DM-A1D8-01A-11R-A155-07 ...
  TCGA-AA-3675-01A-02R-0905-07 TCGA-G4-6323-01A-11R-1723-07
colData names(101): sample patient ...
  subtype_vascular_invasion_present subtype_vital_status

coad_data<-data[,  !data$is_ffpe]
coad_tp <- coad_data[, TCGAquery_SampleTypes(colnames(coad_data), typesample="TP")]
coad_nt <- coad_data[, TCGAquery_SampleTypes(colnames(coad_data), typesample="NT")]

coad_data_norm_1 <- TCGAanalyze_Normalization(coad_data[,c(colnames(coad_tp), colnames(coad_nt))],TCGAbiolinks::geneInfoHT) #normalization with geneLength

eg = as.data.frame(bitr(rownames(coad_data_norm_1),
+                         fromType="ENSEMBL",
+                         toType="SYMBOL",
+                         OrgDb="org.Hs.eg.db")) # annotation to gene symbols
'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(rownames(coad_data_norm_1), fromType = "ENSEMBL", toType = "SYMBOL",  :
  0.31% of input gene IDs are fail to map...

eg <- eg[!duplicated(eg$SYMBOL),]
eg.ensembl <- as.character(eg$ENSEMBL)
dd <- match(eg.ensembl,rownames(coad_data_norm_1))

dat_filt_1 <- coad_data_norm_1[dd,]
rownames(dat_filt_1) <- as.character(eg$SYMBOL)

coad_data_norm_2 <- TCGAanalyze_Normalization(coad_data[,c(colnames(coad_tp), colnames(coad_nt))],TCGAbiolinks::geneInfoHT,method="gcContent") # with gcContent

...# using the same approach as above to end with a similar object called dat_filt_2
dim(dat_filt_2)
[1] 23366   506
dim(dat_filt_1)
[1] 23366   506

sig.genes <- read.table("hubs.signature.94.txt",header=T,stringsAsFactors = FALSE) # the input vector of needed gene symbols for downstream analysis such as clustering

sig.genes <- sig.genes$SYMBOL

cc1<- intersect(rownames(dat_filt_1),sig.genes)
cc2<- intersect(rownames(dat_filt_2),sig.genes)
length(cc1)
[1] 94
length(cc2)
[1] 94 # just for confirmation

filt.1 <- dat_filt_1[sig.genes,] #  geneLength
filt.2 <- dat_filt_2[sig.genes,] #  gcContent

filt.1[3:5,1:10]
       TCGA-3L-AA1B-01A-11R-A37K-07 TCGA-DM-A1D8-01A-11R-A155-07
PSMD14                         1145                         1797
EMG1                              1                            4
CEBPB                             0                            0
       TCGA-AU-6004-01A-11R-1723-07 TCGA-T9-A92H-01A-11R-A37K-07
PSMD14                         1505                         1553
EMG1                              8                            0
CEBPB                             0                            0
       TCGA-AA-A01T-01A-21R-A16W-07 TCGA-CK-5913-01A-11R-1653-07
PSMD14                          788                         1254
EMG1                              0                            0
CEBPB                             0                            0
       TCGA-AD-6889-01A-11R-1928-07 TCGA-A6-6650-01A-11R-1774-07
PSMD14                         1322                          891
EMG1                              2                            2
CEBPB                             0                            0
       TCGA-AA-A022-01A-21R-A16W-07 TCGA-CM-5860-01A-01R-1653-07
PSMD14                         1208                          545
EMG1                              2                            0
CEBPB                             0                            0

filt.2[3:5,1:10]
       TCGA-3L-AA1B-01A-11R-A37K-07 TCGA-DM-A1D8-01A-11R-A155-07
PSMD14                         4185                         4680
EMG1                            631                          622
CEBPB                          1250                         2139
       TCGA-AU-6004-01A-11R-1723-07 TCGA-T9-A92H-01A-11R-A37K-07
PSMD14                         4045                         6873
EMG1                            634                         1232
CEBPB                          1452                         1648
       TCGA-AA-A01T-01A-21R-A16W-07 TCGA-CK-5913-01A-11R-1653-07
PSMD14                        13036                         4241
EMG1                           1847                         1047
CEBPB                          3085                         1350
       TCGA-AD-6889-01A-11R-1928-07 TCGA-A6-6650-01A-11R-1774-07
PSMD14                         5735                         4227
EMG1                            746                          900
CEBPB                          1523                         2321
       TCGA-AA-A022-01A-21R-A16W-07 TCGA-CM-5860-01A-01R-1653-07
PSMD14                         6008                         4254
EMG1                            542                          901
CEBPB                          1228                         4327

As you can see from an illustrative subset of the normalized counts, in the second approach with gcContent, for example CEBP shows a high number of counts, whereas in the first case with geneLength, has 0 counts (also this is the case for a bigger number of samples. Also, the same is relatively evident in the other shown genes. Thus, this would not be an issue or concern for any downstream analysis ? For example if filtering is performed after normalization ? as some genes/genomic features could be lost ?

Or i misinterpret something here ?

Best,

Efstathios-Iason Vlachavas

BioinformaticsFMRP / TCGAbiolinks

Issue with discrepancy in the number of counts for genes with TCGAanalyze_Normalization using geneLength versus gcContent #173