lack of filtering for Inf values in TCGABiolinks_Normalization() leads to loss of genes

Hello,

I'm using the TCGAanalyze_Normalization function for GC content normalization for all TCGA cancer types, working on data downloaded from GDC in 2021 (before the recent changes in GDC data release 32). I'm applying TCGAanalyze_Preprocessing first and the normalizing data for GC content using TCGAanalyze_Normalization.

Summing up, I think this leads to losing a significant number of genes during the following quantile filtering step, which we might be able to recover. Full explanation of this issue follows.

I realized that for some cancer types (HNSC, READ, SKCM, UVM) I got NaNs in the data after the Normalization step:

#Normalization with GC-content
dataNorm_COAD <- TCGAanalyze_Normalization(
 tabDF = dataPrep_COAD,
 geneInfo = geneInfoHT,
 method = "gcContent"
)
sum(is.nan(dataNorm_COAD))# = 348894

I get NaNs for different cancer types: HNSC: 528,768 NaNs READ: 133,760 NaNs SKCM: 124,072 NaNs UVM: 59,920 NaNs

I tried to investigate the issue and noticed that the withinLaneNormalization function from EDASeq returns both NaN and Inf values, the latter not being filtered before the betweenLaneNormalization function, which causes the presence of NaNs as output. To show where these come from, I ran the different steps of TCGAanalyze_Normalization and checked the the number of NaN or Inf I got from each (#check code blocks below):

message("Step 2 of 4: withinLaneNormalization ...")
tmp <- EDASeq::withinLaneNormalization(
 tmp, "gcContent",
 which = "upper",
 offset = TRUE
)

#check
sum(is.nan(EDASeq::normCounts(tmp)))# = 95498
sum(is.infinite(EDASeq::normCounts(tmp)))# = 28516
message("Step 3 of 4: betweenLaneNormalization ...")

if(any(is.na(EDASeq::normCounts(tmp)))) {
 tmp <- tmp[rowSums(is.na(EDASeq::normCounts(tmp))) == 0,]
}

#check
sum(is.nan(EDASeq::normCounts(tmp)))# = 0
sum(is.infinite(EDASeq::normCounts(tmp)))# = 16896

tmp <- EDASeq::betweenLaneNormalization(
 tmp,
 which = "upper",
 offset = TRUE
)

#check
sum(is.nan(EDASeq::offst(tmp)))# = 16896
sum(is.infinite(EDASeq::offst(tmp)))# = 0
sum(is.nan(EDASeq::normCounts(tmp)))# = 0
sum(is.infinite(EDASeq::normCounts(tmp)))# = 16896

normCounts <- log(rawCounts[rownames(tmp),] + .1) + EDASeq::offst(tmp)
normCounts <- floor(exp(normCounts) - .1)

#check
sum(is.nan(normCounts))# = 16896
sum(is.infinite(normCounts))# = 0

message("Step 4 of 4: .quantileNormalization ...")
tmp <- t(.quantileNormalization(t(normCounts)))
tabDF_norm <- floor(tmp)

#check
sum(is.nan(tabDF_norm))# = 381696
sum(is.infinite(tabDF_norm))# = 0
# In case NA's were produced to all rows
if(any(rowSums(is.na(tabDF_norm)) == ncol(tabDF_norm))){
    tabDF_norm <- tabDF_norm[rowSums(is.na(tabDF_norm)) != ncol(tabDF_norm),]
}

#check
sum(is.nan(tabDF_norm))# = 348894
sum(is.infinite(tabDF_norm))# = 0

After this, I tried using quantile filtering to the normalized counts with TCGAanalyze_Filtering(), I noticed that with your recent commit 'Update analyze.R' (ID b7b254e) this is possible despite the presence of NaNs thanks to na.rm=True:

if (method == "quantile") {
    GeneThresh <- as.numeric(quantile(rowMeans(tabDF,na.rm = TRUE), qnt.cut))
    geneFiltered <- names(which(rowMeans(tabDF) > GeneThresh))
    tabDF_Filt <- tabDF[geneFiltered,]
}

After applying TCGAanalyze_Normalization and TCGAanalyze_Filtering, this is the number of genes I managed to retrieve for each of the five cancer types:

COAD: 35,504 genes HNSC: 33,701 genes READ: 35,313 genes SKCM: 33,672 genes UVM: 35,002 genes

Nevertheless, I separately tried to a filter for Inf values in the 3rd step of TCGAanalyze_Normalization() as follows:

message("Step 3 of 4: betweenLaneNormalization ...")
if(any(is.na(EDASeq::normCounts(tmp))) || any(is.infinite(EDASeq::normCounts(tmp)))) {
 tmp <- tmp[rowSums(is.na(EDASeq::normCounts(tmp))) == 0,]
 tmp <- tmp[rowSums(is.infinite(EDASeq::normCounts(tmp))) == 0,]
}

This allowed to obtain 0 NaNs as output to the TCGAanalyze_Normalization() function and around 2,942 more genes on average for all five cancer types, without requiring na.rm=True in TCGAanalyze_Filtering(), nor the filter for Nas in TCGAanalyze_Normalization here reported:

# In case NA's were produced to all rows
if(any(rowSums(is.na(tabDF_norm)) == ncol(tabDF_norm))){
    tabDF_norm <- tabDF_norm[rowSums(is.na(tabDF_norm)) != ncol(tabDF_norm),]
}

The final gene sizes for the five cancer types would be the following:

COAD: 37,985 genes HNSC: 37,972 genes READ: 37,449 genes SKCM: 37,960 genes UVM: 36,534 genes

This means that by filtering out Inf values early, the normalization and filtering procedures are overall less impacted by the missing values.

Does this change look reasonable to you?

If you wish to include these changes, I could open a PR to include the changes in the TCGAanalyze_Normalization function at least for the case of gcContent normalization as I have not explored thoroughly the case of geneLength normalization yet.

Please let me know, thank you

BioinformaticsFMRP / TCGAbiolinks

lack of filtering for Inf values in TCGABiolinks_Normalization() leads to loss of genes #536