Hello,
I'm using the TCGAanalyze_Normalization function for GC content normalization across all TCGA cancer types, working on data downloaded from GDC in 2021 (before the recent changes in GDC data release 32). I apply TCGAanalyze_Preprocessing first and then normalize the data for GC content using TCGAanalyze_Normalization.
In short, I think that Inf values left unfiltered during normalization lead to losing a significant number of genes in the subsequent quantile filtering step, and that these genes might be recoverable. A full explanation follows.
For some cancer types (HNSC, READ, SKCM, UVM), I got NaNs in the data after the normalization step:
HNSC: 528,768 NaNs
READ: 133,760 NaNs
SKCM: 124,072 NaNs
UVM: 59,920 NaNs
I investigated the issue and noticed that the withinLaneNormalization function from EDASeq returns both NaN and Inf values. The Inf values are not filtered out before the betweenLaneNormalization function is called, which causes the NaNs in the output. To show where these come from, I ran the individual steps of TCGAanalyze_Normalization and checked the number of NaN or Inf values I got from each (#check code blocks below):
After this, I applied quantile filtering to the normalized counts with TCGAanalyze_Filtering(). With your recent commit 'Update analyze.R' (ID b7b254e), this is possible despite the presence of NaNs thanks to na.rm=TRUE.
After applying TCGAanalyze_Normalization and TCGAanalyze_Filtering, this is the number of genes I managed to retrieve for each of the five cancer types:
COAD: 35,504 genes
HNSC: 33,701 genes
READ: 35,313 genes
SKCM: 33,672 genes
UVM: 35,002 genes
Separately, I tried adding a filter for Inf values in the 3rd step of TCGAanalyze_Normalization() as follows:
This produced 0 NaNs in the output of TCGAanalyze_Normalization() and around 2,942 more genes on average across the five cancer types, without requiring na.rm=TRUE in TCGAanalyze_Filtering() or the filter for NAs in TCGAanalyze_Normalization reported here:
    # In case NA's were produced to all rows
    if(any(rowSums(is.na(tabDF_norm)) == ncol(tabDF_norm))){
        tabDF_norm <- tabDF_norm[rowSums(is.na(tabDF_norm)) != ncol(tabDF_norm),]
    }
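A minimal base R sketch of the idea behind an early Inf filter: count NaN and Inf entries separately, then drop genes with any non-finite value before a later step uses them. The helper names (count_nonfinite, drop_nonfinite_rows) and the toy matrix are hypothetical and are not TCGAbiolinks or EDASeq code. Dropping whole rows is only one possible choice, assumed here for illustration.

```r
# Toy matrix standing in for values after a within-lane step
# (hypothetical numbers; genes in rows, samples in columns).
m <- matrix(
  c(10,  12,   11,
    Inf,  5,    7,
    3,   NaN,   4,
    8,    9, -Inf),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Count NaN and Inf entries separately; is.na() alone does not flag Inf.
count_nonfinite <- function(x) {
  x <- as.matrix(x)
  c(n_nan = sum(is.nan(x)), n_inf = sum(is.infinite(x)))
}

# Keep only genes whose values are all finite (drops rows with NA, NaN or Inf).
drop_nonfinite_rows <- function(x) {
  x <- as.matrix(x)
  x[rowSums(!is.finite(x)) == 0, , drop = FALSE]
}

# Arithmetic on Inf can yield NaN, which is how Inf can turn into NaN downstream.
inf_becomes_nan <- is.nan(Inf - Inf)

before   <- count_nonfinite(m)
filtered <- drop_nonfinite_rows(m)
after    <- count_nonfinite(filtered)

print(before)
print(after)
print(rownames(filtered))
```

Checking is.finite() rather than is.na() matters here because is.na(Inf) is FALSE, so an NA-only filter would let Inf values through.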
The final gene counts for the five cancer types would be the following:
COAD: 37,985 genes
HNSC: 37,972 genes
READ: 37,449 genes
SKCM: 37,960 genes
UVM: 36,534 genes
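The average gain quoted above can be checked directly from the two lists of gene counts. The short R snippet below only redoes that arithmetic; the variable names are illustrative.

```r
cancer <- c("COAD", "HNSC", "READ", "SKCM", "UVM")

# Genes retained with na.rm=TRUE in the filtering step (first list)
genes_na_rm <- c(35504, 33701, 35313, 33672, 35002)

# Genes retained when Inf values are filtered during normalization (second list)
genes_inf_filter <- c(37985, 37972, 37449, 37960, 36534)

# Per-cancer gain and its mean
gain <- setNames(genes_inf_filter - genes_na_rm, cancer)
mean_gain <- mean(gain)

print(gain)              # COAD 2481, HNSC 4271, READ 2136, SKCM 4288, UVM 1532
print(round(mean_gain))  # 2942
```

The gain is largest for HNSC and SKCM (about 4,300 genes each) and smallest for UVM (about 1,500 genes).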
This means that by filtering out Inf values early, the normalization and filtering procedures are overall less impacted by the missing values.
Does this change look reasonable to you?
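To illustrate why missing values can cost genes at the quantile filtering step, here is a toy sketch in base R. It assumes a filter that keeps genes whose row mean exceeds a quantile of all row means. The function quantile_filter and its arguments are hypothetical stand-ins, not the actual TCGAanalyze_Filtering implementation.

```r
# Hypothetical stand-in for a quantile filter on row means.
quantile_filter <- function(x, qnt_cut = 0.25, na_rm = TRUE) {
  gene_means <- rowMeans(x)  # missing (NaN) for any gene with a NaN entry
  thresh <- unname(quantile(gene_means, qnt_cut, na.rm = na_rm))
  keep <- which(gene_means > thresh)  # which() drops NA comparisons
  x[keep, , drop = FALSE]
}

x <- matrix(
  c(1,    2,    3,
    10,   20,   30,
    100,  NaN,  300,
    1000, 2000, 3000),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# With na.rm = TRUE the threshold can be computed, but gene3 is still lost:
# its row mean is missing, so the comparison with the threshold is NA.
kept <- quantile_filter(x)
print(rownames(kept))

# Without na.rm = TRUE, quantile() refuses input containing missing values.
res_no_na_rm <- try(quantile_filter(x, na_rm = FALSE), silent = TRUE)
print(inherits(res_no_na_rm, "try-error"))
```

In this toy case a single NaN removes a highly expressed gene from the result, which is the kind of loss that avoiding NaNs upstream would prevent.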
If you would like to include this change, I could open a PR for the TCGAanalyze_Normalization function, at least for the case of gcContent normalization, as I have not yet thoroughly explored the case of geneLength normalization.
Please let me know. Thank you!