QuackenbushLab / NetworkDataCompanion

An R library of utilities for performing analyses on TCGA and GTEx data using the Network Zoo
GNU General Public License v3.0

Support for ignoring NAs of individual probes in mean promoter methylation computation #40

Open FischerJoBio opened 1 year ago

FischerJoBio commented 1 year ago

If any probe mapped to a gene has an NA in the beta file, probeToMeanPromoterMethylation returns NA as the mean value for the entire gene.

I suggest replacing

summarise_at(colnames(mappedBetasLong)[3:(ncol(mappedBetasLong))], mean)

with

summarise_at(colnames(mappedBetasLong)[3:(ncol(mappedBetasLong))], mean, na.rm = TRUE)

which solves this issue by ignoring individual probes whose beta value is NA when computing the mean.
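A minimal sketch of the difference, using a toy table. The layout of `mappedBetasLong` here (two identifier columns followed by one column per sample) and all probe, gene, and sample names are assumptions made for illustration, not the package's actual data:

```r
library(dplyr)

# Toy stand-in for mappedBetasLong; the real table's columns may differ.
mappedBetasLong <- data.frame(
  probe   = c("cg01", "cg02", "cg03", "cg04"),
  gene    = c("GENE_A", "GENE_A", "GENE_B", "GENE_B"),
  sample1 = c(0.2, NA, 0.5, 0.7),
  sample2 = c(0.4, 0.6, 0.1, 0.3)
)

# Sample columns start at position 3, as in the call quoted above.
sampleCols <- colnames(mappedBetasLong)[3:ncol(mappedBetasLong)]

# Without na.rm: a single NA probe makes GENE_A's sample1 mean NA.
withoutNaRm <- mappedBetasLong %>%
  group_by(gene) %>%
  summarise_at(sampleCols, mean)

# With na.rm = TRUE: the NA probe is dropped before averaging,
# so GENE_A's sample1 mean is computed from cg01 alone.
withNaRm <- mappedBetasLong %>%
  group_by(gene) %>%
  summarise_at(sampleCols, mean, na.rm = TRUE)
```

Extra arguments to `summarise_at` are forwarded to the summary function, which is why adding `na.rm = TRUE` is enough here.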

katehoffshutta commented 8 months ago

@FischerJoBio This makes sense to me, with two caveats:

  1. If ALL probes mapped to a particular gene are NA, this function will return NaN, which may then cause downstream problems because it's an unexpected value (NaN rather than NA)
  2. Do you think we should report this exclusion of missing data back to the user? For example, they may wish to know that only 3 of 10 probes were used to calculate the mean for Gene X because the others were missing. However, maybe this is too much info.
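
Both caveats can be illustrated with plain vectors in base R. This is only a sketch with made-up beta values. Note that `is.na()` also returns TRUE for NaN, so whether the NaN causes trouble depends on how downstream code checks for missing values:

```r
# Caveat 1: if every probe is NA, na.rm = TRUE leaves nothing to
# average, and mean() of an empty numeric vector is NaN.
allMissing <- c(NA_real_, NA_real_, NA_real_)
emptyMean <- mean(allMissing, na.rm = TRUE)

is.nan(emptyMean)  # TRUE
is.na(emptyMean)   # also TRUE: is.na() treats NaN as missing

# Caveat 2: the number of probes that actually contributed can be
# counted alongside the mean, so it could be reported to the user.
betas  <- c(0.31, NA, 0.27, NA, NA)  # made-up values for one gene
nUsed  <- sum(!is.na(betas))         # probes with an observed value
nTotal <- length(betas)              # probes mapped to the gene
```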

What do you think about these?

FischerJoBio commented 8 months ago

@katehoffshutta Both are good points, and neither the original code nor the suggested fix addresses them.

My suggestion for 1) would be to catch this case and replace NaN with NA, which keeps the same meaning and should work for downstream processing. For 2), maybe we should introduce a mincoverage parameter (defaulting to 1) so that users can avoid these non-robust estimates if they do not want them, and additionally add a verbose option that raises the warnings. My gut feeling is that there would be too many warnings to be meaningful, but it gives the user the option to inspect them.
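
A rough sketch of how the two suggestions could fit together. The helper name `meanWithMinCoverage` and its arguments are hypothetical and not part of the package. It also assumes that mincoverage means the minimum number of non-NA probes required per gene. Reading it as a minimum fraction of probes would be equally plausible and would need a different check:

```r
# Hypothetical helper, not part of NetworkDataCompanion.
# minCoverage is assumed to be a minimum COUNT of non-NA probes.
meanWithMinCoverage <- function(x, minCoverage = 1, verbose = FALSE) {
  nUsed <- sum(!is.na(x))

  # Too few observed probes (including none at all): return NA
  # instead of letting mean(..., na.rm = TRUE) produce NaN.
  if (nUsed == 0 || nUsed < minCoverage) {
    if (verbose) {
      warning(sprintf("only %d of %d probes observed; returning NA",
                      nUsed, length(x)), call. = FALSE)
    }
    return(NA_real_)
  }

  # Enough probes, but some were dropped: optionally tell the user.
  if (verbose && nUsed < length(x)) {
    warning(sprintf("mean computed from %d of %d probes",
                    nUsed, length(x)), call. = FALSE)
  }

  mean(x, na.rm = TRUE)
}

meanWithMinCoverage(c(0.2, NA, 0.4))                  # mean of the two observed probes
meanWithMinCoverage(c(NA, NA))                        # NA rather than NaN
meanWithMinCoverage(c(0.2, NA, NA), minCoverage = 2)  # NA: too few observed probes
```

Under these assumptions, the helper could be passed to `summarise_at` in place of `mean`, with `minCoverage` and `verbose` supplied as extra arguments.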