YuanTian1991 / ChAMP


Debugging champ.norm #32

Open iskandari opened 1 year ago

iskandari commented 1 year ago

I have recently encountered the following cryptic error on some GSE datasets when normalizing with the BMIQ method:

myLoad <- champ.load(directory = getwd(), arraytype = "450K")
myNorm <- champ.norm(beta = myLoad$beta, method = "BMIQ", arraytype = "450K", cores = 6, plotBMIQ = FALSE)
[===========================]
[>>>>> ChAMP.NORM START <<<<<<]
-----------------------------
champ.norm Results will be saved in ./CHAMP_Normalization/
[ SWAN method call for BOTH rgSet and mset input, FunctionalNormalization call for rgset only , while PBC and BMIQ only needs beta value. Please set parameter correctly. ]

<< Normalizing data with BMIQ Method >>
Note that,BMIQ function may fail for bad quality samples (Samples did not even show beta distribution).
6 cores will be used to do parallel BMIQ computing.
Error in champ.BMIQ(beta[, x], design.v, sampleID = colnames(beta)[x],  : 
  task 165 failed - "need at least 2 points to select a bandwidth automatically"

I have found that sometimes a corrupted sample has to be removed from the original IDAT files, even though champ.load applies its filters automatically. Such samples can often be spotted because the IDAT file is significantly different in size (typically smaller) from the rest. Once they are removed, BMIQ runs.

However, I have recently come across several datasets that do not have obviously corrupted files judging by their size. After running champ.load(), champ.norm() with BMIQ fails. I have verified that there are no zero or null values in the beta matrix. How can one determine which bad sample(s) are causing this error that crashes normalization?

Thank you

wgmao commented 8 months ago

In the example you provided, "task 165 failed" means that the sample in the 165th column of myLoad$beta (the 165th row of myLoad$pd) is the one that failed, so that sample needs to be removed:

myNorm <- champ.norm(beta = myLoad$beta[, -165], method = "BMIQ", arraytype = "450K", cores = 6, plotBMIQ = FALSE)
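
If more than one sample is bad, removing them one task number at a time gets tedious. Below is a rough sketch, in base R only, of a way to collect every failing column in a single pass by wrapping the per-sample call in tryCatch. The helper name find_failing_columns and the toy matrix are made up for illustration and are not part of ChAMP. The stand-in function calls density() on a subset of each column, because, as far as I can tell, the message "need at least 2 points to select a bandwidth automatically" comes from density() when it receives fewer than two values.

```r
# Hypothetical helper (not part of ChAMP): apply `fun` to every column of
# `mat`, catching errors so that one bad column does not abort the run.
# Returns a data frame listing only the columns for which `fun` failed.
find_failing_columns <- function(mat, fun) {
  results <- lapply(seq_len(ncol(mat)), function(i) {
    msg <- tryCatch({
      fun(mat[, i])
      NA_character_                      # no error for this column
    }, error = function(e) conditionMessage(e))
    data.frame(
      index  = i,
      sample = if (is.null(colnames(mat))) NA_character_ else colnames(mat)[i],
      error  = msg,
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, results)
  out[!is.na(out$error), , drop = FALSE]
}

# Toy demonstration with made-up data (not real beta values from ChAMP).
# The stand-in function estimates a density over the values above 0.5,
# so a column with fewer than two such values makes density() fail.
set.seed(1)
toy <- matrix(runif(40, 0.6, 1), nrow = 10,
              dimnames = list(NULL, paste0("S", 1:4)))
toy[, 3] <- c(0.9, rep(0.1, 9))          # only one value above 0.5
bad <- find_failing_columns(toy, function(v) density(v[v > 0.5]))
print(bad)
```

With real data, `fun` would have to be replaced by something that runs BMIQ on a single sample. How best to do that with ChAMP is not shown here, so treat this as a pattern rather than a drop-in fix. Once the failing indices are known, they can be dropped from myLoad$beta in one go, as in the call above.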