ejh243 / BrainFANS

Complex Disease Epigenomics Group's quality control and analysis pipelines for DNA methylation arrays, SNP arrays, BS-Seq, ATAC-Seq and ChIP-Seq
Other
8 stars 4 forks source link

[Bug]: Possible bug in QC.rmd script in DNA microarray QC pipeline #177

Closed dmsoanes closed 1 month ago

dmsoanes commented 1 month ago

Contact Details

d.m.soanes@exeter.ac.uk

What happened?

Microarray data was not cell sorted The first part of the QC pipeline seemed to work fine: 1: Column names were correct 2: raw.gds and rawNorm.gds files were produced 3: calcQCMetrics.r worked (files attached) - although there was a warning: Sentrix position model failed, skipping not sure if that is important 4: The markdown script QC.rmd failed:

How can the bug be reproduced?

Uploaded config, output, log and sample sheet files - other files can be found at /lustre/projects/Research_Project-MRC190311/DNAm/Darren/OHSU_006PLT_JN config.r.txt config.txt PassQCStatusAllSamples.csv QCDNAdata_2558375.err.txt QCDNAdata_2558375.log QCMetricsPostSampleCheck.csv sampleSheet.csv

Relevant log output

processing file: QC.rmd
Error in `contrasts<-`:
! contrasts can be applied only to factors with 2 or more levels
Backtrace:
 1. stats::lm(...)
 3. stats::model.matrix.default(mt, mf, contrasts)
 4. stats::`contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]])
Warning message:
In in_dir(input_dir(), expr) :
  You changed the working directory to /lustre/projects/Research_Project-MRC190311/DNAm/Darren/OHSU_006PLT_JN (probably via setwd()). It will be restored to /lustre/projects/Research_Project-MRC190311/DNAm/Darren/OHSU_006PLT_JN/original_scripts/BrainFANS/array/DNAm/preprocessing. See the Note section in ?knitr::knit

Quitting from lines 402-413 [controlProbeHeatmap] (QC.rmd)
Execution halted
sof202 commented 1 month ago

Another peculiar thing:

> str(QCmetrics)
'data.frame':   96 obs. of  61 variables:
 $ Plate                      : int  6 6 6 6 6 6 6 6 6 6 ...
 $ Chip_ID                    : int  61 61 61 61 61 61 61 61 62 62 ...
 $ Chip_Location              : chr  "A01" "B01" "C01" "D01" ...
 $ Individual_ID              : int  109131 109131 109131 109521 109521 109521 105512 105512 107911 107911 ...
 $ family                     : int  10913 10913 10913 10952 10952 10952 10551 10551 10791 10791 ...
 $ Sex                        : chr  "M" "M" "M" "F" ...

And yet:

if(class(QCmetrics[,item]) != "factor"){
  print(paste("Converting technical variable", item, "to factor."))
  tmpVar<-as.factor(QCmetrics[,item])
} 

Doesn't appear to be running either (as I can't find evidence of this message in any log/error file in this directory)

sof202 commented 1 month ago

Regardless I believe the error is this:

The "Plate" column has length(levels()) of 1. so in the following logic:

if(item %in% colnames(QCmetrics)){
  QCmetrics[which(QCmetrics[,item] == ""),item]<-NA
  if(class(QCmetrics[,item]) != "factor"){
    print(paste("Converting technical variable", item, "to factor."))
    tmpVar<-as.factor(QCmetrics[,item])
  } else {
    tmpVar<-QCmetrics[,item]
  }
  numVar<-length(levels(tmpVar))
  if(numVar == nrow(QCmetrics)){
    uniqueVar<-c(uniqueVar, item)
  } else {
          if(sum(is.na(tmpVar)) == length(tmpVar)){
              missingVar<-c(missingVar,item)
          } else {
              colVar<-rainbow(numVar)[tmpVar] ### convert to colours for plotting purposes
              plotCols<-data.frame(plotCols, colVar, stringsAsFactors = FALSE)
              legendParams[[item]]<-cbind(levels(tmpVar),rainbow(numVar))
          }
      }
} else {
  print(paste("Column", item, "not found. Please check config file.")) ### change to hard stop
  #quit(save = "no")
}

Neither missingVar nor uniqueVar are set, and so "Plate" is added to plotCols and is kept in techVar as a result. This is not the intended behaviour. If the length(levels()) of an item is 1, then contrasts will fail as it has the condition:

! contrasts can be applied only to factors with 2 or more levels

Solution: Add a check for the length(levels(tmpVar)) == 1

dmsoanes commented 1 month ago

So if I removed Plate from the list of technical variables in the config file, would that also solve the problem?

sof202 commented 1 month ago

Provided that there are no other columns that have length(levels()) equal to one yes.

EDIT: I've checked and this is not the case

> for (col in colnames(QCmetrics)) print(paste(col, length(levels(as.factor(QCmetrics[[col]])))))
[1] "Plate 1"
[1] "Chip_ID 12"
[1] "Chip_Location 96"
[1] "Individual_ID 34"
[1] "family 34"
[1] "Sex 2"
[1] "Age 96"
[1] "Sample_Collection_Date 95"
[1] "Sample_Concentration_Status 2"
[1] "GPSR.isolation.ID 96"
[1] "Sentrix_Position 8"
[1] "Sentrix_ID 12"
[1] "Basename 96"
[1] "M.median 94"
[1] "U.median 95"
[1] "intens.ratio 96"
[1] "intensPASS 1"
[1] "bisulfCon 96"
[1] "PC1_cp 96"
[1] "PC2_cp 96"
[1] "PC3_cp 96"
[1] "PC4_cp 96"
[1] "PC1_betas 96"
[1] "PC2_betas 96"
[1] "PC3_betas 96"
[1] "PC4_betas 96"
[1] "PC5_betas 96"
[1] "PC6_betas 96"
[1] "PC7_betas 96"
[1] "PC8_betas 96"
[1] "PC9_betas 96"
[1] "PC10_betas 96"
[1] "PC11_betas 96"
[1] "PC12_betas 96"
[1] "PC13_betas 96"
[1] "PC14_betas 96"
[1] "PC15_betas 96"
[1] "PC16_betas 96"
[1] "PC17_betas 96"
[1] "PC18_betas 96"
[1] "PC19_betas 96"
[1] "PC20_betas 96"
[1] "PC21_betas 96"
[1] "PC22_betas 96"
[1] "PC23_betas 96"
[1] "PC24_betas 96"
[1] "pFilter 1"
[1] "DNAmAge 96"
[1] "horvath.n_missing 1"
[1] "CCDNAmAge 96"
[1] "rmsd 96"
[1] "sdd 96"
[1] "sadd 96"
[1] "srms 96"
[1] "nNAs 22"
[1] "nNAsPer 22"
[1] "x.cp 96"
[1] "y.cp 96"
[1] "predSex.x 2"
[1] "predSex.y 2"
[1] "predSex 2"