ejh243 / BrainFANS

Complex Disease Epigenomics Group's quality control and analysis pipelines for DNA methylation arrays, SNP arrays, BS-Seq, ATAC-Seq and ChIP-Seq
Other
10 stars 4 forks source link

[Bug]: Possible bug in QC.rmd script in DNAm QC pipeline #173

Closed dmsoanes closed 2 months ago

dmsoanes commented 3 months ago

Contact Details

d.m.soanes@exeter.ac.uk

What happened?

Microarray data was not cell sorted (small test dataset) The first part of the QC pipeline seemed to work fine: 1: Column names were correct 2: raw.gds and rawNorm.gds files were produced 3: calcQCMetrics.r worked (files attached) - although there was a warning: Sentrix position model failed, skipping not sure if that is important 4: The markdown script QC.rmd failed:

How can the bug be reproduced?

Uploaded config, output, log and sample sheet files - other files can be found at /lustre/projects/Research_Project-MRC190311/DNAm/Darren/OHSU_006PLT_JN config.txt PassQCStatusAllSamples.csv QCDNAdata_2556271.log QCMetricsPostSampleCheck.csv sampleSheet.csv config.r.txt QCDNAdata_2556271.err.txt

Relevant log output

processing file: QC.rmd
Error in `names(x) <- value`:
! 'names' attribute [6] must be the same length as the vector [5]
Backtrace:
 1. base::`colnames<-`(`*tmp*`, value = `<chr>`)
 2. base::`colnames<-`(`*tmp*`, value = `<chr>`)
Warning message:
In in_dir(input_dir(), expr) :
  You changed the working directory to /lustre/projects/Research_Project-MRC190311/DNAm/Darren/OHSU_006PLT_JN (probably via setwd()). It will be restored to /lustre/projects/Research_Project-MRC190311/DNAm/Darren/OHSU_006PLT_JN/original_scripts/BrainFANS/array/DNAm/preprocessing. See the Note section in ?knitr::knit

Quitting from lines 61-141 [definPlotParams] (QC.rmd)
Execution halted
sof202 commented 3 months ago

I'm fairly confident that the backtrace points to these lines in the source code:

137 : if(ncol(plotCols) > 1){
138 :   colnames(plotCols)<-c("Basename", techVar, bioVar)
139 : }

Some of the below columns aren't being created properly:

Basename:

plotCols<-data.frame("Basename" = QCmetrics$Basename)

techVar:

if(numVar == nrow(QCmetrics)){
  uniqueVar<-c(uniqueVar, item)
} else {
  if(sum(is.na(tmpVar)) == length(tmpVar)){
    missingVar<-c(missingVar,item)
  } else {
    colVar<-rainbow(numVar)[tmpVar] ### convert to colours for plotting purposes
<<!!>>    plotCols<-data.frame(plotCols, colVar, stringsAsFactors = FALSE)
    legendParams[[item]]<-cbind(levels(tmpVar),rainbow(numVar))
  }
}

### remove from techVar 
techVar<-techVar[!techVar %in% c(uniqueVar, missingVar)]

bioVar:

if(numVar == nrow(QCmetrics)){
  uniqueVar<-c(uniqueVar, item)
} else {
  if(sum(is.na(tmpVar)) == length(tmpVar)){
    missingVar<-c(missingVar,item)
  } else {
  colVar<-rainbow(numVar)[tmpVar] ### convert to colours for plotting purposes
<<!!>>  plotCols<-data.frame(plotCols, colVar, stringsAsFactors = FALSE)
  legendParams[[item]]<-cbind(levels(tmpVar),rainbow(numVar))
  }
}

### remove from bioVar 
bioVar<-bioVar[!bioVar %in% c(uniqueVar, missingVar)]

Here's the relevant part of Darren's config.R:

## technical variables

techVar<-c("Sentrix_ID","Sentrix_Position","Plate", "Chip_ID", "Chip_Location")

## biological variables

bioVar<-c("Individual_ID","Sex", "Age")

Further analysis:

# after the '##remove from ...' commands
length(BioVar) + length(techVar) = 5
# but plotCols has 5 columns, 1 being Basename
sof202 commented 3 months ago

Basename exists in Darren's sampleSheet.csv and QCmetrics.rdata. So you can rule that one out. Also, all values in techVar and bioVar match (between the config and the QCmetrics.rdata) [log file concurs]

sof202 commented 3 months ago

Config file has

techVar<-c("Sentrix_ID","Sentrix_Position", !! "Plate" !!, "Chip_ID", "Chip_Location")

But colnames(QCmetrics) gives:

 [1] !! "X...Plate" !!             "Chip_ID"
 [3] "Chip_Location"               "Individual_ID"
 [5] "family"                      "Sex"
 [7] "Age"                         "Sample_Collection_Date"
 [9] "Sample_Concentration_Status" "Sample_ID"
[11] "Sentrix_Position"            "Sentrix_ID"
[13] "Basename"                    "M.median"
[15] "U.median"                    "intens.ratio"
[17] "intensPASS"                  "bisulfCon"
[19] "PC1_cp"                      "PC2_cp"
[21] "PC3_cp"                      "PC1_betas"
[23] "PC2_betas"                   "PC3_betas"
[25] "PC4_betas"                   "PC5_betas"
[27] "PC6_betas"                   "PC7_betas"
[29] "pFilter"                     "DNAmAge"
[31] "horvath.n_missing"           "CCDNAmAge"
[33] "rmsd"                        "sdd"
[35] "sadd"                        "srms"
[37] "nNAs"                        "nNAsPer"
[39] "x.cp"                        "y.cp"
[41] "predSex.x"                   "predSex.y"
[43] "predSex"

Could this be it? Not sure why this isn't being caught by:

if(item %in% colnames(QCmetrics)){
...
} else {
  print(paste("Column", item, "not found. Please check config file.")) ### change to hard stop
  #quit(save = "no")
}

The log file doesn't have any mention of any columns not being found.

sof202 commented 2 months ago

Update: @dmsoanes has found that this is a problem with their metadata. "" was found before "Plate" in the sampleSheet.csv when not using the locale en_US.utf8 (or other UTF-8 encodings I guess). The default locale for R on our HPC is "C", causing the "" character to be read as ... something (doesn't matter what, just that it was being read in).

This caused the read.csv function (when reading the sample sheet) to view "Plate" as an invalid name. As such the read.csv function ran make.names on "Plate", converting it into "X...Plate", a valid name. "X...Plate" did not exist inside the techVar in the config, resulting in the mismatch of columns.

I'm still not sure why this slipped through the check however.

if(length(techVar) > 0){
  for(item in techVar){
    ### check if exists in QC metircs
      if(item %in% colnames(QCmetrics)){
...
}

This check should have failed, but no sign of this can be found in the corresponding log files.

To see this yourself, run:

find /lustre/projects/Research_Project-MRC190311/DNAm/Darren/OHSU_006PLT_JN -type f -name "*.log" -o -name "*.err" | xargs grep "not found."

Regardless, because "X...Plate" is not in techVar, the column name persists in techVar after this line:

techVar<-techVar[!techVar %in% c(uniqueVar, missingVar)]

However, plotCols never gets added to, so there is a mismatch in the number of columns.

Currently there is a call to use a stop command instead of print:

print(paste("Column", item, "not found. Please check config file.")) ### change to hard stop

I will amend this in a pull request.