[Bug]: Possible bug in QC.rmd script in DNA microarray QC pipeline

dmsoanes commented 1 month ago

Contact Details

d.m.soanes@exeter.ac.uk

What happened?

Microarray data was not cell sorted The first part of the QC pipeline seemed to work fine: 1: Column names were correct 2: raw.gds and rawNorm.gds files were produced 3: calcQCMetrics.r worked (files attached) - although there was a warning: Sentrix position model failed, skipping not sure if that is important 4: The markdown script QC.rmd failed - SexMismatches.csv and PlotsWithinDNAmMismatches.pdf files produced but contained no data.

How can the bug be reproduced?

Uploaded config, output, log and sample sheet files - other files can be found at /lustre/projects/Research_Project-MRC190311/DNAm/Darren/OHSU_006PLT_JN config.r.txt config.txt PassQCStatusAllSamples.csv QCDNAdata_2559367.err.txt QCDNAdata_2559367.log QCMetricsPostSampleCheck.csv

Relevant log output

processing file: QC.rmd

processing file: ./rmarkdownChild/sexCheck.rmd
Error in `$<-.data.frame`:
! replacement has 94 rows, data has 96
Backtrace:
 1. base::`$<-`(`*tmp*`, predictedduplicates, value = `<chr>`)
 2. base::`$<-.data.frame`(`*tmp*`, predictedduplicates, value = `<chr>`)
Warning message:
In in_dir(input_dir(), expr) :
  You changed the working directory to /lustre/projects/Research_Project-MRC190311/DNAm/Darren/OHSU_006PLT_JN (probably via setwd()). It will be restored to /lustre/projects/Research_Project-MRC190311/DNAm/Darren/OHSU_006PLT_JN/original_scripts/BrainFANS/array/DNAm/preprocessing. See the Note section in ?knitr::knit

Quitting from lines 858-875 [findDuplicateSamplesPostQC] (QC.rmd)
Execution halted

sof202 commented 1 month ago

Initial notes:

code section:

passQC<-QCSum$Basename[as.logical(QCSum[,"passQCS2"])]

snpCor<-snpCor[passQC, passQC]

if(max(snpCor, na.rm = TRUE) > 0.8){
    # pull out samples that are genetically identical
    predictedduplicates<-vector(length = ncol(snpCor))
    for(i in 1:ncol(snpCor)){
      predictedduplicates[i]<-paste(sort(names(which(snpCor[,i] > 0.8))), sep = "|", collapse = "|")
    }
  } else {
      predictedduplicates<-colnames(snpCor)
    } 

  QCmetrics$predictedduplicates<-predictedduplicates

The max(snpCor) for Darren's data gives 1, so the if statement is running.

The number of rows in predictedduplicates is not coming out as the same number of rows as in QCmetrics. I have checked and snpCor has the same number of column as QCmetrics initially.

Problem lies here:

snpCor<-snpCor[passQC, passQC]

This bit here can make it so that snpCor has less columns. Causing a there to be too few columns in snpCor (now less than there are rows in QCmetrics).

This is covered for in the if statement with

predictedduplicates<-vector(length = ncol(snpCor))

But unfortunately, this line comes too late. The number of columns in snpCor has already changed by this point.

sof202 commented 1 month ago

I do not know for sure what the intention is behind this block. My guess is this:

You want to filter snpCor to only those that pass the QC. Then you want to use these to get predicted duplicates (those with a very high correlation). For all other rows, you would just want an empty string as there are no predicted duplicates.

If this is the case. Do we want there to be NAs (or something similar) in predictedduplicates for the rows that did not pass QC? Or do we want to subset QCmetrics on passQC as well to get these variables to have the same length?

Maybe something else?

sof202 commented 1 month ago

Also this if statement:

if(max(snpCor, na.rm = TRUE) > 0.8){
...
}

will always run as snpCor's diagonal is 1s. This conditional logic can be removed no?

ejh243 / BrainFANS