Closed dmsoanes closed 2 months ago
I'm fairly confident that the backtrace points to these lines in the source code:
137 : if(ncol(plotCols) > 1){
138 : colnames(plotCols)<-c("Basename", techVar, bioVar)
139 : }
Some of the below columns aren't being created properly:
Basename:
plotCols<-data.frame("Basename" = QCmetrics$Basename)
techVar:
if(numVar == nrow(QCmetrics)){
uniqueVar<-c(uniqueVar, item)
} else {
if(sum(is.na(tmpVar)) == length(tmpVar)){
missingVar<-c(missingVar,item)
} else {
colVar<-rainbow(numVar)[tmpVar] ### convert to colours for plotting purposes
<<!!>> plotCols<-data.frame(plotCols, colVar, stringsAsFactors = FALSE)
legendParams[[item]]<-cbind(levels(tmpVar),rainbow(numVar))
}
}
### remove from techVar
techVar<-techVar[!techVar %in% c(uniqueVar, missingVar)]
bioVar:
if(numVar == nrow(QCmetrics)){
uniqueVar<-c(uniqueVar, item)
} else {
if(sum(is.na(tmpVar)) == length(tmpVar)){
missingVar<-c(missingVar,item)
} else {
colVar<-rainbow(numVar)[tmpVar] ### convert to colours for plotting purposes
<<!!>> plotCols<-data.frame(plotCols, colVar, stringsAsFactors = FALSE)
legendParams[[item]]<-cbind(levels(tmpVar),rainbow(numVar))
}
}
### remove from bioVar
bioVar<-bioVar[!bioVar %in% c(uniqueVar, missingVar)]
Here's the relevant part of Darren's config.R:
## technical variables
techVar<-c("Sentrix_ID","Sentrix_Position","Plate", "Chip_ID", "Chip_Location")
## biological variables
bioVar<-c("Individual_ID","Sex", "Age")
Further analysis:
# after the '##remove from ...' commands
length(BioVar) + length(techVar) = 5
# but plotCols has 5 columns, 1 being Basename
Basename
exists in Darren's sampleSheet.csv and QCmetrics.rdata. So you can rule that one out.
Also, all values in techVar
and bioVar
match (between the config and the QCmetrics.rdata) [log file concurs]
Config file has
techVar<-c("Sentrix_ID","Sentrix_Position", !! "Plate" !!, "Chip_ID", "Chip_Location")
But colnames(QCmetrics)
gives:
[1] !! "X...Plate" !! "Chip_ID"
[3] "Chip_Location" "Individual_ID"
[5] "family" "Sex"
[7] "Age" "Sample_Collection_Date"
[9] "Sample_Concentration_Status" "Sample_ID"
[11] "Sentrix_Position" "Sentrix_ID"
[13] "Basename" "M.median"
[15] "U.median" "intens.ratio"
[17] "intensPASS" "bisulfCon"
[19] "PC1_cp" "PC2_cp"
[21] "PC3_cp" "PC1_betas"
[23] "PC2_betas" "PC3_betas"
[25] "PC4_betas" "PC5_betas"
[27] "PC6_betas" "PC7_betas"
[29] "pFilter" "DNAmAge"
[31] "horvath.n_missing" "CCDNAmAge"
[33] "rmsd" "sdd"
[35] "sadd" "srms"
[37] "nNAs" "nNAsPer"
[39] "x.cp" "y.cp"
[41] "predSex.x" "predSex.y"
[43] "predSex"
Could this be it? Not sure why this isn't being caught by:
if(item %in% colnames(QCmetrics)){
...
} else {
print(paste("Column", item, "not found. Please check config file.")) ### change to hard stop
#quit(save = "no")
}
The log file doesn't have any mention of any columns not being found.
Update: @dmsoanes has found that this is a problem with their metadata. "" was found before "Plate" in the sampleSheet.csv when not using the locale en_US.utf8 (or other UTF-8 encodings I guess). The default locale for R on our HPC is "C", causing the "" character to be read as ... something (doesn't matter what, just that it was being read in).
This caused the read.csv
function (when reading the sample sheet) to view "Plate" as an invalid name. As such the read.csv
function ran make.names
on "Plate", converting it into "X...Plate", a valid name. "X...Plate" did not exist inside the techVar
in the config, resulting in the mismatch of columns.
I'm still not sure why this slipped through the check however.
if(length(techVar) > 0){
for(item in techVar){
### check if exists in QC metircs
if(item %in% colnames(QCmetrics)){
...
}
This check should have failed, but no sign of this can be found in the corresponding log files.
To see this yourself, run:
find /lustre/projects/Research_Project-MRC190311/DNAm/Darren/OHSU_006PLT_JN -type f -name "*.log" -o -name "*.err" | xargs grep "not found."
Regardless, because "X...Plate" is not in techVar
, the column name persists in techVar
after this line:
techVar<-techVar[!techVar %in% c(uniqueVar, missingVar)]
However, plotCols
never gets added to, so there is a mismatch in the number of columns.
Currently there is a call to use a stop
command instead of print
:
print(paste("Column", item, "not found. Please check config file.")) ### change to hard stop
I will amend this in a pull request.
Contact Details
d.m.soanes@exeter.ac.uk
What happened?
Microarray data was not cell sorted (small test dataset) The first part of the QC pipeline seemed to work fine: 1: Column names were correct 2: raw.gds and rawNorm.gds files were produced 3: calcQCMetrics.r worked (files attached) - although there was a warning: Sentrix position model failed, skipping not sure if that is important 4: The markdown script QC.rmd failed:
How can the bug be reproduced?
Uploaded config, output, log and sample sheet files - other files can be found at /lustre/projects/Research_Project-MRC190311/DNAm/Darren/OHSU_006PLT_JN config.txt PassQCStatusAllSamples.csv QCDNAdata_2556271.log QCMetricsPostSampleCheck.csv sampleSheet.csv config.r.txt QCDNAdata_2556271.err.txt
Relevant log output