BIMSBbioinfo / pigx_rnaseq

Bulk RNA-seq Data Processing, Quality Control, and Downstream Analysis Pipeline
GNU General Public License v3.0

Spreadsheet-generated sample_sheet not properly importable - early warning? #111

Closed smoe closed 2 years ago

smoe commented 2 years ago

I ran into

output/logs/hisat2/collate_read_counts.log
Error in merge.data.table(dtf1, dtf2, by = "V1", all.x = TRUE) : 
  x has some duplicated column name(s): V2.x,V2.y. Please remove or rename the duplicate(s) and try again.
Calls: as.data.frame -> Reduce -> f -> merge -> merge.data.table
In addition: Warning message:
In merge.data.table(dtf1, dtf2, by = "V1", all.x = TRUE) :
  column names 'V2.x', 'V2.y' are duplicated in the result
Execution halted

I found a string that contained a comma, which may have induced an extra column. Another header was just one of those special characters that R likes to substitute. I'll find out whether I have now fixed that problem. Either way, what could pigx do to confirm the validity of the sample sheet? I skimmed through your validate_input script's read_sample_sheet and would be tempted to check the headers and that all rows have the same number of fields. Reasonable? Shall I also warn about fields containing the delimiter, presuming that your csv reader respects quotes?
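For illustration, a minimal sketch of such a check (the function name and the required column names are made up for this example, not pigx's actual API), assuming a plain CSV sample sheet. count.fields() respects quoted delimiters by default, so a comma inside a quoted field does not count as an extra column:

```r
# Hypothetical validator sketch -- names are illustrative, not pigx's API
validate_sample_sheet <- function(path, required = c("name", "reads")) {
  nf <- count.fields(path, sep = ",")  # quoted commas are not split
  if (length(unique(nf)) != 1) {
    stop("Rows with a deviating number of fields: line(s) ",
         paste(which(nf != nf[1]), collapse = ", "))
  }
  header <- colnames(read.csv(path, nrows = 1, check.names = FALSE))
  missing <- setdiff(required, header)
  if (length(missing) > 0) {
    stop("Missing required column(s): ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}
```

A ragged row or a missing required header then fails early, before any downstream merge can produce a confusing error.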

borauyar commented 2 years ago

Yes, I think that makes sense (checking the number of fields in each row), and the csv reader can be assumed to respect quotes.

smoe commented 2 years ago

I will prepare a corresponding PR. Sadly, I had already done these checks manually on the sample sheet, and the sample sheet itself is not the direct input here. Do you find anything suspicious in

$ head .../output/colData.tsv
"sample_type"   "group"
"1"     "hOF_control_control"   "hOF_control_control"
"2"     "hOF_TGF_control"       "hOF_TGF_control"
"3"     "hOF_control_Pirf"      "hOF_control_Pirf"
"4"     "hOF_TGF_Pirf"  "hOF_TGF_Pirf"
"5"     "hTF_control_control"   "hTF_control_control"
"6"     "hTF_TGF_control"       "hTF_TGF_control"
"7"     "hTF_control_Pirf"      "hTF_control_Pirf"
"8"     "hTF_TFG_Pirf"  "hTF_TFG_Pirf"
"9"     "hTF_control_control"   "hTF_control_control"
...

or in the .sf files as in

$ head .../output/salmon_output/110/quant.genes.sf
Name    Length  EffectiveLength TPM     NumReads
ENST00000426884.1       112     3       0       0
ENST00000457982.3       2291    2448.12 2.51626 83.754
ENST00000621768.1       822     571     0       0
ENST00000436895.1       1288    1037    0       0
ENST00000649395.1       614     363     0       0
ENST00000447882.1       704     453     0       0
ENST00000505954.1       962     711     0       0
ENST00000424379.1       260     23      0       0
ENST00000604816.1       211     10      0       0
...

?
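As an aside on the colData.tsv excerpt above: its header line has one field fewer than the data rows. That is the default behavior of R's write.table when row names are written; col.names = NA adds the missing (empty) leading header field. A minimal demo of the two shapes (my own illustration, not the pipeline's code):

```r
df <- data.frame(sample_type = c("x", "y"), group = c("x", "y"),
                 row.names = c("1", "2"))
f1 <- tempfile(); write.table(df, f1)                  # header one field short
f2 <- tempfile(); write.table(df, f2, col.names = NA)  # empty leading header field
unique(count.fields(f1))  # 2 3 -- header and data rows disagree
unique(count.fields(f2))  # 3   -- consistent field counts throughout
```

This is the same shape mismatch that the col.names fix further down in this thread addresses.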

I looked at the R code producing the error (/usr/libexec/pigx_rnaseq/scripts/collate_read_counts.R) but did not find anything obviously wrong with it. I would need to debug this now, but would happily receive directions.

To avoid stalling myself I will switch from hisat2 to STAR in the meantime.

smoe commented 2 years ago

Sadly, I would need to change machines to satisfy the memory requirements of STAR, which I now want to avoid. I consequently had another look at the collate_read_counts.R script, and: may I ask for a simplification? And some extra checks?

At https://github.com/BIMSBbioinfo/pigx_rnaseq/blob/de9b37df88f701ada98c6ded2cc1d50f4111f319/scripts/collate_read_counts.R#L30 you read all the data in with an lapply, as data frames. I think that if you did this with an sapply instead, returning only the second column with the numbers, you would be right about where you want to be: sapply would do the simplification for you and return everything as a single matrix.

I am a fan of named lists, so I suggest

names(counts) <- count_files

:)

Now I thought I want to make sure that all files have the same dimensions:

counts.dims <- sapply(counts, dim)

if (any(counts.dims[2, ] != 2)) {
  cat("E: These files have more than two columns of read counts:\n")
  print(count_files[counts.dims[2, ] != 2])
  stop(paste0("The format of read_counts.csv files in ", input_dir,
              " does not match expectations."))
}

if (any(counts.dims[1, ] != counts.dims[1, 1])) {
  cat("E: These files have a different number of genes than the first:\n")
  print(count_files[counts.dims[1, ] != counts.dims[1, 1]])
  stop(paste0("The format of read_counts.csv files in ", input_dir,
              " does not match expectations."))
}

Maybe this could be done in a simpler way, but it is what I had in my fingers.
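For what it's worth, the same two checks can be sketched more compactly with named stopifnot conditions (R >= 3.5 turns the names into the error messages). The counts list here is a toy stand-in for the real one read from the read_counts.csv files:

```r
# toy stand-in for the list of two-column data frames read from disk
counts <- list(s1 = data.frame(V1 = c("g1", "g2"), V2 = c(10, 20)),
               s2 = data.frame(V1 = c("g1", "g2"), V2 = c(30, 40)))

stopifnot(
  "a read_counts.csv file does not have exactly two columns" =
    all(sapply(counts, ncol) == 2),
  "read_counts.csv files differ in their number of rows (genes)" =
    length(unique(sapply(counts, nrow))) == 1
)
```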

To then create a matrix from it, this works for me:

# Check order of names in V1
m <- NULL
for (i in seq_along(counts)) {
  if (any(counts[[1]][, 1] != counts[[i]][, 1])) {
    stop(paste0("Order of genes in ", count_files[i],
                " differs from order in first file ", count_files[1], "."))
  }
  m <- cbind(m, counts[[i]][, 2])
}
rownames(m) <- as.matrix(counts[[1]][, 1])[, 1]
# note the escaped dots -- an unescaped "." would match any character
colnames(m) <- gsub(pattern = "\\.read_counts\\.csv$", replacement = "",
                    x = basename(count_files))

Is this what the script shall produce? Then maybe we can go for something like that? I presume the current implementation is meant to be more general, since the order of V1 is, as I understand it, not required to be identical across samples; at least for hisat2, though, the order does seem to be consistent. And, https://github.com/BIMSBbioinfo/pigx_rnaseq/blob/de9b37df88f701ada98c6ded2cc1d50f4111f319/scripts/collate_read_counts.R#L35 fails for me:

> counts_all <- as.data.frame(Reduce(function(dtf1, dtf2)
  merge(dtf1, dtf2, by = "V1", all.x = TRUE),
       counts))
Error in merge.data.table(dtf1, dtf2, by = "V1", all.x = TRUE) : 
  x has some duplicated column name(s): V2.x,V2.y. Please remove or rename the duplicate(s) and try again.
In addition: Warning message:
In merge.data.table(dtf1, dtf2, by = "V1", all.x = TRUE) :
  column names 'V2.x', 'V2.y' are duplicated in the result

@borauyar, please direct me. I presume you use this script for both hisat2 and STAR, so you cannot rely on consistent table lengths and gene order across the .csv files. Maybe have two implementations? Or first check whether the gene lists are conserved and then decide for one or the other? Anyway, I'll keep debugging this.
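To make the two-path idea concrete, one possible shape (my own sketch with a toy counts list, not the pipeline's code): check whether the gene order is conserved, cbind on the fast path, and otherwise merge after renaming each count column to its sample name, so that merge never has to invent .x/.y suffixes at all:

```r
# toy stand-in for the list read from the read_counts.csv files
counts <- list(s1 = data.frame(V1 = c("g1", "g2"), V2 = c(1, 2)),
               s2 = data.frame(V1 = c("g2", "g1"), V2 = c(3, 4)))

genes <- counts[[1]][[1]]
same_order <- all(sapply(counts, function(x) identical(x[[1]], genes)))

if (same_order) {
  # fast path: identical gene order, plain column binding
  counts_all <- data.frame(lapply(counts, `[[`, 2), row.names = genes)
} else {
  # general path: give each count column its sample name first, then merge;
  # no duplicate V2.x/V2.y columns can arise
  renamed <- Map(function(x, n) setNames(x, c("V1", n)), counts, names(counts))
  counts_all <- Reduce(function(a, b) merge(a, b, by = "V1", all = TRUE), renamed)
}
```

The renaming trick in the general path is also an alternative to relabeling inside the Reduce, as done further down in this thread.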

Many thanks!

smoe commented 2 years ago

I can reproduce the error now without the enclosing Reduce:

First, take the first four elements of counts as a, b, c, d:

a <- counts[[1]]
b <- counts[[2]]
c <- counts[[3]]
d <- counts[[4]]

Now, instead of cbind I use merge:

r<-merge(a,b,by="V1",all.x=TRUE)
r2<-merge(r,c,by="V1",all.x=TRUE)

So far so good, except: should this not be V2.z now? (The default suffixes only provide .x and .y, so the third count column keeps its bare V2 name.)

> r2
                    V1 V2.x V2.y  V2
    1:                    1   10 100
    2: ENSG00000000003  875  422 422
    3: ENSG00000000005    0    0   0
    4: ENSG00000000419  369  293 973
    5: ENSG00000000457  141  102 233
   ---                              
60661: ENSG00000288721    3    3   6
60662: ENSG00000288722  244  391 270
60663: ENSG00000288723    1    0   0
60664: ENSG00000288724    0    0   0
60665: ENSG00000288725    0    0   0

And now the error is triggered:

r3<-merge(r2,d,by="V1",all.x=TRUE)
Warning message:
In merge.data.table(r2, d, by = "V1", all.x = TRUE) :
  column names 'V2.x', 'V2.y' are duplicated in the result

The merge generic is documented in base R as:

     merge(x, y, by = intersect(names(x), names(y)),
           by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
           sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
           incomparables = NULL, ...)

and for the "no.dups" argument it reads:

 no.dups: logical indicating that ‘suffixes’ are appended in more cases
          to avoid duplicated column names in the result.  This was
          implicitly false before R version 3.5.0.

I am running R version 4.1.1, and something seems to have changed in the meantime that still disallows duplicates even when no.dups is set to FALSE. (Note that the error itself comes from merge.data.table, to which the call dispatches, as the traceback shows.) I helped myself out like this:

# merge list of data frames
labs <- apply(combn(LETTERS, 3), 2, paste, collapse = "")
counts_all <- as.data.frame(Reduce(function(dtf1, dtf2) {
    # relabel the already-merged count columns with unique names;
    # it is sufficient to rename one of the two arguments
    colnames(dtf1)[2:ncol(dtf1)] <- labs[seq_len(ncol(dtf1) - 1)]
    merge(dtf1, dtf2, by = "V1", all.x = TRUE)
  },
  counts))

This worked, but a follow-up error then was:

Error in rule norm_counts_deseq:
    jobid: 477
    output: /mariner/projects/HumanRabbitTFibrosisTimeSeries/output/feature_counts/normalized/hisat2/deseq_size_factors.txt, /mariner/projects/HumanRabbitTFibrosisTimeSeries/output/feature_counts/normalized/hisat2/deseq_normalized_counts.tsv
    log: /mariner/projects/HumanRabbitTFibrosisTimeSeries/output/logs/hisat2/norm_counts_deseq.log (check log file(s) for error message)
    shell:
        /usr/bin/Rscript --vanilla /usr/libexec/pigx_rnaseq/scripts/norm_counts_deseq.R /mariner/projects/HumanRabbitTFibrosisTimeSeries/output/feature_counts/raw_counts/hisat2/counts.tsv /mariner/projects/HumanRabbitTFibrosisTimeSeries/output/colData.tsv /mariner/projects/HumanRabbitTFibrosisTimeSeries/output/feature_counts/normalized/hisat2 >> /mariner/projects/HumanRabbitTFibrosisTimeSeries/output/logs/hisat2/norm_counts_deseq.log 2>&1
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mariner/projects/HumanRabbitTFibrosisTimeSeries/output/.snakemake/log/2021-10-22T211154.591373.snakemake.log
moeller@mariner3:/mariner/projects/HumanRabbitTFibrosisTimeSeries$ less /mariner/projects/HumanRabbitTFibrosisTimeSeries/output/logs/hisat2/norm_counts_deseq.log
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 1 did not have 95 elements
Calls: read.table -> scan

so I also added a col.names = NA to write.table. The normalisation with DESeq has just completed. I'll prepare a PR with just the minimal bits needed to get this to run with R 4.1.1.

rekado commented 2 years ago

Oof, thanks for reporting this, and for your patience debugging the issue!

Do you think we could add a simple test to detect problems like this in the future?

smoe commented 2 years ago

> Oof, thanks for reporting this, and for your patience debugging the issue!

You can guess where a good part of my Friday went :) It felt good, though; typically my job is to find problems, and this time I could fix one. My colleague ran your pipeline mostly glitch-free just a week ago, and then I was careless with a system upgrade. As the maintainer of the Debian package of pigx-rnaseq I do not really want to avoid such upgrades, though :)

> Do you think we could add a simple test to detect problems like this in the future?

The unit test would still have passed for collate if you applied it to only three data sets :) It is a tricky one. I can help you with "head -n 20 $(find . -name "*.sf")" or so. But the hand-over from one script to the next, which I addressed with the col.names=NA, you cannot really catch with a unit test; except, well, you can of course check for identity with an expected output.

My suggestion is to have a small experiment, say 5 vs 5, that you run from A to Z, not with every commit but maybe every week. And if that fails, you bisect. I cordially invite you to use Debian's CI. If I get this right, it is run whenever a dependency of pigx-rnaseq is updated.

borauyar commented 2 years ago

#113 fixes the issue. Thank you @smoe!