BIMSBbioinfo / pigx_rnaseq

Bulk RNA-seq Data Processing, Quality Control, and Downstream Analysis Pipeline
GNU General Public License v3.0
20 stars 11 forks source link

counts from salmon / tximport crash when adding new samples to project #81

Closed Nicolai-vKuegelgen closed 2 years ago

Nicolai-vKuegelgen commented 3 years ago

After adding some some samples to an already existing project, I just reran pigx pipeline to process them the same way. However, counts_from_salmon / tximport crashed:

1 2 3 4 5 6 7 8 9 10 11 12 Error in tximport::tximport(files, type = "salmon", txOut = TRUE) : 
  all(txId == raw[[txIdCol]]) is not TRUE 

This looks similar to the problem reported here, which was caused by salmon output files from different versions. However, since this was all done with the same pigx version (0.0.10), I don't see how the salmon version could have changed (unless guix bundles always the most recent salmon version with pigx?).

In the end just removing all salmon output files & rerunning then solved the issue.

I've also tried to used the guix R version in the pipeline (by using the full paths provided by the log files), but could not load tximport from those - I'm not sure if that means the guix installation is loading tximport from some weird place or if I just didn't load R properly when I wanted to test this interactively.

rekado commented 3 years ago

Nicolai von Kügelgen notifications@github.com writes:

After adding some some samples to an already existing project, I just reran pigx pipeline to process them the same way. However, counts_from_salmon / tximport crashed:

1 2 3 4 5 6 7 8 9 10 11 12 Error in tximport::tximport(files, type = "salmon", txOut = TRUE) : 
  all(txId == raw[[txIdCol]]) is not TRUE 

This looks similar to the problem reported here, which was caused by salmon output files from different versions. However, since this was all done with the same pigx version (0.0.10), I don't see how the salmon version could have changed (unless guix bundles always the most recent salmon version with pigx?).

We do indeed use the latest version of Salmon by default.

Our tests ought to be adjusted to check for the expected file format, so that we can catch this kind of incompatibility when upgrading.

We can make PiGx use an older version of Salmon, but I’d prefer to make it do the right thing with the latest version instead.

I've also tried to used the guix R version in the pipeline (by using the full paths provided by the log files), but could not load tximport from those - I'm not sure if that means the guix installation is loading tximport from some weird place or if I just didn't load R properly when I wanted to test this interactively.

It could be enlightening if you could share relevant output.

R uses the R_LIBS_SITE environment variable to search for packages. In PiGx we set this variable to a well-known value, which in all likelihood is not set when you run R manually.

Additionally, R loads a bunch of files from a user’s home directory (there’s nothing special about how R from Guix does things) by default, which takes precedence over site files.

If you want to avoid most of these problems caused by state I suggest using a tightly constrained environment with “guix environment --container” in a directory other than your home directory. This virtualizes the file system and only mounts the current directory into the container, so there’s no chance it can load whatever state happens to be in your home directory.

-- Ricardo

smoe commented 2 years ago

Hello @rekado et al.,

I just also ran into this and presume the problem to be on the interpretation of the name of the sample. pigx expects the output name to match the filename of the forward read of the input (having paired reads, which may also trigger this in some version of salmon) but the output that salmon gives is the name of the sample.

$ head -n 5 samples.csv
name,reads,reads2,sample_type,Ersatz,JM,TGFb,sample
158320,A006200190_158320_S60_L002_R1_001.fastq.gz,A006200190_158320_S60_L002_R2_001.fastq.gz,Control,R52rKF1T4P,yes,0,0,R52rKF1T4P
158323,A006200190_158323_S61_L002_R1_001.fastq.gz,A006200190_158323_S61_L002_R2_001.fastq.gz,JM_24h,R52rKF1T4P,yes,24,0,R52rKF1T4P
158325,A006200190_158325_S62_L002_R1_001.fastq.gz,A006200190_158325_S62_L002_R2_001.fastq.gz,TGF,R52rKF1T4P,yes,0,48,R52rKF1T4P
158327,A006200190_158327_S63_L002_R1_001.fastq.gz,A006200190_158327_S63_L002_R2_001.fastq.gz,TGF_JM_1h,R52rKF1T4P,yes,1,48,R52rKF1T4P
158329,A006200190_158329_S64_L002_R1_001.fastq.gz,A006200190_158329_S64_L002_R2_001.fastq.gz,TGF_JM_3h,R52rKF1T4P,yes,3,48,R52rKF1T4P

and the output of salmon is

ls output/salmon_output
158320  158325  158329  158333  158337  158342  158346  158350  158354  158358  158362  158366
158323  158327  158331  158335  158340  158344  158348  158352  158356  158360  158364  158368

The initial error I got is the same as reported by @Nicolai-vKuegelgen:

Error in tximport::tximport(files, type = "salmon", txOut = TRUE) :
  all(file.exists(files)) is not TRUE
Calls: writeCounts -> <Anonymous> -> stopifnot
Execution halted
..../output/logs/salmon/salmon_import_counts.log (END)

I then patched tximport to be a bit more verbose about what files are missing and got

$ less output/logs/salmon/salmon_import_counts.log
Error in tximport::tximport(files, type = "salmon", txOut = TRUE) :
  The following files are missing:
  .../SamplesRabbit/output/salmon_output/A006200190_158320_S60_L002_R1_001.fastq.gz/quant.sf
  .../output/salmon_output/A006200190_158323_S61_L002_R1_001.fastq.gz/quant.sf
  .../output/salmon_output/A006200190_158325_S62_L002_R1_001.fastq.gz/quant.sf
  .../output/salmon_output/A006200190_158327_S63_L002_R1_001.fastq.gz/quant.sf
  .../output/salmon_output/A006200190_158329_S64_L002_R1_001.fastq.gz/quant.sf
  .../output/salmon_output/A006200190_158331_S65_L002_R1_001.fastq.gz/quant.sf
  .../output/salmon_output/A006200190_158333_S66_L002_R1_001.fastq.gz/
Calls: writeCounts -> <Anonymous>
Execution halted

I presume that stop() has somehow limited the number of lines shown in the stop message.

The version of salmon I use is

$ salmon --version
salmon 1.5.2

which is not the ultimately latest but likely just the one you want me to have. I happily test things for you - ping me.

smoe commented 2 years ago

I just tried with my guix install and got he same:

$ less output/logs/salmon/salmon_import_counts.log
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
5: Setting LC_MONETARY failed, using "C"
6: Setting LC_PAPER failed, using "C"
7: Setting LC_MEASUREMENT failed, using "C"
Error in tximport::tximport(files, type = "salmon", txOut = TRUE) :
  all(file.exists(files)) is not TRUE
Calls: writeCounts -> <Anonymous> -> stopifnot
Execution halted
$ which pigx-rnaseq
/mariner/home/moeller/.guix-profile/bin/pigx-rnaseq

The version of salmon that is used by pigx-rnaseq is the same 1.5.2 from Debian - but this has worked before with a single read project.

smoe commented 2 years ago

I think I found it. The sample sheet had a column more than there were headers. This made read.csv presume that the first column (except for the shorter row with headers) are the initial row names which should in the transformation script then be substitute with what then gets the "name" column tag, which are the reads. @Nicolai-vKuegelgen , please check if your sample sheet has more column than headers, maybe it ends with a "," - you get the idea.

Example:

$ cat > foo.csv
a,b,c
1,2,3,4
11,22,33,44
111,222,333,444

but from within R

> f <- read.csv("foo.csv")
> f$b
[1]   3  33 333
smoe commented 2 years ago

Yup. That was it. I am now at the collate_read_counts problem again from https://github.com/BIMSBbioinfo/pigx_rnaseq/pull/113 - but that change does not seem to fix it.

Error in merge.data.table(dtf1, dtf2, by = "V1", all.x = TRUE) :    x has some duplicated column name(s): V2.x,V2.y.
Please remove or rename the duplicate(s) and try again.
Calls: as.data.frame -> Reduce -> f -> merge -> merge.data.table
In addition: Warning message: In merge.data.table(dtf1, dtf2, by = "V1", all.x = TRUE) :   column names 'V2.x', 'V2.y' are duplicated in the result
Execution halted

and this time both with the guix and the Debian installation (two different machines). I'll investigate.

This issue #81 here I suggest to address with a bit more of a validation of the sample sheet.

borauyar commented 2 years ago

This should fix this issue: https://github.com/BIMSBbioinfo/pigx_rnaseq/pull/124