Closed Nicolai-vKuegelgen closed 2 years ago
Nicolai von Kügelgen notifications@github.com writes:
After adding some samples to an already existing project, I just reran the pigx pipeline to process them the same way. However, counts_from_salmon / tximport crashed:
Error in tximport::tximport(files, type = "salmon", txOut = TRUE) :
  all(txId == raw[[txIdCol]]) is not TRUE
This looks similar to the problem reported here, which was caused by salmon output files from different versions. However, since this was all done with the same pigx version (0.0.10), I don't see how the salmon version could have changed (unless guix always bundles the most recent salmon version with pigx?).
We do indeed use the latest version of Salmon by default.
Our tests ought to be adjusted to check for the expected file format, so that we can catch this kind of incompatibility when upgrading.
We can make PiGx use an older version of Salmon, but I’d prefer to make it do the right thing with the latest version instead.
I've also tried to use the guix R version in the pipeline (by using the full paths provided by the log files), but could not load tximport from those - I'm not sure if that means the guix installation is loading tximport from some weird place or if I just didn't load R properly when I wanted to test this interactively.
It could be enlightening if you could share relevant output.
R uses the R_LIBS_SITE environment variable to search for packages. In PiGx we set this variable to a well-known value, which in all likelihood is not set when you run R manually.
Additionally, R loads a bunch of files from a user’s home directory (there’s nothing special about how R from Guix does things) by default, which takes precedence over site files.
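To see where the discrepancy comes from, a small diagnostic sketch (not part of PiGx - just what I would run in the interactive session to compare against the pipeline's environment):

```r
# Check which library paths this R session actually searches, and where
# tximport resolves from, if anywhere on that path.
Sys.getenv("R_LIBS_SITE")   # PiGx sets this; a manual shell usually does not
libs <- .libPaths()         # the effective package search path
tximport_path <- tryCatch(find.package("tximport"),
                          error = function(e) NA_character_)
tximport_path               # NA here means tximport is not on this path
```

If `tximport_path` is `NA` interactively but the pipeline works, the package is coming from the site library that only the pipeline's environment sets up.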
If you want to avoid most of these problems caused by state, I suggest using a tightly constrained environment with “guix environment --container” in a directory other than your home directory. This virtualizes the file system and mounts only the current directory into the container, so there’s no chance it can load whatever state happens to be in your home directory.
-- Ricardo
Hello @rekado et al.,
I just ran into this as well, and I presume the problem lies in how the sample name is interpreted. pigx expects the output directory name to match the filename of the forward read of the input (we have paired reads, which may also trigger this in some versions of salmon), but the output directory salmon creates is named after the sample.
$ head -n 5 samples.csv
name,reads,reads2,sample_type,Ersatz,JM,TGFb,sample
158320,A006200190_158320_S60_L002_R1_001.fastq.gz,A006200190_158320_S60_L002_R2_001.fastq.gz,Control,R52rKF1T4P,yes,0,0,R52rKF1T4P
158323,A006200190_158323_S61_L002_R1_001.fastq.gz,A006200190_158323_S61_L002_R2_001.fastq.gz,JM_24h,R52rKF1T4P,yes,24,0,R52rKF1T4P
158325,A006200190_158325_S62_L002_R1_001.fastq.gz,A006200190_158325_S62_L002_R2_001.fastq.gz,TGF,R52rKF1T4P,yes,0,48,R52rKF1T4P
158327,A006200190_158327_S63_L002_R1_001.fastq.gz,A006200190_158327_S63_L002_R2_001.fastq.gz,TGF_JM_1h,R52rKF1T4P,yes,1,48,R52rKF1T4P
158329,A006200190_158329_S64_L002_R1_001.fastq.gz,A006200190_158329_S64_L002_R2_001.fastq.gz,TGF_JM_3h,R52rKF1T4P,yes,3,48,R52rKF1T4P
and the output of salmon is
$ ls output/salmon_output
158320 158325 158329 158333 158337 158342 158346 158350 158354 158358 158362 158366
158323 158327 158331 158335 158340 158344 158348 158352 158356 158360 158364 158368
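To illustrate the mismatch: tximport expects one `quant.sf` per sample directory, so the paths handed to it must be built from whatever salmon actually named the directories. A minimal sketch with two of the sample names above (an illustration, not the pipeline's actual code):

```r
# Two rows of the sample sheet, inlined for the sketch
sample_sheet <- data.frame(
  name  = c("158320", "158323"),
  reads = c("A006200190_158320_S60_L002_R1_001.fastq.gz",
            "A006200190_158323_S61_L002_R1_001.fastq.gz"))

# Correct: directories are named after the sample...
files <- file.path("output", "salmon_output", sample_sheet$name, "quant.sf")
names(files) <- sample_sheet$name

# ...whereas building paths from the read file names instead would
# produce the nonexistent ".../A006200190_..._R1_001.fastq.gz/quant.sf"
# paths that file.exists() rejects in the error below.
```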
The initial error I got is similar to the one reported by @Nicolai-vKuegelgen:
Error in tximport::tximport(files, type = "salmon", txOut = TRUE) :
all(file.exists(files)) is not TRUE
Calls: writeCounts -> <Anonymous> -> stopifnot
Execution halted
..../output/logs/salmon/salmon_import_counts.log
I then patched tximport to be a bit more verbose about what files are missing and got
$ less output/logs/salmon/salmon_import_counts.log
Error in tximport::tximport(files, type = "salmon", txOut = TRUE) :
The following files are missing:
.../SamplesRabbit/output/salmon_output/A006200190_158320_S60_L002_R1_001.fastq.gz/quant.sf
.../output/salmon_output/A006200190_158323_S61_L002_R1_001.fastq.gz/quant.sf
.../output/salmon_output/A006200190_158325_S62_L002_R1_001.fastq.gz/quant.sf
.../output/salmon_output/A006200190_158327_S63_L002_R1_001.fastq.gz/quant.sf
.../output/salmon_output/A006200190_158329_S64_L002_R1_001.fastq.gz/quant.sf
.../output/salmon_output/A006200190_158331_S65_L002_R1_001.fastq.gz/quant.sf
.../output/salmon_output/A006200190_158333_S66_L002_R1_001.fastq.gz/
Calls: writeCounts -> <Anonymous>
Execution halted
I presume that stop() has somehow limited the number of lines shown in the stop message.
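That truncation is governed by R's `warning.length` option, which caps error and warning messages at 1000 bytes by default (the maximum is 8170). Raising it is a quick way to see more of the missing-file list:

```r
# R truncates error and warning messages to this many bytes (default 1000)
getOption("warning.length")

# Raise to the documented maximum so stop() can show a longer file list
options(warning.length = 8170)
```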
The version of salmon I use is
$ salmon --version
salmon 1.5.2
which is not the very latest, but likely just the one you want me to have. I'm happy to test things for you - ping me.
I just tried with my guix install and got the same:
$ less output/logs/salmon/salmon_import_counts.log
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
5: Setting LC_MONETARY failed, using "C"
6: Setting LC_PAPER failed, using "C"
7: Setting LC_MEASUREMENT failed, using "C"
Error in tximport::tximport(files, type = "salmon", txOut = TRUE) :
all(file.exists(files)) is not TRUE
Calls: writeCounts -> <Anonymous> -> stopifnot
Execution halted
$ which pigx-rnaseq
/mariner/home/moeller/.guix-profile/bin/pigx-rnaseq
The version of salmon used by pigx-rnaseq is the same 1.5.2 as on Debian - but this has worked before with a single-read project.
I think I found it. The sample sheet had one more column than there were headers. This makes read.csv treat the first column as row names (because the header row is shorter), shifting every column left - so what ends up under the "name" column tag are the read file names. @Nicolai-vKuegelgen, please check whether your sample sheet has more columns than headers - maybe the rows end with a "," - you get the idea.
Example:
$ cat > foo.csv
a,b,c
1,2,3,4
11,22,33,44
111,222,333,444
but from within R
> f <- read.csv("foo.csv")
> f$b
[1] 3 33 333
Yup. That was it. I am now at the collate_read_counts
problem again from https://github.com/BIMSBbioinfo/pigx_rnaseq/pull/113 - but that change does not seem to fix it.
Error in merge.data.table(dtf1, dtf2, by = "V1", all.x = TRUE) : x has some duplicated column name(s): V2.x,V2.y.
Please remove or rename the duplicate(s) and try again.
Calls: as.data.frame -> Reduce -> f -> merge -> merge.data.table
In addition: Warning message: In merge.data.table(dtf1, dtf2, by = "V1", all.x = TRUE) : column names 'V2.x', 'V2.y' are duplicated in the result
Execution halted
and this time both with the guix and the Debian installation (two different machines). I'll investigate.
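One way around the duplicated-column merge error - a sketch with made-up data, not the pipeline's actual code - is to give each per-sample count column a unique name before the Reduce/merge, so merge() never has to disambiguate with .x/.y suffixes:

```r
# Made-up per-sample tables: V1 = transcript id, V2 = count
tables <- list(
  s1 = data.frame(V1 = c("t1", "t2"), V2 = c(10, 20)),
  s2 = data.frame(V1 = c("t1", "t2"), V2 = c(5, 7)),
  s3 = data.frame(V1 = c("t1", "t2"), V2 = c(1, 2))
)

# Rename V2 to the sample name before merging
tables <- Map(function(df, nm) { names(df)[2] <- nm; df },
              tables, names(tables))

merged <- Reduce(function(a, b) merge(a, b, by = "V1", all.x = TRUE),
                 tables)
names(merged)  # "V1" "s1" "s2" "s3" - no V2.x/V2.y collisions
```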
I suggest addressing this issue (#81) with a bit more validation of the sample sheet.
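A sketch of what such a check could look like, using base R's count.fields on the malformed foo.csv example from above (one possible validation, not a proposal for the exact PiGx code):

```r
# Recreate the malformed example sheet: header has 3 fields, rows have 4
writeLines(c("a,b,c", "1,2,3,4", "11,22,33,44", "111,222,333,444"),
           "foo.csv")

fields   <- count.fields("foo.csv", sep = ",")
header_n <- fields[1]
bad_rows <- which(fields != header_n)
bad_rows  # line numbers whose field count differs from the header's
```

Refusing to run (with the offending line numbers in the message) when `bad_rows` is non-empty would have turned the cryptic tximport failure into an immediate, actionable error.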
This should fix this issue: https://github.com/BIMSBbioinfo/pigx_rnaseq/pull/124
In the end, just removing all salmon output files and rerunning solved the issue.