charlesfoster opened this issue 1 year ago
Hi @charlesfoster,
Thanks for opening the issue. The second run looks like it ends before there is any information about mapped reads. Have mappings started to be reported at that point? I wonder if there is some issue related to the loading of the index. I have a few suggestions that may be worthwhile to try:
1) Do you observe the same problem if you index only the transcriptome (i.e. if you don't also include the genome as a decoy)? (A minimal sketch of both indexing commands follows this list.)
2) If you are using nf-core/rnaseq, you can also consider using the STAR => salmon path. Of course, I'm interested in addressing whatever the underlying issue here is anyway, but it's worth noting that this may be a viable alternative to allow you to process all of these samples using the nf-core pipeline in the meantime. This will align the reads to the genome using STAR (which gives the benefit of having a full decoy), project them to the transcriptome, and then quantify them.
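For reference, the two indexing variants look roughly like this (a minimal sketch with placeholder file names, not the exact paths from your runs):

# transcriptome-only index (no decoy)
salmon index -t transcripts.fa -i salmon_index_nodecoy -k 31

# decoy-aware index: list the genome sequence names as decoys, concatenate
# transcriptome + genome into a "gentrome", and index that (roughly what
# nf-core/rnaseq does when it builds gentrome.fa)
grep "^>" genome.fa | cut -d " " -f 1 | sed 's/>//g' > decoys.txt
cat transcripts.fa genome.fa > gentrome.fa
salmon index -t gentrome.fa -d decoys.txt -i salmon_index_decoy -k 31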
Also, if you can share a set of problematic reads (or even a subset of them that will reproduce the extreme slowness problem) privately, that would be very helpful in debugging. In addition to trying to debug what's going on here, I'd probably also try running them through piscem. While this isn't yet an actual substitute for salmon, it will help isolate whether the problem is directly related to the index or to something else.
Thanks! Rob
Hi @rob-p,
Thanks for the speedy reply! There are definitely some strange things going on here. I can confirm that the second run (and the others that timed out) didn't produce any information about mapping. The outdir only contained empty subdirs and an empty log file.
Firstly, I downloaded the pre-built salmon index from refgenie using refgenie pull hg38/salmon_sa_index. I then ran salmon quant using this index and the singularity image of salmon v1.9.0. What do you know: it worked in about 11 minutes.
<truncated>
[2023-02-23 14:46:31.892] [jointLog] [info] Aggregating expressions to gene level
[2023-02-23 14:46:32.452] [jointLog] [info] done
This pre-built index does appear to be decoy-aware:
[2023-02-23 14:38:21.709] [jointLog] [info] Number of decoys : 195
[2023-02-23 14:38:21.709] [jointLog] [info] First decoy index : 177412
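The manual quantification command was along these lines (a sketch only; the index path, read files, and extra options below are placeholders rather than my exact invocation):

singularity run -B /data $SALMON_SIMG salmon quant \
    -i /path/to/hg38_salmon_sa_index \
    -l A \
    -1 ${R1} -2 ${R2} \
    -p 6 \
    --seqBias --gcBias --posBias \
    -o salmon_quant_sampleX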
Secondly, I created a new transcriptome-only salmon index (singularity run -B /data $SALMON_SIMG salmon index -t genome.transcripts.fa -i salmon_index -k 31), then ran salmon quant again (as above) but using the new transcriptome-only index. Note: 'genome.transcripts.fa' is the transcripts file created during the nf-core/rnaseq pipeline. Again, this analysis completed properly in a reasonable time.
Seems like there is something very wrong with the 'gentrome.fa' file that's being created by nf-core/rnaseq! It's just so odd that some samples would work and others wouldn't.
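If it helps, a quick sanity check of the pipeline's gentrome could look something like this (file names are assumptions based on what the pipeline produces in its work directory):

# number of sequences in the transcriptome, the gentrome, and the decoy list
grep -c "^>" genome.transcripts.fa
grep -c "^>" gentrome.fa
wc -l decoys.txt

# make sure the gentrome isn't truncated mid-record
tail -c 200 gentrome.fa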
Regarding your suggestion about the star_salmon route: I originally ran the pipeline with the following command:

nextflow run nf-core/rnaseq \
    --max_memory 55.GB \
    --fasta /data/reference_genomes/GRCh38/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \
    --gtf /data/reference_genomes/GRCh38/Homo_sapiens.GRCh38.106.gtf.gz \
    --skip_alignment \
    --pseudo_aligner salmon \
    --seq_center 'Ramaciotti Centre for Genomics' \
    --input samplesheet.csv \
    --outdir nf-core_results \
    --save_merged_fastq true \
    --skip_markduplicates true \
    --extra_salmon_quant_args '--seqBias --gcBias --posBias' \
    -profile singularity
I'll re-run (a) using the refgenie salmon index specified, and (b) with the star_salmon pathway, to see if the decoy-aware index created that way is appropriate.
I can also give piscem a go, although from the debugging above it does seem more like an issue with the salmon index created by nf-core/rnaseq. Do you agree? If so, I'll raise an issue there. Considering this, would it still be useful to have access to the reads? I've got the green light to share them if need be. If so, what's a good contact address to share a OneDrive link?
Thanks! Charles
P.S. Something else odd that I can dig into further later if need be: the singularity version of salmon created an index in about 5 minutes, yet the conda version has been creating the index for nearly 20 minutes so far with no change...
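If it's useful, I can time the two builds head to head, something like this (the image name and conda env name below are placeholders):

time singularity exec salmon_1.9.0.sif salmon index -t genome.transcripts.fa -i idx_singularity -k 31
time conda run -n salmon_env salmon index -t genome.transcripts.fa -i idx_conda -k 31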
Hi Charles,
Thanks for the super-detailed response! This behavior is very interesting indeed.
I still think having the ability to look at the reads might be useful. You could share a link with rob@cs.umd.edu. Also, I agree with raising this issue with the nf-core folks. It seems like there is some sort of strange interaction of ecosystems going on here, as we have also (independently) noticed some wonky behavior with bioconda builds of salmon recently. Actually, I'm just investigating a related issue there now.
Thanks! Rob
Cool, I've created the issue over at nf-core (https://github.com/nf-core/rnaseq/issues/948) and have emailed you a link to the reads. Let's hope this is all trivial!
Cheers, Charles
Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
salmon
Describe the bug
I'm working with 15 samples, with ~5 Gb of total reads per sample (90,000,000 to 100,000,000 reads, ~75 bp reads). I've tried running these samples through the nf-core/rnaseq pipeline, but the pipeline took an age to run before dying at the salmon quant steps. Some samples finished in about 12 minutes; others timed out after 8+ hours. (Taken from the terminal as the logfile is empty; the current time is 12:54 pm, i.e. >3 hr of run time so far.)
To Reproduce
I ran the nf-core/rnaseq command shown above. salmon was run within nf-core/rnaseq via singularity, and via conda while running it manually to troubleshoot.

Expected behavior
All samples with similar numbers of reads, using the same index, should finish in roughly the same amount of time.
Desktop (please complete the following information):
$ lsb_release -a
No LSB modules are available.
Distributor ID: Pop
Description:    Pop!_OS 22.04 LTS
Release:        22.04
Codename:       jammy
# build a kallisto index from the same transcripts file
kallisto index -i transcripts.idx genome.transcripts.fa
# quantify with 100 bootstraps and 6 threads, streaming decompressed reads
kallisto quant -i transcripts.idx -o output -b 100 <(zcat ${R1}) <(zcat ${R2}) -t 6