COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

salmon taking unexpectedly long to map reads #830

Open charlesfoster opened 1 year ago

charlesfoster commented 1 year ago

**Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?** salmon

**Describe the bug** I'm working with 15 samples, each with ~5 GB of reads (90,000,000 to 100,000,000 reads, ~75 bp each). I've tried running these samples through the nf-core/rnaseq pipeline, but it ran for a very long time before dying at the salmon quant steps: some samples finished in about 12 minutes, while others timed out after 8+ hours.

[2023-02-22 16:33:49.782] [jointLog] [info] setting maxHashResizeThreads to 6
[2023-02-22 16:33:49.782] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2023-02-22 16:33:49.782] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2023-02-22 16:33:49.782] [jointLog] [info] Setting consensusSlack to selective-alignment default of 0.35.
[2023-02-22 16:33:49.782] [jointLog] [info] parsing read library format
[2023-02-22 16:33:49.783] [jointLog] [info] There is 1 library.
[2023-02-22 16:33:49.784] [jointLog] [info] Loading pufferfish index
[2023-02-22 16:33:49.784] [jointLog] [info] Loading dense pufferfish index.
[2023-02-22 16:34:17.987] [jointLog] [info] done
[2023-02-22 16:34:18.037] [jointLog] [info] Index contained 245549 targets
[2023-02-22 16:34:19.908] [jointLog] [info] Number of decoys : 194
[2023-02-22 16:34:19.908] [jointLog] [info] First decoy index : 245314 
[2023-02-22 16:39:30.638] [jointLog] [info] Thread saw mini-batch with a maximum of 1.10% zero probability fragments
[2023-02-22 16:39:30.653] [jointLog] [info] Thread saw mini-batch with a maximum of 1.02% zero probability fragments
[2023-02-22 16:39:30.653] [jointLog] [info] Thread saw mini-batch with a maximum of 1.04% zero probability fragments
[2023-02-22 16:39:30.654] [jointLog] [info] Thread saw mini-batch with a maximum of 1.12% zero probability fragments
[2023-02-22 16:39:30.679] [jointLog] [info] Thread saw mini-batch with a maximum of 1.02% zero probability fragments
[2023-02-22 16:39:30.680] [jointLog] [info] Thread saw mini-batch with a maximum of 1.04% zero probability fragments
[2023-02-22 16:39:31.095] [jointLog] [info] Computed 862758 rich equivalence classes for further processing
[2023-02-22 16:39:31.095] [jointLog] [info] Counted 34607288 total reads in the equivalence classes 
[2023-02-22 16:39:31.099] [jointLog] [warning] 5.2431e-05% of fragments were shorter than the k used to build the index.
If this fraction is too large, consider re-building the index with a smaller k.
The minimum read size found was 26.

[2023-02-22 16:39:31.099] [jointLog] [info] Number of mappings discarded because of alignment score : 10896346
[2023-02-22 16:39:31.099] [jointLog] [info] Number of fragments entirely discarded because of alignment score : 2579830
[2023-02-22 16:39:31.099] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 2262090
[2023-02-22 16:39:31.099] [jointLog] [info] Number of fragments discarded because they have only dovetail (discordant) mappings to valid targets : 222960
[2023-02-22 16:39:31.099] [jointLog] [info] Mapping rate = 72.5797%

[2023-02-22 16:39:31.099] [jointLog] [info] finished quantifyLibrary()
[2023-02-22 16:39:31.099] [jointLog] [info] Starting optimizer
[2023-02-22 16:39:31.095] [fileLog] [info] 
At end of round 0
==================
Observed 47681748 total fragments (47681748 in most recent round)

[2023-02-22 16:39:31.343] [jointLog] [info] Marked 0 weighted equivalence classes as degenerate
[2023-02-22 16:39:31.387] [jointLog] [info] iteration = 0 | max rel diff. = 4339.45
[2023-02-22 16:39:31.726] [jointLog] [info] iteration 11, adjusting effective lengths to account for biases
[2023-02-22 16:40:27.779] [jointLog] [info] Computed expected counts (for bias correction)
[2023-02-22 16:40:27.779] [jointLog] [info] processed bias for 0.0% of the transcripts
[2023-02-22 16:40:33.273] [jointLog] [info] processed bias for 10.0% of the transcripts
[2023-02-22 16:40:39.173] [jointLog] [info] processed bias for 20.0% of the transcripts
[2023-02-22 16:40:44.721] [jointLog] [info] processed bias for 30.0% of the transcripts
[2023-02-22 16:40:50.705] [jointLog] [info] processed bias for 40.0% of the transcripts
[2023-02-22 16:40:56.911] [jointLog] [info] processed bias for 50.0% of the transcripts
[2023-02-22 16:41:03.696] [jointLog] [info] processed bias for 60.0% of the transcripts
[2023-02-22 16:41:10.168] [jointLog] [info] processed bias for 70.0% of the transcripts
[2023-02-22 16:41:17.126] [jointLog] [info] processed bias for 80.0% of the transcripts
[2023-02-22 16:41:22.471] [jointLog] [info] processed bias for 90.0% of the transcripts
[2023-02-22 16:41:29.827] [jointLog] [info] processed bias for 100.0% of the transcripts
[2023-02-22 16:41:32.850] [jointLog] [info] iteration = 100 | max rel diff. = 16.7408
[2023-02-22 16:41:36.182] [jointLog] [info] iteration = 200 | max rel diff. = 17.9954
[2023-02-22 16:41:39.516] [jointLog] [info] iteration = 300 | max rel diff. = 11.8957
[2023-02-22 16:41:42.847] [jointLog] [info] iteration = 400 | max rel diff. = 2.5997
[2023-02-22 16:41:46.185] [jointLog] [info] iteration = 500 | max rel diff. = 0.718418
[2023-02-22 16:41:49.508] [jointLog] [info] iteration = 600 | max rel diff. = 7.79084
[2023-02-22 16:41:52.851] [jointLog] [info] iteration = 700 | max rel diff. = 0.245996
[2023-02-22 16:41:56.186] [jointLog] [info] iteration = 800 | max rel diff. = 0.68426
[2023-02-22 16:41:59.522] [jointLog] [info] iteration = 900 | max rel diff. = 0.528129
[2023-02-22 16:42:02.858] [jointLog] [info] iteration = 1000 | max rel diff. = 0.241433
[2023-02-22 16:42:06.195] [jointLog] [info] iteration = 1100 | max rel diff. = 0.0741567
[2023-02-22 16:42:09.533] [jointLog] [info] iteration = 1200 | max rel diff. = 0.11409
[2023-02-22 16:42:12.875] [jointLog] [info] iteration = 1300 | max rel diff. = 3.98708
[2023-02-22 16:42:16.211] [jointLog] [info] iteration = 1400 | max rel diff. = 0.146333
[2023-02-22 16:42:19.531] [jointLog] [info] iteration = 1500 | max rel diff. = 0.118365
[2023-02-22 16:42:22.871] [jointLog] [info] iteration = 1600 | max rel diff. = 0.073492
[2023-02-22 16:42:26.203] [jointLog] [info] iteration = 1700 | max rel diff. = 0.201421
[2023-02-22 16:42:29.533] [jointLog] [info] iteration = 1800 | max rel diff. = 0.127739
[2023-02-22 16:42:32.868] [jointLog] [info] iteration = 1900 | max rel diff. = 0.042789
[2023-02-22 16:42:36.215] [jointLog] [info] iteration = 2000 | max rel diff. = 0.0303234
[2023-02-22 16:42:39.544] [jointLog] [info] iteration = 2100 | max rel diff. = 0.0413186
[2023-02-22 16:42:40.309] [jointLog] [info] iteration = 2124 | max rel diff. = 0.00863591
[2023-02-22 16:42:40.317] [jointLog] [info] Finished optimizer
[2023-02-22 16:42:40.317] [jointLog] [info] writing output 

[2023-02-22 16:42:40.556] [jointLog] [info] Computing gene-level abundance estimates
[2023-02-22 16:45:41.884] [jointLog] [info] There were 246511 transcripts mapping to 61552 genes
[2023-02-22 16:45:41.884] [jointLog] [info] NOTE: We recommend using tximport (https://bioconductor.org/packages/release/bioc/html/tximport.html) for aggregating transcript-level salmon abundance estimates to the gene level.  It is more versatile, exposes more features, and allows considering multi-sample information during aggregation.
[2023-02-22 16:45:42.113] [jointLog] [info] Aggregating expressions to gene level
[2023-02-22 16:45:42.215] [jointLog] [info] done
Version Info: This is the most recent version of salmon.
### salmon (selective-alignment-based) v1.9.0
### [ program ] => salmon 
### [ command ] => quant 
### [ geneMap ] => { /home/cfos/Documents/Collaboration/Ece/2023_Bulk/work/5d/a1220b107b2450d88e8e92fa0f3c06/Homo_sapiens.GRCh38.106.gtf }
### [ threads ] => { 6 }
### [ libType ] => { ISR }
### [ index ] => { /home/cfos/Documents/Collaboration/Ece/2023_Bulk/work/5d/a1220b107b2450d88e8e92fa0f3c06/salmon }
### [ mates1 ] => { /home/cfos/Documents/Collaboration/Ece/2023_Bulk/work/5d/a1220b107b2450d88e8e92fa0f3c06/ACV_REP2_1_val_1.fq.gz }
### [ mates2 ] => { /home/cfos/Documents/Collaboration/Ece/2023_Bulk/work/5d/a1220b107b2450d88e8e92fa0f3c06/ACV_REP2_2_val_2.fq.gz }
### [ seqBias ] => { }
### [ gcBias ] => { }
### [ posBias ] => { }
### [ output ] => { ACV_REP2 }
Logs will be written to ACV_REP2/logs
[2023-02-23 09:39:48.709] [jointLog] [info] setting maxHashResizeThreads to 6
[2023-02-23 09:39:48.709] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2023-02-23 09:39:48.709] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2023-02-23 09:39:48.709] [jointLog] [info] Setting consensusSlack to selective-alignment default of 0.35.
[2023-02-23 09:39:48.709] [jointLog] [info] parsing read library format
[2023-02-23 09:39:48.709] [jointLog] [info] There is 1 library.
[2023-02-23 09:39:48.709] [jointLog] [info] Loading pufferfish index
[2023-02-23 09:39:48.709] [jointLog] [info] Loading dense pufferfish index.
-----------------------------------------
| Loading contig table | Time = 15.056 s
-----------------------------------------
size = 37280289
-----------------------------------------
| Loading contig offsets | Time = 61.965 ms
-----------------------------------------
-----------------------------------------
| Loading reference lengths | Time = 513.44 us
-----------------------------------------
-----------------------------------------
| Loading mphf table | Time = 755.36 ms
-----------------------------------------
size = 3783918493
Number of ones: 37280288
Number of ones per inventory item: 512
Inventory entries filled: 72814
-----------------------------------------
| Loading contig boundaries | Time = 4.2405 s
-----------------------------------------
size = 3783918493
-----------------------------------------
| Loading sequence | Time = 387.95 ms
-----------------------------------------
size = 2665509853
-----------------------------------------
| Loading positions | Time = 4.3613 s
-----------------------------------------
size = 3516045923
-----------------------------------------
| Loading reference sequence | Time = 360.88 ms
-----------------------------------------
-----------------------------------------
| Loading reference accumulative lengths | Time = 985.04 us
-----------------------------------------
[2023-02-23 09:40:13.935] [jointLog] [info] done

(Log above taken from the terminal, as the logfile is empty; the current time is 12:54 pm, so this run has been going for >3 hours so far.)

**To Reproduce** I ran the following command:

salmon quant \
      --geneMap Homo_sapiens.GRCh38.106.gtf  \
      --threads 6 \
      --libType=ISR \
      --index salmon_index \
      -1 ACV_REP2_1_val_1.fq.gz -2 ACV_REP2_2_val_2.fq.gz \
      --seqBias --gcBias --posBias \
      -o ACV_REP2

**Expected behavior** All samples with similar numbers of reads, quantified against the same index, should finish in roughly the same amount of time.


**Desktop (please complete the following information):**

$ lsb_release -a
No LSB modules are available.
Distributor ID: Pop
Description:    Pop!_OS 22.04 LTS
Release:        22.04
Codename:       jammy


**Additional context**
- I can't share these reads publicly but might be able to share personally (but I'd have to ask first)

- I tried the suggestion from previous issues (e.g. [#527](https://github.com/COMBINE-lab/salmon/issues/527)) of adding `--hitFilterPolicy BOTH`, but it didn't seem to help.

- I tried trimming polyX tails with `fastp` but that didn't help either.

- To check whether `salmon` itself was causing the issues, I tried `kallisto` as a rough comparison. `kallisto` mapped the reads with 100 EM bootstraps in about 25 minutes. Commands:

kallisto index -i transcripts.idx genome.transcripts.fa
kallisto quant -i transcripts.idx -o output -b 100 <(zcat ${R1}) <(zcat ${R2}) -t 6



Thanks for the tool - hopefully this is easy to sort out!
rob-p commented 1 year ago

Hi @charlesfoster,

Thanks for opening the issue. The second run looks like it ends before there is any information about mapped reads. Have mappings started to be reported at that point? I wonder if there is some issue related to the loading of the index. I have a few suggestions that may be worthwhile to try:

1) Do you observe the same problem if you index only the transcriptome (i.e. if you don't also include the genome as a decoy)? (See the index-building sketch below.)

2) If you are using nf-core/rnaseq, you can also consider using the STAR => salmon path. Of course, I'm still interested in addressing whatever the underlying issue is, but this may be a viable way to process all of these samples with the nf-core pipeline in the meantime. This path aligns the reads to the genome using STAR (which gives the benefit of a full decoy), projects them to the transcriptome, and then quantifies them.
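For reference, a typical decoy-aware index build looks roughly like the sketch below (file and output names are placeholders); the transcriptome-only variant from suggestion 1 simply drops the decoy arguments:

# sketch: standard decoy-aware salmon index construction (placeholder file names)
grep "^>" genome.fa | cut -d " " -f 1 | sed 's/>//g' > decoys.txt
cat transcripts.fa genome.fa > gentrome.fa
salmon index -t gentrome.fa -d decoys.txt -i salmon_index_decoy -k 31

# transcriptome-only variant (suggestion 1): no decoy file needed
salmon index -t transcripts.fa -i salmon_index_txome -k 31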

Also, if you can share a set of problematic reads (or even a subset of them that will reproduce the extreme slowness problem) privately, that would be very helpful in debugging. In addition to trying to debug what's going on here, I'd probably also try running them through piscem. While this isn't yet an actual substitute for salmon, it will help isolate whether the problem is directly related to the index or to something else.

Thanks! Rob

charlesfoster commented 1 year ago

Hi @rob-p,

Thanks for the speedy reply! There are definitely some strange things going on here. I can confirm that the second run (and the others that timed out) didn't produce any information about mapping. The outdir only contained empty subdirs + an empty log file:

[screenshot: output directory containing only empty subdirectories and an empty log file]

  1. Thanks for the suggestion; I've been investigating the possibility of an index-related issue.

Firstly, I downloaded the pre-built salmon index from refgenie using `refgenie pull hg38/salmon_sa_index`. I then ran salmon quant using this index and the singularity image of salmon v1.9.0. What do you know: it worked in about 11 minutes.

<truncated>
[2023-02-23 14:46:31.892] [jointLog] [info] Aggregating expressions to gene level
[2023-02-23 14:46:32.452] [jointLog] [info] done

This pre-built index does appear to be decoy-aware:

[2023-02-23 14:38:21.709] [jointLog] [info] Number of decoys : 195
[2023-02-23 14:38:21.709] [jointLog] [info] First decoy index : 177412 
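For reference, the refgenie route above boils down to roughly the following (a sketch only; resolving the local index path with `refgenie seek` and the output directory name are my shorthand, not exactly what was run):

refgenie pull hg38/salmon_sa_index
# resolve the local path of the pulled index (assumed seek invocation)
IDX=$(refgenie seek hg38/salmon_sa_index)
salmon quant --index "$IDX" --libType ISR \
      -1 ACV_REP2_1_val_1.fq.gz -2 ACV_REP2_2_val_2.fq.gz \
      --geneMap Homo_sapiens.GRCh38.106.gtf --threads 6 \
      --seqBias --gcBias --posBias -o ACV_REP2_refgenie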

Secondly, I created a new transcriptome-only salmon index (`singularity run -B /data $SALMON_SIMG salmon index -t genome.transcripts.fa -i salmon_index -k 31`), then ran salmon quant again (as above) with this transcriptome-only index. Note: 'genome.transcripts.fa' is the transcripts file created during the nf-core/rnaseq pipeline. Again, this analysis completed properly in a reasonable time.

Seems like there is something very wrong with the 'gentrome.fa' file that's being created by nf-core/rnaseq! It's just so odd that some samples would work and others wouldn't.

  2. It's definitely worth noting that I originally opted against using star_salmon, running nf-core/rnaseq with the following command:

nextflow run nf-core/rnaseq \
      --max_memory 55.GB \
      --fasta /data/reference_genomes/GRCh38/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \
      --gtf /data/reference_genomes/GRCh38/Homo_sapiens.GRCh38.106.gtf.gz \
      --skip_alignment \
      --pseudo_aligner salmon \
      --seq_center 'Ramaciotti Centre for Genomics' \
      --input samplesheet.csv \
      --outdir nf-core_results \
      --save_merged_fastq true \
      --skip_markduplicates true \
      --extra_salmon_quant_args '--seqBias --gcBias --posBias' \
      -profile singularity

I'll re-run (a) with the refgenie salmon index specified explicitly, and (b) with the star_salmon pathway, to see if the decoy-aware index created that way is appropriate (roughly as sketched after this list).

  3. Other: I've installed piscem and can give it a go, although from the debugging above it seems more like an issue with the salmon index created by nf-core/rnaseq. Do you agree? If so, I'll raise an issue there.
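Roughly what I have in mind for the re-runs in point 2 (a sketch only; `--salmon_index` and `--aligner star_salmon` are my reading of the nf-core/rnaseq parameter docs, and paths are placeholders):

# (a) same invocation as before, but pointing the pipeline at the refgenie-built index
nextflow run nf-core/rnaseq ... --skip_alignment --pseudo_aligner salmon \
      --salmon_index /path/to/refgenie/salmon_sa_index

# (b) the STAR => salmon pathway instead of pseudo-alignment only
nextflow run nf-core/rnaseq ... --aligner star_salmon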

Considering this, would it still be useful to have access to the reads? I've got the green light to share them if need be. If so, what's a good contact address to share a OneDrive link?

Thanks! Charles

P.S. Something else odd that I can dig into later if need be: the singularity version of salmon created an index in about 5 minutes, yet the conda version has been building the same index for nearly 20 minutes so far with no sign of progress...
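If useful later, a quick way to compare the two environments might be something like this (a sketch; `$SALMON_SIMG` and the input file names are placeholders):

# confirm both environments report the same salmon version
singularity run -B /data $SALMON_SIMG salmon --version
salmon --version   # conda environment

# time the same index build in each environment
time singularity run -B /data $SALMON_SIMG salmon index -t gentrome.fa -d decoys.txt -i idx_singularity -k 31
time salmon index -t gentrome.fa -d decoys.txt -i idx_conda -k 31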

rob-p commented 1 year ago

Hi Charles,

Thanks for the super-detailed response! This behavior is very interesting indeed.

I still think having the ability to look at the reads might be useful. You could share a link with rob@cs.umd.edu. Also, I agree with raising this issue with the nfcore folks. It seems like there is some sort of strange interaction of ecosystems going on here, as we have also (independently) noticed some wonky behavior with bioconda builds of salmon recently. Actually, I'm just investigating a related issue there now.

Thanks! Rob

charlesfoster commented 1 year ago

Cool- I've created the issue over at nf-core (https://github.com/nf-core/rnaseq/issues/948), and have emailed you a link to the reads. Let's hope this is all trivial!

Cheers, Charles