COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
779 stars 165 forks source link

Alevin failing to quantitate Split-seq reads #867

Open jmerkin opened 1 year ago

jmerkin commented 1 year ago

Hi everyone. As always, thanks for the great tools.

I have some split-seq data (from Parse Bio) that ran through the Parse Bio pipeline already (1 moving part at a time). I am trying to run it through Alevin and it keeps failing with printing to the screen:

[2023-08-17 22:13:22.207] [alevinLog] [info] Done barcode density calculation. [2023-08-17 22:13:22.207] [alevinLog] [info] # Barcodes Used: 15722231 / 15722593. [2023-08-17 22:13:22.910] [alevinLog] [info] Total 545(has 201 low confidence) barcodes [2023-08-17 22:13:23.660] [alevinLog] [info] Done True Barcode Sampling (some lines later) [2023-08-17 22:14:04.046] [jointLog] [info] Computed 0 rich equivalence classes for further processing [2023-08-17 22:14:04.046] [jointLog] [info] Counted 0 total reads in the equivalence classes [2023-08-17 22:14:04.047] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0 [2023-08-17 22:14:04.083] [jointLog] [info] Mapping rate = 0%

It ultimately dies with a floating point error, probably for dividing by 0 somewhere.

The command I'm running is (and note that I have r1 and r2 swapped, per some other guidance. It dies earlier during barcode processing otherwise):

salmon alevin -i /ref/gencode.v43.transcripts/ -l ISR -1 r2.fastq.gz -2 r1.fastq.gz -p 30 --splitseqV2 --tgMap /ref/gencode.v43_full.txMap -o salmon_output --expectCells 400

I tried to make it simpler and simpler, so that's a salmon index I made myself with no decoys, just the gencode transcript fasta with the software version I'm trying to run (salmon 1.10.2, from the combinelab/salmon docker hub), gencode v43 (I know, it's a version behind, but it's what I'm using elsewhere...) and hg38. It's handling the barcodes fine and recovering approximately the right amount (it's a sub-library with a few hundred cells to check the quality of the library before sequencing the full experiment). But it's failing to quantitate any of the reads.

Oddly, just quantitating the read1 file with

salmon quant -i /ref/gencode.v43.transcripts/ -l ISR -r r1.fastq.gz -p 30 -o work/salmon_output

behaves as expected and prints this and proceeds

[2023-08-17 22:21:17.619] [jointLog] [info] Computed 618403 rich equivalence classes for further processing [2023-08-17 22:21:17.619] [jointLog] [info] Counted 6085013 total reads in the equivalence classes [2023-08-17 22:21:17.632] [jointLog] [info] Number of mappings discarded because of alignment score : 166235099

I found a discussion about the initial implementation of splitseq here with this comment from @rob-p. I copied the geometry parameters he mentioned for this command:

salmon alevin -i /ref/gencode.v43.transcripts/ -l A -1 r2.fastq.gz -2 r1.fastq.gz -p 30 --read-geometry 2[1-end] --umi-geometry 1[1-10] --bc-geometry 1[11-18,49-56,79-86] --tgMap /ref/gencode.v43_full.txMap -o work/salmon_output --expectCells 400

And now it works!

[2023-08-17 22:32:53.318] [jointLog] [info] Computed 171365 rich equivalence classes for further processing [2023-08-17 22:32:53.318] [jointLog] [info] Counted 2039348 total reads in the equivalence classes [2023-08-17 22:32:53.319] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0

The data I'm analyzing are some we generated, but it doesn't seem to relate to the data itself. If necessary, I could share it I think. I think there must be an issue with how the --splitseqV2 flag is handled.

I think that answers all the questions you initially request, but I'm happy to share more info if helpful.