COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Alevin less than 3% mapped reads #763

Open SeBaBInf opened 2 years ago

SeBaBInf commented 2 years ago

Hi all

I have been working with the raw fastq files from this study, and tried to map them using Alevin.

Following fastq-dump, I generated the barcode fastq files using umi_tools. This is 3' paired-end Chromium scRNA-seq, so I used:

umi_tools extract \
--stdin=GSE140511/fastq_files/SRR10480618_1.fastq.gz \
--stdout GSE140511/fastq_files/SRR10480618_BC_1.fastq.gz \
--read2-in GSE140511/fastq_files/SRR10480618_2.fastq.gz \
--read2-out GSE140511/fastq_files/SRR10480618_BC_2.fastq.gz \
--bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
--log=processed_SRR10480618.log
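
As a side note, a quick way to sanity-check the barcode layout (assuming R1 carries only the cell barcode plus UMI) is to tally the R1 read lengths; roughly 26 bp would point at Chromium v2 chemistry (16 bp CB + 10 bp UMI) and roughly 28 bp at v3 (16 bp CB + 12 bp UMI):

# Tally R1 sequence lengths over the first 1000 reads
zcat GSE140511/fastq_files/SRR10480618_1.fastq.gz | head -n 4000 \
| awk 'NR % 4 == 2 { print length($0) }' | sort | uniq -c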

Following this step, I then ran Alevin with:

salmon alevin \
-l ISR \
-1 GSE140511/fastq_files/SRR10480618_BC_1.fastq.gz GSE140511/fastq_files/SRR10480618_BC_2.fastq.gz \
-2 GSE140511/fastq_files/SRR10480618_1.fastq.gz GSE140511/fastq_files/SRR10480618_2.fastq.gz \
--chromium  \
-i /data/shared/seb_temp/Mm_index \
-p 32 \
-o /home/seb/shared/seb_temp/GSE140511/salmon_alevin_output/SRR10480618 \
--tgMap /home/seb/shared/seb_temp/txp2gene_SB.tsv

My txp2gene file looks like this:

ENSMUST00000178537.2    ENSMUSG00000095668.2
ENSMUST00000178862.2    ENSMUSG00000094569.2
ENSMUST00000196221.2    ENSMUSG00000096749.3
ENSMUST00000179664.2    ENSMUSG00000096749.3
ENSMUST00000177564.2    ENSMUSG00000096176.2
ENSMUST00000179520.2    ENSMUSG00000094028.2
ENSMUST00000179883.2    ENSMUSG00000094552.2
ENSMUST00000195858.2    ENSMUSG00000096420.3
ENSMUST00000179932.2    ENSMUSG00000096420.3
ENSMUST00000180001.2    ENSMUSG00000095656.2

I generated it from the GTF file of the annotation I used when building the index.
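
For reference, a transcript-to-gene table like this can be pulled straight from a GENCODE/Ensembl-style GTF with something along these lines (the GTF file name is just a placeholder; double-check the attribute layout against the actual file):

# Emit "transcript_id<TAB>gene_id" for every transcript line in the GTF
zcat gencode.vM28.annotation.gtf.gz \
| awk -F'\t' '$3 == "transcript" {
    match($9, /gene_id "[^"]+"/);       g = substr($9, RSTART + 9,  RLENGTH - 10);
    match($9, /transcript_id "[^"]+"/); t = substr($9, RSTART + 15, RLENGTH - 16);
    print t "\t" g
}' > txp2gene.tsv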

However, when salmon alevin finishes its run, I get a mapping rate of 3% or less, and no errors are reported.

I then tried to pass the barcodes published in GEO for this sample as a whitelist by adding the argument --whitelist GSM4173504_WT_1_barcodes.tsv, but then the mapping rate drops to 0%!

I am not sure what's going on... any suggestions would be great.

rob-p commented 2 years ago

Hi @SeBaBInf,

Thanks for reporting this. I'm pinging @k3yavi for his thoughts here. Two quick thoughts though -- the first is that the abstract for this paper mentions 5' tagged end sequencing, thus it might be necessary to swap the reads so that the biological and technical reads are in the expected order. Second, it's likely also worth seeing if and how the data look different if you process with alevin-fry rather than alevin. I'll let @k3yavi provide more detailed guidance here.
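
For reference, the usual alevin-fry route is roughly the following (just a sketch; please double-check the exact flag names against salmon alevin -h and alevin-fry --help, and see the alevin-fry documentation for a full walkthrough):

# 1) map in RAD mode (--sketch); the index and read files are placeholders
salmon alevin -l ISR --chromium -i <index> \
-1 <barcode_reads.fq.gz> -2 <biological_reads.fq.gz> \
-p 32 -o af_map --sketch
# 2) build the permit list (knee method)
alevin-fry generate-permit-list -d fw -i af_map -o af_quant --knee-distance
# 3) collate the RAD file by corrected barcode
alevin-fry collate -i af_quant -r af_map -t 16
# 4) quantify with the cr-like UMI resolution
alevin-fry quant -i af_quant -m txp2gene.tsv -r cr-like -o af_quant_res -t 16 --use-mtx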

--Rob

k3yavi commented 2 years ago

I guess there can be many causes, and it's hard to say without the logs; can you share the alevin and the salmon logs? As @rob-p mentioned, you can try changing ISR to ISF and/or using alevin-fry and check whether that makes a difference, although if the problem is at the mapping stage it probably won't matter.

I did notice one strange thing in the umi_tools extract command, though (please recheck this against the UMI-tools documentation): you wrote --stdout GSE140511/fastq_files/SRR10480618_BC_1.fastq.gz and --read2-out GSE140511/fastq_files/SRR10480618_BC_2.fastq.gz, and those two files are probably the ones that should be provided to alevin via the -1 and -2 flags. I am not sure why you are passing GSE140511/fastq_files/SRR10480618_1.fastq.gz to -2; that looks wrong to me, and maybe fixing it will solve the issue.
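
Concretely (untested, just rearranging the files from your first command; everything else left as-is), I would expect something along these lines:

salmon alevin \
-l ISR \
-1 GSE140511/fastq_files/SRR10480618_BC_1.fastq.gz \
-2 GSE140511/fastq_files/SRR10480618_BC_2.fastq.gz \
--chromium  \
-i /data/shared/seb_temp/Mm_index \
-p 32 \
-o /home/seb/shared/seb_temp/GSE140511/salmon_alevin_output/SRR10480618 \
--tgMap /home/seb/shared/seb_temp/txp2gene_SB.tsv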

Hope it helps.

SeBaBInf commented 2 years ago

Hi @k3yavi and @rob-p, thank you for the quick reply.

So, I think I misunderstood the -1 and -2 order on the Alevin page, which says verbatim:

salmon alevin -l ISR -1 lib_A_cb.fq.gz lib_B_cb.fq.gz -2 lib_A_read.fq.gz lib_B_read.fq.gz

It makes more sense now; it did seem strange.

I am also re-running umi_tools and generating a whitelist first. I will change ISR to ISF, re-run salmon alevin, and let you know if that solves the problem. Thank you.

SeBaBInf commented 2 years ago

So, I fixed the umi_tools commands as follows:

umi_tools whitelist \
-I /home/GSE140511/fastq_files/SRR10480618_1.fastq.gz \
--bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNNNN \
--log=/home/GSE140511/SRR10480618_fq/SRR10480618_log_whitelist.log \
--log2stderr > /home/GSE140511/SRR10480618_fq/SRR10480618_whitelist.txt

umi_tools extract \
--bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNNNN \
--stdin /home/GSE140511/fastq_files/SRR10480618_1.fastq.gz \
--stdout /home/GSE140511/SRR10480618_fq/SRR10480618_BC_1.fastq.gz \
--read2-in /home/GSE140511/fastq_files/SRR10480618_2.fastq.gz \
--read2-out=/home/GSE140511/SRR10480618_fq/SRR10480618_BC_2.fastq.gz \
--whitelist=/home/GSE140511/SRR10480618_fq/SRR10480618_whitelist.txt
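
As a quick check of the extraction (assuming umi_tools' default behaviour of moving the CB+UMI into the read names), the output files can be eyeballed with:

# the cell barcode and UMI should now appear appended to each read name
zcat /home/GSE140511/SRR10480618_fq/SRR10480618_BC_2.fastq.gz | head -n 8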

I swapped the reads as:

/salmon-1.8.0_linux_x86_64/bin/salmon alevin \
-l ISR \
-2 /home/GSE140511/SRR10480618_fq/SRR10480618_BC_trimmed_1.fastq.gz \
-1 /home/GSE140511/SRR10480618_fq/SRR10480618_BC_trimmed_2.fastq.gz \
--chromiumV3  \
-i /data/ref_genomes/Mmus_GrCm39 \
-p 32 \
-o /home/GSE140511/salmon_alevin_output/SRR10480618_rev2 \
--expectCells 3000 --forceCells 3000 \
--tgMap /home/txp2gene_SB.tsv

I tried both ISR and ISF (just in case); the mapping rate ranged from below 1% to just over 1%. I also tried with and without --expectCells 3000 --forceCells 3000, following a few suggestions here, but it didn't really make any difference.

Alevin.log from the last run is:

[2022-03-27 05:24:09.430] [alevinLog] [info] Found 116716 transcripts(+0 decoys, +39 short and +0 duplicate names in the index)
[2022-03-27 05:24:09.478] [alevinLog] [info] Filled with 116755 txp to gene entries 
[2022-03-27 05:24:09.484] [alevinLog] [info] Found all transcripts to gene mappings
[2022-03-27 05:24:09.495] [alevinLog] [info] Processing barcodes files (if Present) 

[2022-03-27 05:33:37.411] [alevinLog] [info] Done barcode density calculation.
[2022-03-27 05:33:37.411] [alevinLog] [info] # Barcodes Used: 359273127 / 359277869.
[2022-03-27 05:34:04.367] [alevinLog] [info] Throwing 0 barcodes with < 10 reads
[2022-03-27 05:34:05.069] [alevinLog] [info] Total 4000 (has 999 low confidence) barcodes
[2022-03-27 05:34:07.956] [alevinLog] [info] Done True Barcode Sampling
[2022-03-27 05:34:25.703] [alevinLog] [warning] Total 91.5531% reads will be thrown away because of noisy Cellular barcodes.
[2022-03-27 05:34:26.221] [alevinLog] [info] Done populating Z matrix
[2022-03-27 05:34:26.232] [alevinLog] [info] Total 60208 CB got sequence corrected
[2022-03-27 05:34:26.234] [alevinLog] [info] Done indexing Barcodes
[2022-03-27 05:34:26.234] [alevinLog] [info] Total Unique barcodes found: 127233006
[2022-03-27 05:34:26.234] [alevinLog] [info] Used Barcodes except Whitelist: 50131
[2022-03-27 05:34:26.966] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2022-03-27 05:34:26.966] [alevinLog] [info] parsing read library format
[2022-03-27 05:46:41.876] [alevinLog] [info] Starting optimizer

[2022-03-27 05:46:42.064] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting
[2022-03-27 05:46:42.064] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting
[2022-03-27 05:49:49.761] [alevinLog] [info] Total 535438.00 UMI after deduplicating.
[2022-03-27 05:49:49.761] [alevinLog] [info] Total 2317116 BiDirected Edges.
[2022-03-27 05:49:49.761] [alevinLog] [info] Total 867878 UniDirected Edges.
[2022-03-27 05:49:49.761] [alevinLog] [warning] Skipped 955 barcodes due to No mapped read
[2022-03-27 05:49:49.766] [alevinLog] [info] Clearing EqMap; Might take some time.
[2022-03-27 05:49:50.011] [alevinLog] [info] Starting white listing of 3044 cells
[2022-03-27 05:49:50.011] [alevinLog] [info] Starting to make feature Matrix
[2022-03-27 05:49:50.014] [alevinLog] [info] Done making feature Matrix
[2022-03-27 05:49:50.717] [alevinLog] [info] Finished white listing
[2022-03-27 05:49:51.422] [alevinLog] [info] Finished optimizer

and the salmon_quant.log looks like:

[2022-03-27 05:24:09.395] [jointLog] [info] setting maxHashResizeThreads to 32
[2022-03-27 05:24:09.395] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2022-03-27 05:24:09.395] [jointLog] [info] The --mimicBT2, --mimicStrictBT2 and --hardFilter flags imply mapping validation (--validateMappings). Enabling mapping validation.
[2022-03-27 05:24:09.395] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2022-03-27 05:24:09.395] [jointLog] [info] The use of range-factorized equivalence classes does not make sense in conjunction with --hardFilter.  Disabling range-factorized equivalence classes. 
[2022-03-27 05:24:09.395] [jointLog] [info] Setting consensusSlack to selective-alignment default of 0.35.
[2022-03-27 05:24:09.395] [jointLog] [info] Using default value of 0.87 for minScoreFraction in Alevin
Using default value of 0.6 for consensusSlack in Alevin
[2022-03-27 05:34:26.966] [jointLog] [info] There is 1 library.
[2022-03-27 05:34:26.967] [jointLog] [info] Loading pufferfish index
[2022-03-27 05:34:26.967] [jointLog] [info] Loading dense pufferfish index.
[2022-03-27 05:34:27.433] [jointLog] [info] done
[2022-03-27 05:34:27.504] [jointLog] [info] Index contained 116,755 targets
[2022-03-27 05:34:27.540] [jointLog] [info] Number of decoys : 0
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 10.50% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 7.74% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 23.62% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 9.60% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 15.40% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 25.48% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 21.52% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 19.16% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 14.78% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 55.96% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 38.08% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 44.98% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 6.34% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 17.24% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 27.34% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 13.16% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 7.48% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 48.24% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 8.14% zero probability fragments
[2022-03-27 05:46:41.460] [jointLog] [info] Thread saw mini-batch with a maximum of 7.12% zero probability fragments
[2022-03-27 05:46:41.504] [jointLog] [info] Thread saw mini-batch with a maximum of 49.08% zero probability fragments
[2022-03-27 05:46:41.505] [jointLog] [info] Thread saw mini-batch with a maximum of 19.74% zero probability fragments
[2022-03-27 05:46:41.520] [jointLog] [info] Thread saw mini-batch with a maximum of 14.40% zero probability fragments
[2022-03-27 05:46:41.533] [jointLog] [info] Thread saw mini-batch with a maximum of 18.12% zero probability fragments
[2022-03-27 05:46:41.534] [jointLog] [info] Thread saw mini-batch with a maximum of 36.44% zero probability fragments
[2022-03-27 05:46:41.535] [jointLog] [info] Thread saw mini-batch with a maximum of 40.30% zero probability fragments
[2022-03-27 05:46:41.547] [jointLog] [info] Thread saw mini-batch with a maximum of 21.58% zero probability fragments
[2022-03-27 05:46:41.561] [jointLog] [info] Thread saw mini-batch with a maximum of 11.16% zero probability fragments
[2022-03-27 05:46:41.606] [jointLog] [info] Thread saw mini-batch with a maximum of 5.44% zero probability fragments
[2022-03-27 05:46:41.789] [jointLog] [info] Computed 115,698 rich equivalence classes for further processing
[2022-03-27 05:46:41.789] [jointLog] [info] Counted 5,906,094 total reads in the equivalence classes 
[2022-03-27 05:46:41.789] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2022-03-27 05:46:41.789] [jointLog] [info] Mapping rate = 1.64388%

[2022-03-27 05:46:41.789] [jointLog] [info] finished quantifyLibrary()

I could try alevin-fry, but based on the above I'm not sure it would make a difference.

This GSE study has two parts, and this is one of them; the second part seems to be missing some files, and the QC of the files that were uploaded is not the best either...

I'm trying to figure out whether anything can be extracted from the raw data at all!