COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
768 stars 161 forks source link

Low mapping percentage of reads for non-model organism #706

Open Rohit-Satyam opened 2 years ago

Rohit-Satyam commented 2 years ago

Hi developers!!

Background

I am using Alevin to align the P.Falciparum single cell dataset. For this sample we got a total of 27% of reads aligning using CellRanger and I am expecting this number to increase a bit.

Details of sequencing are

Sequencing platform used: HiSeq4000 Sequencing Chemistry: V3 Chemistry (Chromium Single Cell 3' v3, 10x genomics)

raw-data

S1_L003_I1_001.fastq.gz S1_L003_R1_001.fastq.gz S1_L003_R2_001.fastq.gz

Preprocessing

I used the following steps by following this tutorial: I use AGAT because the code shared for GRCh38 didn't work well for my organism.

# preparing the decoys
grep "^>" < PlasmoDB-54_Pfalciparum3D7_Genome.fasta | cut -d " " -f 1 > decoys_p3d7.txt
sed -i.bak -e 's/>//g' decoys.txt

#indexing
~/salmon-1.5.2_linux_x86_64/bin/salmon index -t gentrome_p3d7.fasta -d decoys_p3d7.txt -p 12 -i salmon_index

#making transcript to gene mapping file. First converting gff to gtf and then gtf to a .tsv file whose 10th column contain gene ID and 17th column contain transcript ID
agat_convert_sp_gff2gtf.pl --gff PlasmoDB-54_Pfalciparum3D7.gff -o PlasmoDB-54_Pfalciparum3D7.gtf
agat_convert_sp_gff2tsv.pl -gff PlasmoDB-54_Pfalciparum3D7.gff -o table
awk -F'\t' '{print $17,$10}' OFS='\t' table | sort -u > txgene_p3d7.txt

Then I ran Alevin in three times (each time adding flags as discussed below) to see if mapping percentage increases or not:

nohup ./salmon alevin -l ISR -1 S1_L003_R1_001.fastq.gz -2 S1_L003_R2_001.fastq.gz --chromiumV3  -i ~/salmon_selective/salmon_index/ -p 20 -o ~/salmon_selective/9NT  --tgMap ~/salmon_selective/txgene_p3d7.txt &
{
    "total_reads": 89886851,
    "reads_with_N": 0,
    "noisy_cb_reads": 50060025,
    "noisy_umi_reads": 1579,
    "used_reads": 39825247,
    "mapping_rate": 0.8265580468493662,
    "reads_in_eqclasses": 742967,
    "total_cbs": 675135,
    "used_cbs": 7824,
    "initial_whitelist": 109,
    "low_conf_cbs": 201,
    "num_features": 0,
    "no_read_mapping_cbs": 21,
    "final_num_cbs": 288,
    "deduplicated_umis": 46357,
    "mean_umis_per_cell": 160,
    "mean_genes_per_cell": 39
}

[2021-09-20 22:27:11.264] [alevinLog] [info] Found 5757 transcripts(+16 decoys, +8 short and +0 duplicate names in the index)
[2021-09-20 22:27:11.271] [alevinLog] [info] Filled with 5765 txp to gene entries
[2021-09-20 22:27:11.272] [alevinLog] [info] Found all transcripts to gene mappings
[2021-09-20 22:27:11.282] [alevinLog] [info] Processing barcodes files (if Present)

[2021-09-20 22:30:06.824] [alevinLog] [info] Done barcode density calculation.
[2021-09-20 22:30:06.824] [alevinLog] [info] # Barcodes Used: ESC[32m89886851ESC[0m / ESC[31m89886851ESC[0m.
[2021-09-20 22:30:09.717] [alevinLog] [info] Knee found left boundary at ESC[32m 1051 ESC[0m
[2021-09-20 22:30:11.442] [alevinLog] [info] Gauss Corrected Boundary at ESC[32m 109 ESC[0m
[2021-09-20 22:30:11.442] [alevinLog] [info] Learned InvCov: 796.079 normfactor: 896.047
[2021-09-20 22:30:11.442] [alevinLog] [info] Total ESC[32m310ESC[0m(has ESC[32m201ESC[0m low confidence) barcodes
[2021-09-20 22:30:12.167] [alevinLog] [info] Done True Barcode Sampling
[2021-09-20 22:30:12.316] [alevinLog] [warning] Total 55.6923% reads will be thrown away because of noisy Cellular barcodes.
[2021-09-20 22:30:12.333] [alevinLog] [info] Done populating Z matrix
[2021-09-20 22:30:12.334] [alevinLog] [info] Total 7602 CB got sequence corrected
[2021-09-20 22:30:12.334] [alevinLog] [info] Done indexing Barcodes
[2021-09-20 22:30:12.334] [alevinLog] [info] Total Unique barcodes found: 675135
[2021-09-20 22:30:12.334] [alevinLog] [info] Used Barcodes except Whitelist: 7515
[2021-09-20 22:30:13.043] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2021-09-20 22:30:13.044] [alevinLog] [info] parsing read library format
[2021-09-20 22:33:09.346] [alevinLog] [info] Starting optimizer

[2021-09-20 22:33:09.516] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting
[2021-09-20 22:33:09.516] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting
[2021-09-20 22:33:09.576] [alevinLog] [info] Total 46357.00 UMI after deduplicating.
[2021-09-20 22:33:09.576] [alevinLog] [info] Total 2930 BiDirected Edges.
[2021-09-20 22:33:09.576] [alevinLog] [info] Total 3804 UniDirected Edges.
[2021-09-20 22:33:09.576] [alevinLog] [warning] Skipped 21 barcodes due to No mapped read
[2021-09-20 22:33:09.579] [alevinLog] [info] Clearing EqMap; Might take some time.
[2021-09-20 22:33:09.590] [alevinLog] [warning] Num Low confidence barcodes too less 186 < 200.Can't performing whitelisting; Skipping
[2021-09-20 22:33:09.591] [alevinLog] [info] Finished optimizer

Then I tried setting --keepCBFraction 1. This does decrease the total number of reads being thrown away. However, mapping percentage is still low as compared to what I was getting from CellRanger (27%). I thought that since Alevin takes into consideration the multi mapping reads, the mapping percentage will likely increase because when we ran STAR on this data we found a lot of multi mapping reads.

nohup ./salmon alevin -l ISR -1 S1_L003_R1_001.fastq.gz -2 S1_L003_R2_001.fastq.gz --chromiumV3  -i ~/salmon_selective/salmon_index/ -p 20 -o ~/salmon_selective/9NT_keepCBfraction --keepCBFraction 1  --tgMap ~/salmon_selective/txgene_p3d7.txt &

[2021-09-21 00:11:13.532] [alevinLog] [info] Total 1.97665% reads will be thrown away because of noisy Cellular barcodes.
[2021-09-21 00:11:17.977] [alevinLog] [info] Done populating Z matrix
[2021-09-21 00:11:17.984] [alevinLog] [info] Total 21839 CB got sequence corrected
[2021-09-21 00:11:17.985] [alevinLog] [info] Done indexing Barcodes
[2021-09-21 00:11:17.985] [alevinLog] [info] Total Unique barcodes found: 675135
[2021-09-21 00:11:17.985] [alevinLog] [info] Used Barcodes except Whitelist: 14713
[2021-09-21 00:11:18.697] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

{
    "total_reads": 89886851,
    "reads_with_N": 0,
    "noisy_cb_reads": 1776744,
    "noisy_umi_reads": 2969,
    "used_reads": 88107138,
    "mapping_rate": 21.116563533858807,
    "reads_in_eqclasses": 18981014,
    "total_cbs": 675135,
    "used_cbs": 115457,
    "initial_whitelist": 100000,
    "low_conf_cbs": 744,
    "num_features": 5,
    "no_read_mapping_cbs": 48836,
    "final_num_cbs": 31796,
    "deduplicated_umis": 585622,
    "mean_umis_per_cell": 18,
    "mean_genes_per_cell": 11
}

    "num_processed": 89886851,
    "num_mapped": 18981014,
    "num_decoy_fragments": 39115,
    "num_dovetail_fragments": 0,
    "num_fragments_filtered_vm": 0,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 0,
    "percent_mapped": 21.116563533858807,
    "call": "quant",
    "start_time": "Tue Sep 21 00:08:37 2021",
    "end_time": "Tue Sep 21 00:14:18 2021"

I also tried setting the --expectCells 2000 to see if this increases the mapping percentage but it doesn't significantly increase mapping percentage.


    "num_bootstraps": 0,
    "num_processed": 89886851,
    "num_mapped": 918407,
    "num_decoy_fragments": 1151,
    "num_dovetail_fragments": 0,
    "num_fragments_filtered_vm": 0,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 0,
    "percent_mapped": 1.0217367610308208,
    "call": "quant",
    "start_time": "Tue Sep 21 00:05:29 2021",
    "end_time": "Tue Sep 21 00:11:04 2021"

Follow-up question

Q1. I observe that "seq_bias_correct": false, "gc_bias_correct": false, are set to false. How do I turn them on ?? does the Alevin accepts --seqBias and --gcBias flags because they weren't mentioned in the comments.

Q2. How do I make use of I1.fastq.gz file. I for sure believe that I have to combine it somehow with R1.fastq.gz but please help me understand how to do that if that's so.

Rohit-Satyam commented 2 years ago

Also in the lib_format_counts.json , the reads doesn't seems to belong to any library type!!


{
    "read_files": "[ /nfs_master/anshul/KAUST_Plasmo_scRNAseq/M-19-3774_9-NT_SI-GA-F3_S1_L003_R1_001.fastq.gz, /nfs_master/anshul/KAUST_Plasmo_scRNAseq/M-19-3774_9-NT_SI-GA-F3_S1_L003_R2_001.fastq.gz]",
    "expected_format": "ISR",
    "compatible_fragment_ratio": 1.0,
    "num_compatible_fragments": 18981014,
    "num_assigned_fragments": 18981014,
    "num_frags_with_concordant_consistent_mappings": 0,
    "num_frags_with_inconsistent_or_orphan_mappings": 29136874,
    "strand_mapping_bias": 0.0,
    "MSF": 0,
    "OSF": 0,
    "ISF": 0,
    "MSR": 0,
    "OSR": 0,
    "ISR": 0,
    "SF": 0,
    "SR": 0,
    "MU": 0,
    "OU": 0,
    "IU": 0,
    "U": 0
}
rahulnutron commented 2 years ago

Dear Rohit, I am facing the same issue with a non-model organism, I was wondering if you are manage to improve the mapping rate with different arguments. If yes, could you please let us know the solution.

Rahul

rob-p commented 2 years ago

Hi @rahulnutron,

Currently, the recommendation is to move to alevin-fry. If you don't have a good annotation (i.e. transcripts as well as introns) then you can index just the transcriptome as normal. The big differences are that, rather than letting alevin do the quantification, you would ask salmon alevin to just a do the mapping phase and produce a rad file. This can be done with (assuming chromium v3 chemistry):

salmon alevin -i <index> --chromiumV3 -l A -1 <reads1> -2 <reads2> -p 16 -o <outdir> --sketch

note that with the alevin-fry pipeline, there is no need to provide the t2g map at this phase (it's used later). Further, the --sketch flag tells alevin just to map the reads and to prepare the RAD file for subsequent quantification with alevin-fry.