COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

problem with building decoy-aware transcriptome file #603

Open Davidwei7 opened 3 years ago

Davidwei7 commented 3 years ago

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)? bulk mode

Describe the bug

Version Info: salmon v1.4.0 (the most recent version at the time), as shown in the log below:
### salmon (selective-alignment-based) v1.4.0
### [ program ] => salmon
### [ command ] => quant
### [ index ] => { /scratch/scratch/skgtjzw/workspace/middle_aged_microglia/salmon_quantification_SAF/salmon_index/ }
### [ libType ] => { A }
### [ mates1 ] => { /scratch/scratch/skgtjzw/workspace/middle_aged_microglia/salmon_quantification_SAF/SRR2557119_1.fastq.gz }
### [ mates2 ] => { /scratch/scratch/skgtjzw/workspace/middle_aged_microglia/salmon_quantification_SAF/SRR2557119_2.fastq.gz }
### [ threads ] => { 8 }
### [ validateMappings ] => { }
### [ output ] => { quants/SRR2557119_quant }
### [ gcBias ] => { }
Logs will be written to quants/SRR2557119_quant/logs
[2020-12-13 19:07:51.508] [jointLog] [info] setting maxHashResizeThreads to 8
[2020-12-13 19:07:51.508] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2020-12-13 19:07:51.508] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2020-12-13 19:07:51.508] [jointLog] [info] Setting consensusSlack to selective-alignment default of 0.35.
[2020-12-13 19:07:51.508] [jointLog] [info] parsing read library format
[2020-12-13 19:07:51.508] [jointLog] [info] There is 1 library.
Exception : [Error: The index version file /scratch/scratch/skgtjzw/workspace/middle_aged_microglia/salmon_quantification_SAF/gencode_v36_decoy_salmon-1.4.0/versionInfo.json doesn't seem to exist.  Please try re-building the salmon index.]
salmon quant was invoked improperly.
For usage information, try salmon quant --help

To Reproduce: steps and data to reproduce the behavior. Building the index:

  1. Download the human transcriptome and genome sequences (latest release at the time, GENCODE v36):
    wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/GRCh38.primary_assembly.genome.fa.gz
    wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/gencode.v36.transcripts.fa.gz
  2. unzip:
    gunzip gencode.v36.transcripts.fa.gz
    gunzip GRCh38.primary_assembly.genome.fa.gz
  3. Following the instructions for building a decoy-aware salmon index (https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/, https://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype):
    grep "^>" GRCh38.primary_assembly.genome.fa | cut -d " " -f 1 > GRCh38.decoys.txt
    sed -i 's/>//g' GRCh38.decoys.txt
    cat gencode.v36.transcripts.fa GRCh38.primary_assembly.genome.fa | gzip > GRCh38.gentrome.fa.gz
    salmon index -t GRCh38.gentrome.fa.gz -d GRCh38.decoys.txt -p 12 -i salmon_index --gencode
  4. Run salmon
    IDX="/scratch/scratch/skgtjzw/workspace/middle_aged_microglia/salmon_quantification_SAF/salmon_index/"
    for fn in /scratch/scratch/skgtjzw/workspace/middle_aged_microglia/salmon_quantification_SAF/SRR{2557119..2557121}
    do
    samp=`basename ${fn}`
    echo "Processing sample ${samp}"
    salmon quant -i $IDX -l A -1 ${fn}_1.fastq.gz -2 ${fn}_2.fastq.gz -p 8 \
    --validateMappings \
    -o quants/${samp}_quant \
    --gcBias
    done
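
(Aside: a quick way to sanity-check the step 3 build before quantifying, sketched below; versionInfo.json is the file the quant error above reports as missing, so its absence indicates an incomplete index.)

    # a completed salmon 1.x index should contain versionInfo.json and info.json
    ls -lh salmon_index/
    cat salmon_index/versionInfo.json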


Desktop: on a cluster running Ubuntu Linux.

Additional context: I originally built the index the default way, using only gencode.v36.transcripts.fa.gz, and the mapping rate for my fastq files was 42-55%, which I think is quite low (based on what I have read, it is expected to be around 70%). I then thought I would give the decoy-aware transcriptome index a go, and that did not go well, as presented above.

I found an old post describing a similar issue, where the conclusion was that the index build had not completed due to the memory allocation on the cluster. If the same thing is happening in my case, what memory allocation should I ask the admin for to run this index-building job? Any guidance on this would be greatly appreciated, and thank you in advance!

rob-p commented 3 years ago

Hi @Davidwei7,

Thank you for the very detailed bug report! So, I have two initial responses / thoughts about your issue.

First, you asked if the issue may be related to a memory allocation error wherein the index didn't build successfully. This is quite possible (and the error you see during quantification is consistent with that). The full decoy index is substantially larger than the index on just the transcriptome (after all, it is indexing the entire human genome in addition to the transcriptome). One thing you might try to test this hypothesis, other than requesting to build on a node with more RAM, is to compute a hash (e.g. md5 or sha256 sum) on all of the files in the index, and also record their sizes. Then we can build the index on the same version of the files on our end and compare.
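
For concreteness, a minimal sketch of that check (assuming GNU coreutils are available; salmon_index is the index directory from the report above):

    # per-file md5 checksums for everything in the index directory
    find salmon_index -type f -exec md5sum {} \; | sort -k2
    # per-file sizes in bytes
    find salmon_index -type f -exec du -b {} \; | sort -k2

Sorting by file name makes the two listings easy to diff against a build of the same files made on another machine.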

Second, and perhaps more importantly for your end goal: the main purpose of the decoy-aware index is to improve specificity rather than sensitivity. That is, the decoys are designed to help avoid spurious mapping of reads to an annotated transcript when a better explanation for the read exists elsewhere in the genome. However, the reads that map to decoys are not otherwise used for quantification. Thus, using the decoy-aware transcriptome index is unlikely to improve your mapping rate.

I agree that your mapping rate does seem rather low. There are a few potential culprits here, and some diagnostics we can look at to see what might be going wrong. First, you can take a look at the file aux_info/meta_info.json in the salmon quantification directories to get a few more details about why reads were not mapped; if you share one of those files here, I can describe the relevant fields. Also, there are two more rather common things to consider that might affect the mapping rate. One is to add the sequences of the ribosomal RNAs to your transcriptome before indexing and quantifying: if your mapping rate increases considerably, that is evidence of rather inefficient rRNA depletion prior to sequencing. The other is to do basic adapter / quality trimming on the reads to see if that affects your mapping rate at all.
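
To make those three checks concrete, a rough sketch (rRNA.fa is a hypothetical FASTA of rRNA sequences, e.g. exported from RNAcentral or Ensembl; jq and fastp are assumed to be installed; the field names queried below are the ones salmon 1.x writes to meta_info.json):

    # 1) inspect salmon's own mapping summary
    jq '{num_processed, num_mapped, percent_mapped}' \
        quants/SRR2557119_quant/aux_info/meta_info.json

    # 2) rebuild the gentrome with rRNA sequences included (note: with
    #    --gencode, non-GENCODE headers without '|' separators are kept whole)
    cat gencode.v36.transcripts.fa rRNA.fa GRCh38.primary_assembly.genome.fa \
        | gzip > GRCh38.gentrome_rRNA.fa.gz
    salmon index -t GRCh38.gentrome_rRNA.fa.gz -d GRCh38.decoys.txt \
        -p 12 -i salmon_index_rRNA --gencode

    # 3) adapter / quality trimming, e.g. with fastp
    fastp -i SRR2557119_1.fastq.gz -I SRR2557119_2.fastq.gz \
        -o SRR2557119_trimmed_1.fastq.gz -O SRR2557119_trimmed_2.fastq.gz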

I hope these two different responses are useful, and I'll keep this issue open so feel free to reply here with any further questions or discoveries you make regarding the above.

Davidwei7 commented 3 years ago

Dear @rob-p,

Hope you are well, thank you so much for your response and input.

Following your suggestions, I first tried to compute a hash over all of the files in the index using this command:

    find /scratch/scratch/skgtjzw/workspace/middle_aged_microglia/salmon_quantification_SAF/salmon_index -type f -exec md5sum {} \; | md5sum

which returned:

    c0959830010f005c7f2e041aac4829ef  -

Secondly, I have attached the aux_info/meta_info.json files here: SRR2557120, SRR2557121, SRR2557119.

Thirdly, as I am quite new to RNA-seq analysis and coding, I am not sure how I can add the sequences of the ribosomal RNAs to my transcriptome. For example, where could I find such files for GENCODE? And with a file of ribosomal RNA sequences, do I give both gencode.v36.transcripts.fa.gz and the ribosomal RNA file together to the salmon index command, or is there a certain parameter in the salmon index command that needs to be changed?

Finally, these three datasets are from a published paper, and I am trying to re-analyse them. Before salmon quantification, I had already run fastp on them for trimming and QC.

Hopefully, I am making sense here. Please let me know if there is anything incorrect here. Thank you for helping out in advance.

Best Wishes,

David

rob-p commented 3 years ago

Thanks for the details, David.

I wanted to get a broader perspective on what was going on mapping-wise, so I ran the first sample through a STAR->salmon pipeline (the brand new one in nf-core). This also lets us see what is happening in terms of mapping to the genome versus the annotated transcripts. Here's what I found:

The STAR mapping report shows:

                                 Started job on |       Dec 15 18:01:41
                             Started mapping on |       Dec 15 18:12:46
                                    Finished on |       Dec 15 18:25:37
       Mapping speed, Million of reads per hour |       103.29

                          Number of input reads |       22120369
                      Average input read length |       290
                                    UNIQUE READS:
                   Uniquely mapped reads number |       18500061
                        Uniquely mapped reads % |       83.63%
                          Average mapped length |       284.22
                       Number of splices: Total |       5999311
            Number of splices: Annotated (sjdb) |       5961890
                       Number of splices: GT/AG |       5905895
                       Number of splices: GC/AG |       41312
                       Number of splices: AT/AC |       11584
               Number of splices: Non-canonical |       40520
                      Mismatch rate per base, % |       0.36%
                         Deletion rate per base |       0.01%
                        Deletion average length |       1.46
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.72
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       1805463
             % of reads mapped to multiple loci |       8.16%
        Number of reads mapped to too many loci |       3745
             % of reads mapped to too many loci |       0.02%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       8.07%
                     % of reads unmapped: other |       0.12%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

Yet only 9,310,303 reads were determined by STAR to project properly to annotated transcripts (slightly fewer than are mapped to the transcriptome by salmon, at least without the decoy sequence included). So a very high fraction of the reads align to the genome, but a much smaller fraction (~45-50%) align to the transcriptome.

There are many reasons something like this could happen, but it suggests that a lot of reads are being generated from outside of annotated transcripts. This could be a mix of novel transcripts in this sample (both at entirely novel loci and as novel transcripts within annotated loci), as well as noisy transcription, unannotated transcribed pseudogenes, etc. I took a look at the bigwig generated by this pipeline, and STAR seems to be mapping quite a lot of reads to chr21 as well as to the mitochondrial genome (chrM). However, since neither salmon using selective alignment (mapping to the transcriptome) nor the STAR->salmon pipeline sees these reads mapping to annotated transcripts, they must be arising from outside of those annotated regions.
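
To see that per-chromosome picture directly, a minimal sketch (assuming a coordinate-sorted BAM from STAR; the file name is hypothetical):

    # index the BAM, then report mapped read counts per reference sequence
    samtools index SRR2557119.Aligned.sortedByCoord.out.bam
    samtools idxstats SRR2557119.Aligned.sortedByCoord.out.bam | sort -k3,3nr | head

Column 3 of the idxstats output is the number of mapped reads per reference, so the head of this listing shows the chromosomes (e.g. chr21, chrM) attracting the most reads.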

There are a few options in this case. You could manually investigate where these reads are coming from by aligning them with, e.g., STAR and inspecting the BAM files. Alternatively, you could attempt to assemble novel transcripts (e.g. using StringTie or Scallop, as sketched below) and then add them to the transcriptome for quantification. However, getting to the bottom of the relatively low mapping rate to the annotated transcriptome, in light of the relatively high mapping rate to the whole genome but outside of annotated transcripts, may require a bit more digging. I'm happy to answer any other specific questions that arise if you dig into this.
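
For reference, the assembly route might look like this (a minimal sketch; the BAM and GTF file names are hypothetical, and the GENCODE v36 annotation GTF would be downloaded separately):

    # assemble transcripts from the STAR alignment, guided by the annotation
    # (STAR may need --outSAMstrandField intronMotif for StringTie to use spliced reads)
    stringtie SRR2557119.Aligned.sortedByCoord.out.bam \
        -G gencode.v36.annotation.gtf -o SRR2557119.assembled.gtf -p 8
    # classify the assembled transcripts against the reference annotation
    gffcompare -r gencode.v36.annotation.gtf -o strtcmp SRR2557119.assembled.gtf

Transcripts that gffcompare flags as novel could then be appended to the transcriptome FASTA (e.g. extracted with gffread) before re-indexing with salmon.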

Davidwei7 commented 3 years ago

Dear @rob-p,

Hope you are well.

Thank you so much for the time and effort that you put in for helping me.

I finally built the full decoy index after increasing the RAM allocation to 35 GB (a figure I found in previous posts), and successfully ran salmon quant on the samples with the full decoy index.
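
For anyone who lands here with the same problem, the corresponding job request might look something like this (a sketch assuming a SLURM scheduler; on SGE or PBS the flags differ, and 35 GB is simply the figure that worked here):

    # request ~35 GB of RAM and 12 CPUs for the decoy-aware index build
    sbatch --mem=35G --cpus-per-task=12 --wrap \
        "salmon index -t GRCh38.gentrome.fa.gz -d GRCh38.decoys.txt -p 12 -i salmon_index --gencode"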

With the decoy sequence included, the mapping rate of these samples dropped slightly compared to salmon without the decoy sequence. I think this is expected, as the decoys are there to avoid spurious alignments to annotated transcripts (a more conservative approach?).

(three screenshots of the salmon mapping-rate logs attached)

The explanations and reasons you proposed are very useful and helpful. The fact that the two quantification methods suggest the same result makes me hopeful that more digging will reveal what is actually going on. I am a bit out of my depth on assembling novel transcripts, or on extracting unmapped reads, aligning them with STAR, and inspecting the BAM files, but I will give it a try.

Best Wishes,

David