COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
776 stars 164 forks source link

Critical error in Salmon in alignment mode #131

Open shilpagarg opened 7 years ago

shilpagarg commented 7 years ago

I am interested to run your tool: Salmon. It is a great and easy tool to use.

I encounter this error by using:

./bin/salmon quant -t output.fa -g hgTables.gtf -l IU -a ../../pipeline/rna_seq/all_chr1/merged.bam -p 20 -o salmon_quant

Error:

[2017-04-11 12:33:40.420] [jointLog] [critical] Transcript chr1 appeared in the BAM header, but was not in the provided FASTA file
[2017-04-11 12:33:40.420] [jointLog] [critical] Please provide a reference FASTA file that includes all targets present in the BAM header
If you have access to the genome FASTA and GTF used for alignment 
consider generating a transcriptome fasta using a command like: 
gffread -w output.fa -g genome.fa genome.gtf
you can find the gffread utility at (http://ccb.jhu.edu/software/stringtie/gff.shtml)

I generated output.fa using gffreads from hg19.fa and gtf . In the output.fa, it is annotated in terms of gene ids, but there is chr1 in BAM. I am not sure how can I make them to talk to each other?

Could you please tell me how can I fix it?

rob-p commented 7 years ago

Hi @shilpagarg,

Could you tell me the command you used for generating your alignment files? It looks here like you aligned your reads to the genome directly (rather than aligning to the transcriptome, or e.g., using STAR's ability to project genomic alignments to the transcriptome).

shilpagarg commented 7 years ago

I used tophat. It is already aligned to transcriptome. I magine tophat is designed for rna-seq alignment. They only need gtf and reference genome, then they will know how to align reads to exons. Am I missing something?

rob-p commented 7 years ago

I see. TopHat actually aligns the reads to the genome (using Bowtie and a strategy to perform split-read mapping). The results of TopHat, then, are meant to be used with tools like Cufflinks, which expects reads to be mapped directly to the genome rather than to the transcriptome. Salmon, on the other hand, works like tools such as RSEM / eXpress, which expect alignments to the transcriptome directly. This can be accomplished by either mapping the reads directly to the transcript sequences (using e.g. Bowtie2 / BWA-MEM) or by mapping the reads to the genome using a tool such as STAR, and telling it to project the alignments onto genomic coordinates.

However, I should mention that the easiest thing to do is to simply have Salmon build and index on your transcript set and then pass it the raw (compressed) FASTQ files directly. Since Salmon provides an accurate and lightweight alignment proxy, it can accurately assess transcript abundance estimates directly from the raw (unaligned) sequenced reads. If you have questions about using either of these modes, please take a look at the documentation. I'd also be happy to answer any other questions you might have.

shilpagarg commented 7 years ago

Thanks. Do you know any tool that is used to transfer coordinates of BAM aligned to genome to transcriptome? I heard about sam-xlate, is it good?

rob-p commented 7 years ago

sam-xlate is actually the only tool that I'm aware of to perform this operation on an existing BAM file. I've heard of people using it with success. Of course, I'd also think of doing an analysis with the original reads to validate concordance. Note: if you don't have the original reads, you can do a BAM -> FASTQ conversion to recover the read sequences and then feed them to Salmon.

shilpagarg commented 7 years ago

I am using the following commands:

./bin/salmon index -t ../Homo_sapiens.GRCh37.75.cdna.all.fa.gz -i transcripts_index --type quasi -k 31

./bin/salmon quant -l A -i transcripts_index -r ../../rnaseq_data/SRR764785.fastq ../../rnaseq_data/SRR764786.fastq ../../rnaseq_data/SRR764787.fastq ../../rnaseq_data/SRR764788.fastq ../../rnaseq_data/SRR764789.fastq ../../rnaseq_data/SRR764790.fastq ../../rnaseq_data/SRR764791.fastq ../../rnaseq_data/SRR764792.fastq ../../rnaseq_data/SRR764793.fastq ../../rnaseq_data/SRR764794.fastq ../../rnaseq_data/SRR764795.fastq ../../rnaseq_data/SRR764796.fastq ../../rnaseq_data/SRR764797.fastq ../../rnaseq_data/SRR764802.fastq ../../rnaseq_data/SRR764803.fastq ../../rnaseq_data/SRR764804.fastq ../../rnaseq_data/SRR764805.fastq ../../rnaseq_data/SRR764806.fastq ../../rnaseq_data/SRR764807.fastq ../../rnaseq_data/SRR764808.fastq ../../rnaseq_data/SRR764809.fastq ../../rnaseq_data/SRR764810.fastq ../../rnaseq_data/SRR764811.fastq ../../rnaseq_data/SRR764812.fastq ../../rnaseq_data/SRR764813.fastq -1 ../../rnaseq_data/SRR764782_1.fastq ../../rnaseq_data/SRR764783_1.fastq ../../rnaseq_data/SRR764784_1.fastq -2 ../../rnaseq_data/SRR764782_2.fastq ../../rnaseq_data/SRR764783_2.fastq ../../rnaseq_data/SRR764784_2.fastq -o transcripts_quant

and I get the following error and no .sf file in output:

processed 783000000 fragmentsrag:  3.56267[2017-04-11 20:17:10.337] [jointLog] [info] Automatically detected most likely library type as IU
hits: 184962311, hits per frag:  0.236263couldn't dequeue read chunk
couldn't dequeue read chunk
couldn't dequeue read chunk
couldn't dequeue read chunk

Am I using the commands wrong?

rob-p commented 7 years ago

Hi @shilpagarg,

One issue (which I should definitely check for and report in the option parsing) is that you cannot quantify paired-end and unpaired libraries together (i.e. you should not pass both -r and -1/-2 at the same time). Does quantification complete successfully for you if you just map e.g. the paired end reads?

shilpagarg commented 7 years ago

yes, Thanks so much.

shilpagarg commented 7 years ago

Next, I am using alignment mode for paired-end reads aligned to reference transcript using bwa and I get this warning:

WARNING: Detected suspicious pair ---
        The names are different:
        read1 : SRR764782.282
        read2 : SRR764782.283

[2017-04-12 10:30:00.202] [jointLog] [warning]

WARNING: Detected suspicious pair ---
        The names are different:
        read1 : SRR764782.283
        read2 : SRR764782.284

Shall I just ignore or salmon assumes paired end reads as different reads?

Next question: why non-alignment mode is better than alignment mode?

iromalekou commented 2 years ago

Hi @shilpagarg, did you find a solution for the warning below?

WARNING: Detected suspicious pair --- The names are different: read1 : SRR764782.282 read2 : SRR764782.283

MonaLiu421 commented 1 year ago

Hi, @shilpagarg @iromalekou ,I have the same problem. Can i ignore these warnings? image

MonaLiu421 commented 1 year ago

Salmon detected suspicious pair, will it be quantified?

rob-p commented 1 year ago

No, you should not ignore these warnings if you see many of them. This is indicative of the fact that the alignments for the reads have gotten out of sync, and that you ran your upstream aligner in a manner (or subsequently filtered / re-organized your bam file in a manner) such that the BAM file cannot properly be parsed by salmon anymore.

MonaLiu421 commented 1 year ago

Thanks for your quick reply. My bam file is filtered, and there will be some single-ended reads. I don’t know how to deal with these 'orphan'. The program give me the info:orphan status: paired error likelihood: inf

rob-p commented 1 year ago

The easiest thing to do is to remove the orphan alignments (the alternative is to add an bam record for the unaligned mate for each occurrence of the aligned orphan).

I don't know what program you are using upstream for alignment or what options, but for STAR and Bowtie2 there are appropriate flags you can give to the program to produce such BAM files.

MonaLiu421 commented 1 year ago

Thanks for your quick reply. My bam files were genrated by BWA. If I want to take into account the read of this single-end alignment, I need to add its unmatched information, right?

rob-p commented 1 year ago

Correct. Just like RSEM, salmon requires that if you have alignments for reads that were paired in sequencing, then (1) all of the alignments for a given read are consecutive in the file and (2) the alignment for a reads mate is always directly after (correspondingly before) it's mate.

Say you had a fragment X that aligned in 2 different locations consisting of left read XL and right read XR, the BAM file would have to have records like:

XL (first mapping)
XR (first mapping)
XL (second mapping)
XR (second mapping)

Likewise, say you had a fragment Y that aligned in 2 different locations, consisting of a left read YL that aligned but was an orphan (so that YR didn't align). The BAM file would have to have records like this:

YL (first mapping)
YR (paired with YL above, but unmapped)
YL (second mapping)
YR (paired with YL above, but unmapped)

If the unmapped mates are missing from the file, then the parser can get out of sync and you get the types of warnings you see above.

MonaLiu421 commented 1 year ago

ok, I get it. Thanks a millon.🌼