Missing transcript - Githubissues

skyungyong commented 6 years ago

Hello,

I have RNA-seq data for 3 runs under the same condition, and the data for each run was generated as single-end libraries using 3 lanes (100bp, 100bp and 50bp). After mapping using Salmon against the transcriptome and normalizing the counts using DESeq2, I found out that a control gene had a much higher level of deviation between runs. One of them was reported to have zero counts.

I ran Salmon for each library in that run and got the following results for that gene

Lib1 (50bp): GeneA 1859 1610.000 0.058387 6.313500 Lib2 (100bp): GeneA 1859 1610.000 0.086646 11.991785 Lib3 (100bp): GeneA 1859 1610.000 0.000000 0.000000 Lib1+Lib2+Lib3: GeneA 1859 1610.000 0.000000 0.000011

So I expected that Lib3 would have a problem. But when I ran bowite2 against the transcriptome, the raw counts for GeneA was 39 for Lib1, 21 for Lib2 and 24 for lib3, which looked normal. The number of reads was almost the same in each library, but the size of the library was 1/2 in Lib1 (50bp).

It was not clear to me why Samlon failed to detect GeneA in Lib3 even though the same parameters were used and also why in the combined library, the TPM and number of reads became 0 despite the counts of this gene in Lib1 and Lib2. How would I be able to improve the sensitivity (on the report, only 0.3% was reported to be smaller than K).

for indexing and mapping, I used the default parameters:

salmon index -t Nbv5.1_TrPrm_NtCorrect.fasta Nbv5.1_TrPrm_NtCorrect_index salmon quant -i ../../Database/Nbv5.1_TrPrm_NtCorrect_index -l U -r ../1-Transcriptome_Trimmed_Reads/QTCR002A_S5_L00T_R1_001_trimmed.fq.gz -p 12 -o QTCR002A_S5_L00T_R1_001_trimmed.against.Nbv5.1_TrPrm.out

Thanks,

hiraksarkar commented 6 years ago

Hi @skyungyong ,

It is unexpected to see that count of gene A in Lib3 would vanish. To understand at exactly what stage counts have become 0, can you please look for the count values for the transcripts that originate from GeneA (can be obtained from transcript to gene mapping used in DESeq2). If they are non-zero then it might have to do how tximport is dealing with them, if the corresponding transcript abundances are zero too (or below the expected level), then we need to see if this is because of duplicate transcripts and more aggressive mapping strategies or not.

It would be great if the data is sharable (I completely understand that in some cases they are not), then I can take a deeper look at those reads to discover mapping artifacts.

Thanks for using Salmon.

PS: Interested in knowing the length of the gene.

skyungyong commented 6 years ago

Thanks for your reply. First, this output below is from Salmon. */quant.sf | grep "GeneA" and I added some more information here.

Lib1 (50bp): GeneA 1859 1610.000 0.058387 6.313500 Lib2 (100bp): GeneA 1859 1610.000 0.086646 11.991785 Lib3 (100bp): GeneA 1859 1610.000 0.000000 0.000000 Lib1+Lib2+Lib3: GeneA 1859 1610.000 0.000000 0.000011

second, for the transcript-level analysis, we didn't aggregate the transcripts into genes. So we treated a transcript as a unique gene (or in other words, a gene has only one transcript even though it's not necessarily true). The length of this transcript is 1859 nt.

While looking more into this, I noticed there is another similar transcript (transcriptB), which shares 91% identity with transcriptA (or geneA). The last 600bp were 100% identical. The length of transcriptB is 1900 nt. Do you think this can be a reason? If the reads are mapped into the 100% identical region, how does Salmon behaves?

hiraksarkar commented 6 years ago

Hi, Thanks for your reply. Yes, this is very much possible that a spurious mapping to a longer transcript is masking the original mapping, I find it extremely interesting from a mapping perspective. There are two ways to track down this, if the read file and reference are sharable, I can try to track the mapping artifact, otherwise, I would request you to try this branch of salmon. The build process of the branch is identical with Salmon, and below is the instruction to run it.


build/src/salmon index -t Nbv5.1_TrPrm_NtCorrect.fasta -t Nbv5.1_TrPrm_NtCorrect_index
build/src/salmon quant -i Nbv5.1_TrPrm_NtCorrect_index -la -r <read file> -o <quant_dir> --softFilter --editDistance 4 --rangeFactorization 4

This branch is not exactly designed for single end data, but I would be interested to see if there is any change. If you can provide me with some example dataset I will push possible bug fixes ASAP.

skyungyong commented 6 years ago

I have just sent you the data in email. Please let me know how it turns out!

hiraksarkar commented 6 years ago

Thanks!! Looking into it, replied.

radlinsky commented 5 years ago

Any updates? I am having a related issue with 0 reads being found by Salmon: TBX21 transcripts ENST00000177694 and ENST00000581328 both report 0 read counts, but I am confident there are plenty of reads from these transcripts. (1) download sra and fastq-dump, (2) trim_galore (3) fastQC (4) STAR to generate aligned bams (for other analyses) and Salmon in mapping-based mode to get transcript counts.

I am using the same transcript annotation for STAR and Salmon.

When I samtools view the aligned bam over the TBX21 region, I get back >4000 paired reads. The majority of the TBX21 reads have flags 99 or 147: (# of reads) | (flag) 4431 | 147 12 | 355 14 | 403 2 | 419 4432 | 99

I also confirmed that many of these reads are indeed from the TBX21 spliced transcripts (cross splice junctions).

I am running Salmon in mapping-based mode on the unaligned fastqs, and it is picking up exactly 0 reads in these transcripts.

salmon index -t hg38_salmon_transcriptome.fa -i salmon_hg38_index --type quasi -k 31 salmon quant -i salmon_hg38_index -l ISR -p 8 -1 SRR1615172_1_val_1.fq.gz -2 SRR1615172_2_val_2.fq.gz -o salmon_quant_SRR1615172

The genome-wide distribution of insert size ranges for this sample are unusual (bi-modal), and this is partly why STAR only mapped 65% of the reads. The other issue with the sample is STAR reports 19% multi-mapped reads, but even so, there are still at least 4000 reads uniquely mapping to TBX21.

Attached are:

Output from Salmon

salmon_quant.log lib_format_counts.json.zip

Output from STAR

SRR1615173.Log.final.out.STAR.txt

Output from samtools view over the TBX21 gene start and end (hg38 17:47733244-47746119)

TBX21_reads.txt

FastQC reports of the two fastqs

SRR1615173_1_val_1.fq_fastqc.zip SRR1615173_2_val_2.fq_fastqc.zip

Output from CollectInsertSizeMetrics

insert_size_histogram.pdf insert_size_metrics.txt

rob-p commented 5 years ago

Hi @radlinsky,

I've downloaded the read data and am looking into this. In the meantime, I did notice some relevant output from your log. First, when I run with the automatic library type, I get that the most likely library type is ISF rather than ISR. Second, I note that ~4M fragments are discarded because they produce dovetailing reads. We discard dovetailing reads by default (this was a recent change in default behavior, though it is the same default choice made by e.g. Bowtie2). You can allow these reads to be mapped and quantified by passing salmon the --allowDovetail flag during quantification. Does this make any different in the alignments you see for this gene?

--Rob

radlinsky commented 5 years ago

Thank you @rob-p, you were right. ISR was the wrong libtype. Using ISF fixes my problem, phew!

farshadf commented 2 years ago

Thanks!! Looking into it, replied.

Hi,

I am having a similar issue when running salmon 1.4 on stranded single end data. Transcript count is over 4,000 for certain genes when analyzed by STAR, but salmon does not detect the transcript. Is there any newer version of this branch or suggested configuration that I can use to test my data? Thank you.

rob-p commented 2 years ago

Hi @farshadf,

What exactly is the difference here (i.e. where are those reads going in salmon)? STAR doesn't perform stranded alignment. You can trying using the -l U library flag for salmon to see if it's a strandedness issue. In general, I'd always recommend using the latest version of salmon, which currently is 1.8.0 (though I doubt there would be a difference related to handling of stranded reads here).

Best, Rob

COMBINE-lab / salmon

Missing transcript #218

Output from Salmon

Output from STAR

Output from samtools view over the TBX21 gene start and end (hg38 17:47733244-47746119)

FastQC reports of the two fastqs

Output from CollectInsertSizeMetrics