Some reads do not map the proper transcript...? Index of transcripts corrupted?

Hi,

First, thanks for the excellent software! I hope this comment will help someone.

I have this issue using rsem-calculate-expression and I found it posted in many places, but without not a clear explanation: The alignment of read "blablabla" to transcript 229379 starts at -80 from the forward direction, which should be a non-negative number! It is possible that the aligner you use gave different read lengths for a same read in SAM file.

In my case, I'm using bowtie2 for the alignment. I'm running the latest rsem version. I created a new hg38 index with the latest ENSEMBL gtf (101 version) simply like this: rsem-prepare-reference --gtf $gtf --bowtie2 $fa RSEMindex/Hs_GRCh38

I have a tolal of 229420 transcripts in this index. checked in the log of the reference and also like this: grep ">" Hs_GRCh38.transcripts.fa | wc -l

I am running almost defatult parameters: rsem-calculate-expression -p 3 --output-genome-bam --bowtie2 --bowtie2-k 30 --time $reads $RsemIndex $RSEM_out

I have 10 samples, and I obtain the same warnings in all of them. All about the same transcript, the 229379, with many different reads.

I had a look at the logs and temporal folders and saw this: cat sample1.temp/sample1.omit 229380 cat sample2.temp/sample2.omit 229380 ... and the same for all the 10 samples

I think that transcript 229380 is somehow a problem for RSEM I look for them in the index files and (i am using the softmask version):

ENST00000674911 → 229379 ATGGAAGAGGGAAACAACAATGAAGAGGTAATTCACTTGAACAACTTTCACTGCCATCGGGGACAAGACTTTGTAATTTTCTTCTGGAAAACCCAGATTATCCAAAGAGAGAAGACAGAATCATTATAAATCCCAGTAGCAGTCTGCTGGCCAGCCAAGATGAGACAAAGTTGCCTAAAATAAGACTTTTTTGACTATTCTAAATTGACTCCTCTTGACCAGCACTGCTTCATCCAAGCTGCTGACCTCCTCATGGCCGACTTCAAAGTGCTCAGTAGTCAGGACATCATGTGGGCCCTGCACGAGCTCAAAGGACACTATGCAATCACCCGAAAGGCCTTTTCTGATGCCATTAAAAAATGGCAGGAGCTATCACCAGAAACCAGTGGAAAAAGGAAAAAGAGAAAAGAAATGAACCAGTATTCTTAAATTGATTTCAAGTTTGAACAAG ENST00000674661 --> 229380 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

It is untderstable that RSEM omits this transcript which is full of Ns, and is the only one like that ( with this I check it → egrep -B 1 -v "A|T|G|C" Hs_GRCh38.idx.fa) But this is just a warning in the rsem-calculate-expression log (at least I think that this is what causes the warning below): Warning: The SAM/BAM file declares less reference sequences (229419) than RSEM knows (229420)! Please make sure that you aligned your reads against transcript sequences instead of genome.

What's going on? It is possible that the transcript index is corrupted as a result of this omitted transcript? Therefore, the coordinates of the transcripts are not correct and the read seems to have a negative start.

I remove that transcript from the gtf, redo the rsem index and try the rsem-calculate-expression again. Then, the warning disappears, the error message disappears. The .omit files are empty. The rsem-run-em do not crash anymore and the whole thing works.

I'm not quite sure if what I assume is happening is true. It's just based on this example ... but if it's true, I think this "warning" that migth modifies the index of the rsem transcript shouldn't be treated as a warning.

deweylab / RSEM

Some reads do not map the proper transcript...? Index of transcripts corrupted? #149