deweylab / RSEM

RSEM: accurate quantification of gene and isoform expression from RNA-Seq data
http://deweylab.biostat.wisc.edu/rsem/
GNU General Public License v3.0
409 stars 118 forks source link

Does RSEM need HI:i:<n> tag to correctly process multimappers? #191

Open apredeus opened 1 year ago

apredeus commented 1 year ago

Hi all,

I have a niche project that requires the use of Hisat2. The resulting BAM file has a huge number of multimappers. RSEM runs well on it, but the overall number of counts in the output tables seems to be the same as number of lines in the BAM file - which is like 5 times more than the actual number of reads (because they are multimappers!).

I know this was not the issue with STAR transcriptomic BAMs. I've looked at the tags, and the only difference I see that STAR has and Hisat2 doesn't is the HI:i:<n> tag. Is this required for correct processing by RSEM? If so, do you know of any ready-made solution that would add the tag?

Thank you in advance!

-- Alex

aryan-9898 commented 1 year ago

Yes, RSEM does require the HI:i: tag in order to correctly process multimappers. The HI tag (HI:i:) is a field in SAM/BAM files that is used to indicate the index of a read within the input file. When processing multimapped reads, RSEM uses the HI tag to keep track of which reads are duplicates, so that they are not counted multiple times.

If the BAM files generated by Hisat2 do not contain the HI tag, you can add it using the samtools command-line tool. The samtools addreplacerg command allows you to add or modify the values of any field in the SAM/BAM header, including the HI tag.

Here's an example of how to add the HI tag to a BAM file using samtools:

samtools addreplacerg -r 'ID:1' -r 'SM:sample' -r 'PL:illumina' -r 'PU:unit' -r 'LB:library' -r 'DT:2023-05-03T12:34:56-0700' -R 'HiSeq2500' -i 'HI' -o output.bam input.bam

In this example, the ID, SM, PL, PU, LB, and DT tags are arbitrary and can be changed to match your specific sample information. The -R option specifies the sequencing platform used to generate the data, and the -i option specifies the name of the index tag to be added (in this case, "HI"). The input BAM file is "input.bam", and the output BAM file with the added HI tag is "output.bam".

apredeus commented 1 year ago

Hi Aryan,

Thank you for your answer! I am confused, however. RSEM appears to support hisat2 as a mapping tool - yet the options used by RSEM do not allow for HI tag to be outputted.

The issue I was having was due to sorting of the BAM file; apparently, transcriptomic files are always read name-sorted by default, and both RSEM and Salmon expect this to be true in order to work correctly. Adding HI tag does not seem to influence the result.