DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0
464 stars 113 forks source link

Clarify reporting options for multi-mapping reads. #279

Open ewallace opened 3 years ago

ewallace commented 3 years ago

My collaborators and I on the riboviz project are using hisat2, and finding the documentation on "reporting options" for multi-mapped reads ambiguous, so are requesting clarification.

Current documentation (v2.2.1, accessed 27 Jan 2021) says:

  • -k <int> It searches for at most <int> distinct, primary alignments for each read. Primary alignments mean alignments whose alignment score is equal to or higher than any other alignments. The search terminates when it cannot find more distinct valid alignments, or when it finds <int>, whichever happens first. The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates. Each reported read or pair alignment beyond the first has the SAM ‘secondary’ bit (which equals 256) set in its FLAGS field. For reads that have more than <int> distinct, valid alignments, hisat2 does not guarantee that the <int> alignments reported are the best possible in terms of alignment score. Default: 5 (linear index) or 10 (graph index).

I think the logic is:

  1. hisat2 searches for up to <int> valid alignments (i.e. those over --score-min). Then it stops when it reaches <int> alignments, although they may not have the best scores.
  2. By default, hisat2 reports each of the primary alignments, i.e. those with maximum score. So if multiple alignments are detected with equal maximum scores, they are all reported. However, if the --secondary flag is used, hisat2 reports all of the valid alignments.

Is the first alignment generated randomly/pseudorandomly or deterministically?

I think that the "Other options" section of documentation implies that multi-mapped reads will be mapped uniformly at random:

  • --non-deterministic Normally, HISAT2 re-initializes its pseudo-random generator for each read. It seeds the generator with a number derived from (a) the read name, (b) the nucleotide sequence, (c) the quality sequence, (d) the value of the --seed option. This means that if two reads are identical (same name, same nucleotides, same qualities) HISAT2 will find and report the same alignment(s) for both, even if there was ambiguity. When --non-deterministic is specified, HISAT2 re-initializes its pseudo-random generator for each read using the current time. This means that HISAT2 will not necessarily report the same alignment for two identical reads. This is counter-intuitive for some users, but might be more appropriate in situations where the input consists of many identical reads.

I think that the answers to #62 and #59 also imply that read assignment to near-duplicate genes is uniformly at random, tested on a pair of near-identical yeast paralogs TEF1/YPR080W and TEF2/YBR118W.

It would be helpful for this to be completely clear in the "reporting options" section. If if I have interpreted it right, random assignment of reads is currently explained only in the section of documentation dedicated to turning off random assignment.