fulcrumgenomics / fgpyo

Quality of life improvements for Bioinformatics in Python.
https://fgpyo.readthedocs.io/en/latest/
Other
27 stars 3 forks source link

Add function(s) to set the 5' most alignment on a read as primary #122

Open msto opened 5 months ago

msto commented 5 months ago

This has come up in multiple client contexts and would be very helpful to have.

I think there are two functions that would be helpful:

def set_primary_alignments(template: Template) -> Template:
    """
    Identify the 5' most alignment on each read, and set it as primary.

    The formerly primary alignment will be marked as supplmementary and added 
    to the respective list of supplementary alignments (`r*_supplementals`).
    """
def is_5p_most_alignment(rec: AlignedSegment) -> bool:
    """
    Determine whether the alignment is the 5' most on its read.

    Check the record for supplementary alignments in its `SA` tag.
    If none exist, the record is 5' most by default.
    Otherwise, check the list of supplementary alignments, and identify if any are situated 5' to the current alignment.
    """

(the first could potentially include an argument to set the primary alignment on R1, R2, or both)

nh13 commented 5 months ago

A few questions:

  1. How about set_primaries?
  2. Why not have set_primary_alignments be a instance method on the Template class?
  3. I could see having set_primary_alignments take an enum for "picking strategy", for example: 5'-most-genomic, 3'-most-genomic, 5'-most-sequencing, 3'-most-sequencing, highest-alignment-score, max-mapped-query-bases, max-mapped-target-bases... This list is certainly overkill...
  4. After a new primary is set, we also have to update all the other alignments too right to point to the new primary alignment. Is that planned?
  5. Does is_5p_most_alignment only work on primary alignments? I don't think the SA tag is set on the supplementary alignments.