fulcrumgenomics / fgpyo

Quality of life improvements for Bioinformatics in Python.
https://fgpyo.readthedocs.io/en/latest/
Other
27 stars 3 forks source link

Feature: SAM tag enum #102

Open msto opened 8 months ago

msto commented 8 months ago

I think it would be helpful to provide an enum that describes the standard SAM tags.

I am happy to implement this but would appreciate feedback on design considerations before starting.

I have two primary questions.

  1. Should the member names be the actual SAM tags, or more semantically meaningful?

    e.g.

    class SamTag(str, Enum):
        """Standard SAM tags."""
    
        RG: "RG"
        """Read group."""
    
        RX: "RX"
        """Sequence bases of the (possibly corrected) unique molecular identifier."""

    or

    class SamTag(str, Enum):
        """Standard SAM tags."""
    
        READ_GROUP: "RG"
        """Read group."""
    
        UMI: "RX"
        """Sequence bases of the (possibly corrected) unique molecular identifier."""

    (note that I suggest mixing in str so the enums can be passed directly to pysam's tagging functions, e.g. read.has_tag(SamTag.UMI))

  2. To permit extensions to custom tags, I think providing a class decorator would be more helpful than subclassing. The decorator would

    1. inject the standard tags
    2. enforce uniqueness using enum.unique
    3. enforce that custom tags are two-character strings, and
    4. optionally enforce that custom tags adhere to SAM convention, namely that tags start with "X", "Y", or "Z", or are lowercase

    (i) and (ii) would be achievable with subclassing, but I think a decorator is necessary to provide the validations on custom tags. Thoughts?

    e.g.

    
    @sam_tag(strict=False)
    class ClientTag(str, Enum):
        """Custom SAM tags used for $client."""
    
        FOO: "XF"
        """Foo."""
    
        BAR: "XB"
        """Bar."""
nh13 commented 8 months ago

Does this belong in pysam, or in htslib and exposed in pysam? Have we asked them if they would accept a contribution? And if not, why?

msto commented 8 months ago

I can open a similar issue on pysam.

I don't think I have a clear mental model on distinguishing the features we consider for inclusion in fgpyo vs attempting to submit upstream, so I started here

msto commented 8 months ago

https://github.com/pysam-developers/pysam/issues/1272

msto commented 7 months ago

@nh13 Given the lack of response from pysam, would this be re-considered for inclusion in fgpyo?

I have an initial implementation at https://github.com/msto/sam_tags/ that I would love to merge in here

msto commented 5 months ago

@nh13 bump 🙂

msto commented 4 months ago

@nh13 bump

nh13 commented 4 months ago

I am ok with this. @tfenne ?

tfenne commented 4 months ago

I'm game for the first half - i.e. an enum that encodes standard values. I generally prefer to use the tags for the name (so e.g. NM, SA) rather than a longer name because the former is explicit and well-known and I think this is mostly to prevent typos?

I'm not sure I see the value (yet) in the decorator for making other enums of sam tags. I.e. I tend to think we should use str anywhere we require a tag, and then you can supply one of these enum values, a str directly or another similar enum.