SalomonisLab / altanalyze3

Apache License 2.0
2 stars 3 forks source link

Working on junctions annotations #17

Closed michael-kotliar closed 1 year ago

michael-kotliar commented 1 year ago

Juncounts and intcounts files are processed separately. First, we load all juncount files and JOIN them by coordinates (SQL type of JOIN), so at the end, we have only unique chr:start-end regions. These coordinates are then saved into the sorted indexed BED file. Technicaly, we save three files - one with the correct coordinated and two with only start and only end coordinates. We need these two additional files to easily iterate over the sorted start and end coordinates separately, but we still want to be able use BED format for that. Aggregated counts are saved separately in the python pickled format as they will be loaded after annotation assignment.

Then we separately try to annotate all junction start and all junction end coordinates. For that we first iterate over the sorted indexed gene model BED file and sorted indexed junction start BED file. For each start coordinate we identify the region from the gene model that includes this coordinate and save it in a form of Annotation named tuple. In addition to the gene, exon, and strand, we also save the match type which can be one of the AnnMatchCat enum class and position that won't be 0 only when the match type is one of EXON_MID, INTRON_MID, or CLOSEST. CLOSEST is the specific type of match for those cases when the junction coordinate is at the beginning or at the end of the chromosome and didn't overlap any of the known from gene model file regions. The same procedure is done for the junction end coordinates. Note, that multiple annotations can be assigned to the same start or end junction coordinate as certain genes may have overlapping regions.

Then, once we collected all annotations from our junction start and end coordinates, we iterate over all junctions and try to find the best annotation for each of them. If we have multiple annotations for junction start and junctions end we iterate over the product of them and choose the best. The logic of choosing the best annotation is implemented in the GroupedAnnotations class. Briefly, we try to find all exact, partial and distant matches. We call distant all the matches when junction start and end coordinates belong to the different genes. The best annotation is either the first found exact match, or the first found partial match, or the first found distant match, depending of what kind of matched were found.

Once we assigned all junctions annotations, we load intcounts which are already annotated and do the same type of JOIN as we did for the juncounts. However, we don't need to assign any annotation for them. Loaded intcounts and annotated juncounts are then merged into one big list of regions sorted by coordinates. The results are then saved into the h5ad and BED files (in the BED file we save only coordinates and annotations to be visualized in IGV)

class AnnMatchCat (enum.IntEnum):
    EXON_START = enum.auto()
    EXON_END = enum.auto()
    EXON_MID = enum.auto()
    INTRON_MID = enum.auto()
    CLOSEST = enum.auto()
    def __str__(self):
        return self.name

Annotation = namedtuple(
    "Annotation",
    "gene exon strand position match"
)