gamcil / clinker

Gene cluster comparison figure generator
MIT License
507 stars 66 forks source link

RecursionError in `clinker.align.consolidate` (and a proposed fix) #49

Closed althonos closed 3 years ago

althonos commented 3 years ago

Hi again @gamcil ,

I used clinker to align 60 clusters, each of them sharing between 1 and 4 homolog protein. This caused clinker.align.consolidate to throw a RecursionError, as it seemed to have some issues merging everything.

As you pointed in the documentation, you used the Rosetta stone recursive implementation; there is an iterative implementation that would likely fix that issue. However, there is an algorithmic data structure that would work better for the kind of task that GlobalAligner.build_gene_groups is trying to achieve: a disjoint-set.

If you are fine adding an external pure-Python dependency (disjoint-set), here is the replacement code:

class GlobalAligner:

    # ...

    def build_gene_groups(self):
        """Builds gene groups based on currently stored gene-gene links."""

        ds = disjoint_set.DisjointSet()
        for link in self._links.values():
            ds.union(link.query.uid, link.target.uid)

        for genes in ds.itersets():
            group = Group(label=f"Group {len(self.groups)}", genes=list(genes))
            self.groups.append(group)
althonos commented 3 years ago

... I literally read #48 five minutes after opening this issue :sweat_smile:

gamcil commented 3 years ago

Haha yeah I just swapped it out for the iterative version but the disjoint set idea might still be better. I'd probably still welcome a PR for this, with your above code and the added dependency in setup.py