bcgsc / ProbeGenerator

Determine the nucleotide sequences of mutations

Better parsing #1

Open ahammel opened 9 years ago

ahammel commented 9 years ago

Right now, each of the *Probe classes defines its own ad-hoc parser built around a complicated regular expression. This results in a lot of duplicated code, not only in the parsers themselves but also in the disambiguation logic: every probe class repeats the rules for disambiguating amino acid sequences, gene names, globbed exons, etc.

It would be a lot better if we had a centralized, EBNF-style parser, *Probe classes that expect to be fed a parse tree, and a centralized disambiguator. These would be used something like this:

class SomeKindOfProbe(AbstractProbe):
    def __init__(self, statement):
        # 'statement' is a data structure containing all the information in a
        # probe statement, including the comments
        self.statement = statement
        self.variant = self._make_variant()

    @staticmethod
    def explode(statement, annotation):
        # A static method takes no 'self'. One probe is yielded for each
        # disambiguated statement, rather than returning only the first.
        parse_tree = Parser.parse(statement)
        for parsed_statement in Disambiguator.disambiguate(
                parse_tree, annotation):
            yield SomeKindOfProbe(parsed_statement)

    def _make_variant(self):
        # This method contains the logic which produces a Variant given the
        # information in the disambiguated statement (self.statement).
        ...
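
To make the division of labour concrete, here is a minimal, self-contained sketch of what the centralized Parser and Disambiguator could look like. Everything in it is an assumption made for illustration: the toy statement format ("GENE#EXON -- comment", where either part may contain a '*' glob), the Statement tuple, and the shape of the annotation (a list of (gene, exon number) pairs) are hypothetical and are not the project's actual syntax or data structures. A real implementation would replace the string splitting with a grammar-based parser.

```python
import fnmatch
from collections import namedtuple

# Hypothetical parse result for the toy format "GENE#EXON -- comment".
Statement = namedtuple("Statement", ["gene", "exon", "comment"])


class Parser:
    """Single place where probe statements are turned into a parse tree."""

    @staticmethod
    def parse(text):
        # Split off the optional comment, then split the gene from the exon.
        body, _, comment = text.partition("--")
        gene, separator, exon = body.strip().partition("#")
        if not separator or not gene or not exon:
            raise ValueError("cannot parse statement: {!r}".format(text))
        return Statement(gene.strip(), exon.strip(), comment.strip())


class Disambiguator:
    """Single place where globs are expanded against the annotation."""

    @staticmethod
    def disambiguate(parse_tree, annotation):
        # 'annotation' is assumed to be an iterable of (gene, exon) pairs.
        for gene, exon in annotation:
            if not fnmatch.fnmatchcase(gene, parse_tree.gene):
                continue
            if not fnmatch.fnmatchcase(str(exon), parse_tree.exon):
                continue
            # Yield a fully specified statement, keeping the comment.
            yield Statement(gene, str(exon), parse_tree.comment)


if __name__ == "__main__":
    annotation = [("ABC", 1), ("ABC", 2), ("DEF", 1)]
    tree = Parser.parse("ABC#* -- every exon of ABC")
    for statement in Disambiguator.disambiguate(tree, annotation):
        print(statement)
```

With something along these lines in place, a probe class only has to turn one fully specified statement into a Variant; it never has to look at raw text or expand globs itself.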
ahammel commented 9 years ago

I've made an EBNF parser using the SimpleParse library. It's in the 'parser' branch.

SimpleParse doesn't support Python 3 directly, so before we can integrate it we'll need to update the build process to do automatic dependency resolution and 2to3 conversion.