So, here's roughly what I was thinking. It's peppered with TODOs and comments that should be cleaned up and documented once you decide what you like/dislike.
Stating the problem: Currently, parsing is done by scanning across input strings and pulling patterns out with string operations. This can become unwieldy as more translations are added, especially ones involving multi-word elements or similar patterns with very subtle differences, which may be handled prematurely and incorrectly.
Common Terms
Atom: a piece of text in a string that has some meaning. For now, I feel that regular expressions are the most flexible way of defining them, since word boundaries can be made explicit without chopping up the input string. The RegexAtom should have a class method that makes it dead simple to add a new one from a plain pattern, like RegexAtom.from_pattern(r"\bACFT MSHP\b"). Currently only the RegexAtom is implemented, but I could see a use for a CallableAtom that uses string methods to determine its presence in a larger body of text.
AtomSpan: the main product of the Atom. It describes the matching text from the pattern, the location of the pattern, and any contextual key-value information that can be decoded and consumed in the translations.
Translation: a callable that receives an AtomSpan and an input string and returns a translated string.
AtomHandler: an object that binds the translation (possibly more than one in the future) to the Atom and orchestrates the translation. Each handler should have a name that indicates its intended translation purpose, e.g. "Aircraft Mishap[ Handler?]"
Parser: the collection of handlers that does the actual parsing. There should be one parser for each type of parsing to be done, so the remarks module would really only have to instantiate a parser and provide a parse method that uses it. The Parser class could also have a default handler for errors.
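To make the terms above concrete, here is a minimal sketch of how the pieces might fit together. It is not the code in this PR: the field names on AtomSpan, the search and handle methods, the aircraft_mishap translation, and the sample input string are all assumptions made for illustration. Only RegexAtom.from_pattern, the "Aircraft Mishap" handler name, and the parse method come from the description above.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class AtomSpan:
    """What an Atom produces: matched text, its location, and any context."""
    match: str
    start: int
    end: int
    context: Dict[str, str] = field(default_factory=dict)


class RegexAtom:
    """An Atom defined by a compiled regular expression."""

    def __init__(self, regex: re.Pattern) -> None:
        self.regex = regex

    @classmethod
    def from_pattern(cls, pattern: str, flags: int = 0) -> "RegexAtom":
        return cls(re.compile(pattern, flags))

    def search(self, text: str) -> Optional[AtomSpan]:
        # Hypothetical method name: returns the first match only, or None.
        m = self.regex.search(text)
        if m is None:
            return None
        # Named groups become the key-value context; unmatched groups are dropped.
        context = {k: v for k, v in m.groupdict().items() if v is not None}
        return AtomSpan(m.group(0), m.start(), m.end(), context)


# A Translation receives an AtomSpan and the input string and returns a string.
Translation = Callable[[AtomSpan, str], str]


@dataclass
class AtomHandler:
    """Binds one translation to one Atom under a descriptive name."""
    name: str
    atom: RegexAtom
    translation: Translation

    def handle(self, text: str) -> str:
        span = self.atom.search(text)
        return text if span is None else self.translation(span, text)


class Parser:
    """Runs each handler in order over the input string."""

    def __init__(self, handlers: List[AtomHandler]) -> None:
        self.handlers = handlers

    def parse(self, text: str) -> str:
        for handler in self.handlers:
            text = handler.handle(text)
        return text


def aircraft_mishap(span: AtomSpan, text: str) -> str:
    # Illustrative translation: swap the matched span for plain language.
    return text[:span.start] + "aircraft mishap" + text[span.end:]


remarks_parser = Parser([
    AtomHandler("Aircraft Mishap",
                RegexAtom.from_pattern(r"\bACFT MSHP\b"),
                aircraft_mishap),
])

print(remarks_parser.parse("RMK ACFT MSHP NOSPECI"))
```

In this sketch each handler only translates the first match it finds, and there is no default error handler yet; both would need a decision before this becomes the real design.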
Application Structure:
I believe there are benefits from moving away from the flat package structure as this parsing system grows into more of a framework. However, only the classes required to make a module level parser should be in the __init__.py so that the API stays clean.
Things to think about:
Should the patterns/translations that are used for a module's parser be in their own package? The benefit would be much cleaner parsing modules, as only the handlers would need to be defined there. Since regex patterns are ugly and translations could possibly be reused, I can see the case for keeping them in their own spaces.
Should the AtomSpan be a dataclass? Dataclasses are mutable by default and I think immutability should be used whenever possible, but there's always the frozen=True option.
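For reference, this is roughly what frozen=True would buy us. The field names are assumptions, not the ones in this PR. Note that frozen=True only blocks attribute reassignment; a plain dict field could still be mutated in place, so this sketch wraps the context in a read-only MappingProxyType.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class AtomSpan:
    match: str
    start: int
    end: int
    # Read-only view so the context cannot be mutated in place either.
    context: Mapping[str, str] = field(
        default_factory=lambda: MappingProxyType({})
    )


span = AtomSpan("ACFT MSHP", 4, 13)

try:
    span.start = 0  # reassignment is rejected on a frozen dataclass
except FrozenInstanceError:
    print("AtomSpan fields cannot be reassigned")

# replace() builds a modified copy and leaves the original untouched.
shifted = replace(span, start=0, end=9)
print(span.start, shifted.start)
```

A translation that needs an adjusted span would go through dataclasses.replace rather than editing the span it was handed, which keeps handlers from affecting each other.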
I think this may end up being Gilligan's "Three Hour Tour". I'm happy to work on it with you, but we may want to make a Project Board for this and use issues for discussions on changes/implementation/planning.
Also, this code is ugly, unfinished, and undocumented. I wrote most of this on deadheads so by all means, change it all up! Once we get settled on the next steps, it may be a good idea to start writing up some docs for it.
Reference #23