higher level grammar? - Githubissues

kootenpv commented 7 years ago

I tried to create a grammar containing some simple tokens and then trying to go a next step.

Pretty much I'd like to define higher and higher level rules, but I get complaints a rule is not matching. I'd like to have many optional rules, but still make use of the advertised caching?

Let's imagine the following example, where the 2 higher level sent_good and sent_bad cannot be merged into one rule; they would require a different treatment according to business:

grammar = Grammar(
    """
    sent_good = i _ am _ good
    sent_bad  = i _ am _ bad
    _         = " "
    i         = "i"
    am        = "am"
    good      = "good"
    bad       = "bad"
    """
)

But when I try to parse:

>>> grammar.parse("i am bad")
parsimonious.exceptions.ParseError: Rule 'good' didn't match at 'bad' (line 1, column 6).

The main problem is I think that I would like my higher level rules to be optional in matching. Maybe even add a "how are you doing" rule.

I think my problem is obvious; how should I go about this?

cknv commented 7 years ago

grammar.parse uses by default the top rule, which means that you are trying to parse 'i am bad' with the sent_good rule. Which is why it will not work.

So you could add a new rule on top of the grammar with the following body: sent_good / sent_bad. At the syntax reference, this is called alternation.

An other way to do the same would be to do i _ am _ (good / bad), but I think that boils down to preference, and what you ultimately want to build with this.

Alternatively, you can lookup a rule and try to parse with it like this: grammar['sent_bad'].parse('i am bad'), but that does require you to know a bit in advance, which might defeat the purpose of parsing.

I hope it helps :)

kootenpv commented 7 years ago

Thanks for the quick reply! I was actually afraid this would be the answer...

I'm imagining a lot of rules, where most of them would not hit. I could also have no result. But having to state them all in an "OR" fashion seems very weird. And the final OR would be for a "no match"?

For example, how could we then also parse "how are you doing"?

cknv commented 7 years ago

To be honest, I am not sure how I would go about this. As the two snippets of text ("i am good/bad" and "how are you doing") do not share much structure. You can make the parser very generic, such as this:

words = word (_ word)+
word = ~"\w+"
_ = ~"\s+"

That will just give you a list of words though, so there is not much benefit to actually parsing here compared to just re.finditer(r'\w+', that_text_you_want_parsed), except for the parse nodes also matching the whitespace.

Taking a step back from the concrete examples: It kind of looks like you want to do natural language processing, is that the case? I don't think PEGs are very well suited for that, but I could totally be mistaken, as I don't know much about parse theory. Could be that there is some trick I just don't know about :)

lucaswiman commented 7 years ago

You can certainly use context-free grammars in natural language processing (NLP), though my layman's understanding is that's largely been rejected as a good explanation of language. So you could represent your grammar like:

sentence = statement / question
statement = noun_phrase verb_phrase "."
question = question_word verb_phrase noun_phrase "?"
noun_phrase = ...

If your goal is to process natural language, a library like NLTK might be a better fit. If your goal is to recognize a list of strings, I agree with @cknv that a regular expression would be a better tool.

Libraries like parsimonious are most useful for cases where there's a formal specification of a data format (examples might include json, html, markdown, HL7, etc.).

Alternatively, you could iterate through all the rules in the grammar, finding the ones which match. Something like the following should work:

def get_matching_rules(string, grammar):
    matches = []
    for rule_name, rule in grammar.items():
        try:
            rule.match(string)
        except parsimonious.exceptions.ParseError:
            continue
        else:
            matches.append(rule_name)
    return matches

kootenpv commented 7 years ago

Oh believe me, I have used NLTK, and the better spaCy library. I am currently using a regex Scanner, but I do not know how to nicely put "higher level patterns" on top of the simpler patterns. When I saw this, I thought it might be a solution :)

The code snippet does not really show to me that it will be a (efficient) solution. Maybe you have an idea for another type of parser which would be good at hierarchy of rules :)?

EDIT: Hah, I knew I saw your name somewhere, we've communicated over logpy/kanren ^^

wshayes commented 7 years ago

Finally something I can contribute to :) Typically this type of text processing (more NLP'ish) is done via filters (e.g first sentence detection, store the results in stand-off annotation i.e. not inline with the original text), then tokenization, then you could shift to tagging certain keywords (though the typical NLP next steps are part of speech and then chunking).

kootenpv commented 7 years ago

@wshayes I've handled sentence splitting, tokenization through re.Scanner, no need for POS, and am now looking at the part of managing groups of extracted tokens (indeed, including their span, type and textual match).

My problem is about the hierarchy of the tokens, as regex are a flat structure (and it is not easy/efficient to combine higher level constructs, as pointed out that this parser is good at caching). I did notice that within one Grammar, we can have several levels of hierarchy; i.e. composability. But the specific implementation makes it difficult to match wildly varying rules in 1 grammar.

lucaswiman commented 7 years ago

But the specific implementation makes it difficult to match wildly varying rules in 1 grammar.

It's hard to square your claim that there's a nice hierarchical structure with the three examples you've given. What are you actually trying to do?

erikrose commented 7 years ago

@kootenpv It looks like you're in good hands here, and, like the other commenters, I'm not sure I have a handle on what you're trying to do, but I did want to throw one more idea into the ring. You mentioned you'd already tokenized your input. PEGs are designed to encompass tokenization and parsing in a single pass. Fortunately, Parsimonious recently gained the experimental ability to parse pre-tokenized input. So TokenGrammar and Token might avail you.

Worst case, though, yes: if you make your top-level rule an alternation containing all your other possible starting rules, you should be able to parse any of the things in your example. The caching will make performance not so bad, though you could probably do even better if you factored out any repeated strings of "tokens" into their own rules: then the parser wouldn't have to start from the very top each time it tried a new branch of the alternation. Of course, this would change your final tree shape as well.

erikrose / parsimonious

higher level grammar? #106