allenai / allennlp-semparse

A framework for building semantic parsers (including neural module networks) with AllenNLP, built by the authors of AllenNLP
Apache License 2.0

How to define a 'no-grammar' grammar using the class DomainLanguage? #3

Closed entslscheia closed 5 years ago

entslscheia commented 5 years ago

I am trying to implement several semantic parsers for lambda-DCS. I know production rules can be automatically generated from the functions we define inside DomainLanguage, and that we can use those production rules to define actions and transition functions, which are the basic building blocks of the AllenNLP semantic parsing framework. Before relying on DomainLanguage to define the grammar for lambda-DCS, I want to first implement a vanilla seq2seq semantic parser that assumes no constraints on the grammar. I know I can implement this directly using a seq2seq model; however, I want to take advantage of the semantic parsing framework provided by AllenNLP (e.g., it provides beam search, a copy mechanism, etc.). So my question is: what is my best bet to do this? Do I need to use DomainLanguage to define a no-grammar grammar that allows any sequence of tokens from a vocabulary?

matt-gardner commented 5 years ago

Good question. I think you have two options:

  1. Write a DomainLanguage that can generate any token sequence recursively.

  2. Write a new State object that doesn't depend on a grammar.

Either or both of these would be great to have in the repo, if you get them working. Some more detail on each:

  1. This option would probably be the least work, though it would result in you generating sequences hierarchically, right branching. You basically define a DomainLanguage like this:

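from typing import List

from allennlp_semparse.domain_languages.domain_language import DomainLanguage, predicate

Token = str  # if you want to have nicer type annotations

class TokenSequenceLanguage(DomainLanguage):
    def __init__(self, vocab):
        # Programs must evaluate to a list of tokens.
        super().__init__(start_types={List[Token]})
        for item in vocab:  # not quite right, but you get the idea
            # add_constant takes (name, value, type_); each token is its own value here.
            self.add_constant(item, item, type_=Token)
        # you could also take the current sentence in here, to handle copying separately

    @predicate  # only decorated methods become functions in the language
    def add_token(self, token: Token, token_list: List[Token]) -> List[Token]:
        return [token] + token_list

    @predicate
    def empty_list(self) -> List[Token]:
        return []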

Then you'll get programs that look something like add_token('print', add_token('(', add_token('"hello"', add_token('"world"', add_token(')', empty_list()))))). The benefit of this is that it's easy; the drawback is that it adds unnecessary hierarchy. The model can probably memorize this hierarchy reasonably well, though: doing it this way, at each timestep you decide whether to generate another token or stop, and then you decide which token to generate. It's completely right branching. So there shouldn't be much difference at all between a typical seq2seq model and this.
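To make the right-branching shape concrete, here is a small sketch in plain Python that does not depend on AllenNLP. The helper names `to_program` and `to_decisions`, and the `CONTINUE` / `TOKEN` / `STOP` labels, are made up for illustration; they are not the production-rule strings the library generates.

```python
from typing import List


def to_program(tokens: List[str]) -> str:
    """Nest a flat token list into a right-branching Lisp-like program."""
    program = "empty_list"
    # Build from the innermost (last) token outwards.
    for token in reversed(tokens):
        program = f"(add_token {token} {program})"
    return program


def to_decisions(tokens: List[str]) -> List[str]:
    """List the choices a decoder makes, top-down and left-to-right."""
    decisions = []
    for token in tokens:
        decisions.append("CONTINUE")        # pick add_token: emit one more token
        decisions.append(f"TOKEN {token}")  # pick which token to emit
    decisions.append("STOP")                # pick empty_list: end the sequence
    return decisions


tokens = ["print", "hello", "world"]
print(to_program(tokens))
print(to_decisions(tokens))
```

The decision list is the token sequence with a continue/stop choice interleaved, which is why this should behave much like an ordinary seq2seq decoder with an end-of-sequence symbol.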

  2. Just implement a State object that always returns the same set of actions at every timestep. This actually might also be very easy, though it looks like we need to move get_valid_actions to the base State class; it's currently only defined for GrammarBasedState. If you do this, you don't need a DomainLanguage at all.
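A rough sketch of the second option, in plain Python. `GrammarFreeState`, its fields, and the `@end@` marker are hypothetical names chosen for illustration; this is not the AllenNLP `State` interface, which also tracks batch indices, scores, and so on. It only shows the idea of a state whose valid actions never change.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GrammarFreeState:
    """Hypothetical decoder state whose valid actions are the same at every step."""

    vocab: Tuple[str, ...]         # every token the decoder may emit
    end_symbol: str = "@end@"      # assumed end-of-sequence marker
    history: Tuple[str, ...] = ()  # actions taken so far

    def get_valid_actions(self) -> List[str]:
        # No grammar: the full vocabulary plus the end symbol, regardless of history.
        return list(self.vocab) + [self.end_symbol]

    def is_finished(self) -> bool:
        return bool(self.history) and self.history[-1] == self.end_symbol

    def take_action(self, action: str) -> "GrammarFreeState":
        if self.is_finished():
            raise ValueError("state is already finished")
        if action not in self.get_valid_actions():
            raise ValueError(f"unknown action: {action}")
        # States are immutable, so beam search can branch from one state safely.
        return GrammarFreeState(self.vocab, self.end_symbol, self.history + (action,))


state = GrammarFreeState(vocab=("print", "(", ")"))
for action in ["print", "(", ")", "@end@"]:
    state = state.take_action(action)
print(state.history, state.is_finished())
```

Making the state immutable is a design choice for the sketch: each candidate on a beam gets its own copy of the history, while the action set is shared and constant.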

If you have questions about either of these, I'm happy to answer them. And as I said, I'd love to see both of these options implemented in the repo, if you're willing to contribute back.

entslscheia commented 5 years ago

Thanks for the explanation! So does it mean that now, at least for lambda-DCS, we can totally get rid of parsimonious_languages and nltk_languages?

matt-gardner commented 5 years ago

If you want lambda-DCS, you need the nltk language. We don't have a way to handle variables with the DomainLanguage grammar induction. For WikiTableQuestions, though, we found that using a different language was better than lambda-DCS (probably because of the difficulty of integrating with SEMPRE for program execution), so we don't actually use lambda-DCS for anything at this point.

entslscheia commented 5 years ago

Ok. It looks like it's still kind of convoluted to define the actions for lambda-DCS, even if I only need to generate logical forms without executing them.

entslscheia commented 5 years ago

> If you want lambda-DCS, you need the nltk language. We don't have a way to handle variables with the DomainLanguage grammar induction. For WikiTableQuestions, though, we found that using a different language was better than lambda-DCS (probably because of the difficulty of integrating with SEMPRE for program execution), so we don't actually use lambda-DCS for anything at this point.

DomainLanguage can only be used to define Lisp-like languages, right?

matt-gardner commented 5 years ago

Yes, the DomainLanguage currently only allows for programs that are a single Lisp-like function execution.
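As an illustration of what "a single Lisp-like function execution" means, here is a minimal standalone evaluator. It is a sketch and not the library's implementation; `parse`, `execute`, and the `functions` table are hypothetical names. Every program is one nested function application, and there is no construct for binding variables, which is the piece lambda-DCS would need.

```python
from typing import Any, Callable, Dict, List


def parse(program: str) -> Any:
    """Parse a Lisp-like string into nested Python lists of strings."""
    tokens = program.replace("(", " ( ").replace(")", " ) ").split()
    stack: List[list] = [[]]
    for token in tokens:
        if token == "(":
            stack.append([])
        elif token == ")":
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(token)
    return stack[0][0]


def execute(expression: Any, functions: Dict[str, Callable]) -> Any:
    """Evaluate a nested expression by applying named functions to arguments."""
    if isinstance(expression, list):
        function = functions[expression[0]]
        arguments = [execute(arg, functions) for arg in expression[1:]]
        return function(*arguments)
    # A bare name is either a zero-argument function or a plain token.
    if expression in functions:
        return functions[expression]()
    return expression


functions = {
    "add_token": lambda token, rest: [token] + rest,
    "empty_list": lambda: [],
}
result = execute(parse("(add_token hello (add_token world empty_list))"), functions)
print(result)
```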