allenai / allennlp-semparse

A framework for building semantic parsers (including neural module networks) with AllenNLP, built by the authors of AllenNLP
Apache License 2.0

Using Bert for text2sql encoding #20

Closed entslscheia closed 4 years ago

entslscheia commented 4 years ago

Actually, I find that the problem I mentioned in #18 is more complicated than I thought. The problem is that I am encoding schema constants and syntactic constants in different ways: schema constants are concatenated with the utterance and encoded with BERT, while syntactic constants (SELECT, WHERE, etc.) go through an embedding layer. So how do I represent my target SQL sequence? If I were using an embedding layer for all tokens (both schema constants and syntactic constants), I wouldn't need to differentiate them, and a TextField would work well for the target SQL.

But now that schema constants and syntactic constants are used in totally different ways, I guess what I need is something resembling ProductionRuleField, which can differentiate global rules from linked rules, but that works for tokens instead of production rules. By analogy, a schema constant is similar to a linked rule, while a syntactic constant is similar to a global rule.

Does AllenNLP have an off-the-shelf Field that works for this directly, or that can at least serve as a workaround? I think this is a fairly common need when we use token-based decoding instead of grammar-based decoding for semantic parsing tasks. Maybe we need a new extended variant of TextField. Does that make sense? Any suggestions would be greatly appreciated!

matt-gardner commented 4 years ago

We don't have any off-the-shelf Field for this. If I were writing one, I would probably just pass in the schema and syntax constants as additional constructor arguments, and make an internal TextField as I described in #18 (including both schema and syntax constants somehow for joint BERT encoding). Then I would make two additional outputs, which are the start positions of the schema constants, and the start positions of the syntax constants, the same way as I described in #18. But it's easy enough to do this as three separate fields in a dataset reader; not sure that you really need to make a Field for this, unless you really want to.
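A minimal sketch of the "one joint sequence plus two lists of start positions" idea described above, in plain Python rather than AllenNLP fields. The function name `build_joint_sequence`, the splitting of schema names on underscores, and the whitespace-level token positions are all assumptions made for illustration. With a real wordpiece tokenizer the positions would have to be mapped to wordpiece offsets.

```python
from typing import List, Tuple


def build_joint_sequence(
    question: List[str],
    schema_constants: List[str],
    syntax_constants: List[str],
) -> Tuple[List[str], List[int], List[int]]:
    """Build one BERT-style input and record where each constant starts.

    Hypothetical helper: positions are counted in whitespace-level tokens,
    not wordpieces.
    """
    tokens = ["[CLS]"]
    schema_starts: List[int] = []
    syntax_starts: List[int] = []

    for constant in schema_constants:
        schema_starts.append(len(tokens))    # index of the constant's first token
        tokens.extend(constant.split("_"))   # STUDENT_TABLE -> STUDENT, TABLE
        tokens.append("[SEP]")

    for constant in syntax_constants:
        syntax_starts.append(len(tokens))
        tokens.append(constant)
        tokens.append("[SEP]")

    tokens.extend(question)
    tokens.append("[SEP]")
    return tokens, schema_starts, syntax_starts


tokens, schema_starts, syntax_starts = build_joint_sequence(
    question="What's the student ID of Jack ?".split(),
    schema_constants=["ID", "STUDENT_TABLE", "STUDENT_NAME"],
    syntax_constants=["SELECT", "FROM", "WHERE", "="],
)
print(schema_starts)  # [1, 3, 6]
print(syntax_starts)  # [9, 11, 13, 15]
```

In a dataset reader, the token list and the two position lists would then go into three separate fields, as suggested above.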

entslscheia commented 4 years ago

@matt-gardner Thanks for the response! The thing is that I am concatenating the schema constants with the question, not the syntax constants. I am not using BERT to get representations for syntax constants, and I think it might not make much sense to do so. To make it clearer, consider the following example:

question: What's the student ID of Jack?
schema: [ID, STUDENT_TABLE, STUDENT_NAME]
SQL: SELECT ID FROM STUDENT_TABLE WHERE STUDENT_NAME = 'jack'

Then for ID, STUDENT_TABLE and STUDENT_NAME, I want to get their representations by feeding [CLS] ID [SEP] STUDENT TABLE [SEP] STUDENT NAME [SEP] What's the student ID of Jack? [SEP] to BERT, while for syntax constants like SELECT, FROM, WHERE, =, I want to have an embedding layer, so they share the same representation across different data points. Then what's the best way to represent the target sequence SELECT ID FROM STUDENT_TABLE WHERE STUDENT_NAME = 'jack'? It looks like SELECT, FROM, WHERE, = should each be converted into an index in the global syntax constants vocabulary, while each schema constant should be converted into its position in the current input schema (e.g., ID has position 0 in schema: [ID, STUDENT_TABLE, STUDENT_NAME]).
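The mapping described above can be sketched in plain Python as follows. The names `encode_target`, `SYNTAX`, `SCHEMA` and the `VALUE` placeholder for quoted literals such as 'jack' are assumptions for illustration. The thread does not say how literal values should be handled.

```python
from typing import Dict, List, Tuple

SYNTAX, SCHEMA = "syntax", "schema"


def encode_target(
    sql_tokens: List[str],
    syntax_vocab: Dict[str, int],
    schema: List[str],
) -> List[Tuple[str, int]]:
    """Map each target token to (kind, index).

    Syntax constants get an index in the global vocabulary; schema constants
    get their position in this example's schema.
    """
    schema_positions = {name: i for i, name in enumerate(schema)}
    encoded: List[Tuple[str, int]] = []
    for token in sql_tokens:
        if token in schema_positions:
            encoded.append((SCHEMA, schema_positions[token]))
        elif token in syntax_vocab:
            encoded.append((SYNTAX, syntax_vocab[token]))
        elif token.startswith("'"):
            # Assumption: quoted literals collapse to a VALUE placeholder.
            encoded.append((SYNTAX, syntax_vocab["VALUE"]))
        else:
            raise KeyError(f"Unknown target token: {token}")
    return encoded


syntax_vocab = {"SELECT": 0, "FROM": 1, "WHERE": 2, "=": 3, "VALUE": 4}
schema = ["ID", "STUDENT_TABLE", "STUDENT_NAME"]
sql = "SELECT ID FROM STUDENT_TABLE WHERE STUDENT_NAME = 'jack'".split()

for token, pair in zip(sql, encode_target(sql, syntax_vocab, schema)):
    print(token, pair)
```

Each target position then carries two pieces of information, the kind of constant and an index whose meaning depends on that kind, which is what a plain TextField cannot express.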

matt-gardner commented 4 years ago

I don't have any particular insights on how to do this, I think, except that the easiest thing to do seems to be to just include the syntax constants in the BERT sequence, even if it might not be necessary. If it's not very long, what would it hurt?

entslscheia commented 4 years ago

@matt-gardner I agree that if we use BERT to encode both syntax and schema constants, then we can simply use a ListField[IndexField] to represent the whole target sequence. But it's not that reasonable to encode syntax constants using BERT, right? We may still need a global vocabulary for all syntax constants, so a single syntax constant cannot be represented using an IndexField. That's why I think we may need a new Field, similar to ProductionRuleField but for tokens, that is basically a container for two different Fields. All elements in a ListField must have the same type, so we cannot put two different Fields into it directly, and a new Field seems necessary.
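One possible workaround for the same-type constraint, not proposed in the thread and offered only as a sketch, is to fold both kinds of constant into a single integer index space: indices below the syntax vocabulary size refer to syntax constants, and indices at or above it refer to schema positions shifted by that size. The target then becomes a flat list of integers that any homogeneous sequence container can hold. The function names below are hypothetical.

```python
from typing import List, Tuple


def to_joint_indices(encoded: List[Tuple[str, int]], num_syntax: int) -> List[int]:
    """Fold (kind, index) pairs into one index space.

    Syntax constants keep their vocabulary index; schema constants are
    shifted by num_syntax so the two ranges never overlap.
    """
    joint: List[int] = []
    for kind, index in encoded:
        if kind == "syntax":
            if not 0 <= index < num_syntax:
                raise ValueError(f"Syntax index out of range: {index}")
            joint.append(index)
        elif kind == "schema":
            joint.append(num_syntax + index)
        else:
            raise ValueError(f"Unknown kind: {kind}")
    return joint


def from_joint_indices(joint: List[int], num_syntax: int) -> List[Tuple[str, int]]:
    """Invert to_joint_indices."""
    return [
        ("syntax", i) if i < num_syntax else ("schema", i - num_syntax)
        for i in joint
    ]


pairs = [("syntax", 0), ("schema", 0), ("syntax", 1), ("schema", 1)]
joint = to_joint_indices(pairs, num_syntax=4)
print(joint)  # [0, 4, 1, 5]
assert from_joint_indices(joint, num_syntax=4) == pairs
```

On the model side, this layout would correspond to concatenating the scores over the syntax vocabulary with the scores over the current schema entries before the softmax. Whether that is preferable to writing a dedicated Field is a design choice the thread leaves open.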