allenai / allennlp-semparse

A framework for building semantic parsers (including neural module networks) with AllenNLP, built by the authors of AllenNLP
Apache License 2.0

How to use Bert for text2sql encoding? #18

Closed. entslscheia closed this issue 4 years ago.

entslscheia commented 4 years ago

I already know how to use Bert in AllenNLP (i.e., how to configure the tokenizer, indexer, and model for it). Now I want to use AllenNLP to reimplement some Text2SQL baselines (Text2SQL being the task of mapping natural language to SQL) that use Bert for encoding, and there are some implementation details I am not quite sure about.

For Text2SQL, the input is an utterance (a question in natural language) and the schema of the database from which the question can be answered. During the encoding phase, a common approach is to concatenate the utterance with all schema constants (i.e., table names and column names), separating them with [SEP], and feed the result to Bert. Schema constants can contain multiple words (e.g., the column name Student_ID has two words), and even a single word may be split into multiple word pieces. What people normally do is add another LSTM layer over the word pieces to get the final encoding for each schema constant.
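
For concreteness, here is a minimal PyTorch sketch of that span pooling; the sequence length, span positions, and variable names below are made up purely for illustration:

```python
import torch
import torch.nn as nn

hidden_size = 768  # BERT-base hidden size

# Pretend this is the BERT output for one concatenated sequence:
# (sequence_length, hidden_size)
bert_output = torch.randn(20, hidden_size)

# Wordpiece spans (start, end inclusive) for each schema constant, e.g.
# "Student_ID" -> ["stud", "##ent", "id"] occupying positions 5..7.
schema_spans = [(5, 7), (9, 10)]

span_lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)

constant_encodings = []
for start, end in schema_spans:
    span = bert_output[start : end + 1].unsqueeze(0)  # (1, span_len, hidden)
    _, (h_n, _) = span_lstm(span)                     # h_n: (1, 1, hidden)
    constant_encodings.append(h_n.squeeze(0).squeeze(0))

# One vector per schema constant.
constants = torch.stack(constant_encodings)  # (num_constants, hidden_size)
```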

I am still trying to figure out the best way to implement this using AllenNLP. What I am doing now is: for each instance I have a TextField for the target logical form indexed with a SingleIdIndexer, and another TextField that stores the concatenation of the utterance and the schema constants, indexed by a Bert indexer so I can feed it to Bert. To encode each token in the first TextField, I first need to determine whether it is a syntactic token (e.g., Select, Where, From) or a schema constant. If it is a schema constant, its encoding should be derived from an LSTM over the corresponding span of Bert output, so I need to know the span of each schema constant. Here is where the problem comes in: how do I store this information and pass it via forward?

What I currently have in mind is that, for the first TextField, on top of the SingleIdIndexer, I can define a new indexer that records the corresponding start and end positions in the concatenated sequence to be fed to Bert. If a token is a syntactic token (in other words, not in the concatenated sequence), then both the start and the end are -1. If it is a schema constant, I can fetch the corresponding span of Bert output based on the start and end, and then feed that span to an LSTM to get the final encoding. But this seems a little clumsy to me. Maybe there are some Fields provided by AllenNLP that can directly serve my purpose. Any suggestions? Many thanks!

TL;DR: How do I recover the span in the Bert input sequence for a specific token or phrase (for example student id: assuming it's tokenized into stud, ##ent, id, the corresponding span should have length 3), so we can apply an LSTM or pooling over that span to get a representation for it?

matt-gardner commented 4 years ago

My recommendation: use a single TextField and the PretrainedTransformer tokenizer, indexer, and embedder. Do all of the concatenation that you want before passing tokens into the TextField. As another field, pass in a mapping of offsets for the schema indices, and use that to pull them out after embedding using advanced indexing.

This way you'll get joint BERT encoding of the schema and the question, and if you're fine-tuning things on your final objective, you probably don't need an additional LSTM, as BERT can learn to put the important features in the first (or last) wordpiece of a word.
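
A rough sketch of what that could look like on the reader side (class names assume a recent AllenNLP release, and exact constructor arguments differ across versions; the function and field names are illustrative, not code from this repo):

```python
from typing import List

from allennlp.data import Instance
from allennlp.data.fields import IndexField, ListField, TextField
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")
token_indexers = {"tokens": PretrainedTransformerIndexer("bert-base-uncased")}

def text_to_instance(question: str, schema_constants: List[str]) -> Instance:
    # Build one concatenated token sequence and remember where each schema
    # constant starts.  (Depending on your AllenNLP version, the tokenizer may
    # insert [CLS]/[SEP] around every call, in which case the bookkeeping
    # below needs adjusting.)
    tokens = []
    first_wordpiece_indices = []
    for constant in schema_constants:
        first_wordpiece_indices.append(len(tokens))
        tokens.extend(tokenizer.tokenize(constant))
    tokens.extend(tokenizer.tokenize(question))

    text_field = TextField(tokens, token_indexers)
    # One IndexField per schema constant, pointing at its first wordpiece.
    schema_field = ListField(
        [IndexField(index, text_field) for index in first_wordpiece_indices]
    )
    return Instance({"tokens": text_field, "schema_constants": schema_field})
```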

entslscheia commented 4 years ago

> My recommendation: use a single TextField and the PretrainedTransformer tokenizer, indexer, and embedder. Do all of the concatenation that you want before passing tokens into the TextField. As another field, pass in a mapping of offsets for the schema indices, and use that to pull them out after embedding using advanced indexing.
>
> This way you'll get joint BERT encoding of the schema and the question, and if you're fine-tuning things on your final objective, you probably don't need an additional LSTM, as BERT can learn to put the important features in the first (or last) wordpiece of a word.

Thanks! Actually, the reason I need an additional LSTM is that a schema constant (i.e., a column or table name) may comprise several words rather than a single word, and I need a schema-constant-level representation.

entslscheia commented 4 years ago

Also, it's still a little unclear to me how to implement the mapping you mentioned. Are we just using a Python Dict and wrapping it with a MetadataField? If so, is there any suggested way to get the offset for each schema constant? And what key should be used for this mapping Dict? Just the original token strings? Can you please elaborate a little? @matt-gardner Really appreciate it!

matt-gardner commented 4 years ago

You can have a schema-constant-level representation without an additional LSTM. Create a token sequence like this:

[CLS] schema constant 1 [SEP] schema constant 2 [SEP] ... [SEP] actual question tokens [SEP]

(or however you want to do it)

Pass that as input to the TextField. As you're constructing the token sequence, keep track of the first token index of every schema constant. Put those into a ListField[IndexField], called schema_constants. In your model, you'll get a tensor of shape (batch_size, num_constants, 1), with values that go up to the length of your token sequence. You can use that to do advanced indexing to grab the embedding of the first wordpiece corresponding to each schema constant. Because the schema constants have gone through BERT, no additional LSTM is necessary.
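
A small sketch of that advanced indexing inside the model's forward pass (the shapes follow the description above; the function name and the squeeze are assumptions about how the index tensor arrives):

```python
import torch

def select_schema_constants(
    embedded_text: torch.Tensor,     # (batch_size, seq_len, hidden_size), BERT output
    schema_constants: torch.Tensor,  # (batch_size, num_constants, 1), from ListField[IndexField]
) -> torch.Tensor:
    indices = schema_constants.squeeze(-1)  # (batch_size, num_constants)
    batch_indices = torch.arange(
        indices.size(0), device=indices.device
    ).unsqueeze(-1)                         # (batch_size, 1), broadcasts over constants
    # Advanced indexing: one first-wordpiece embedding per schema constant.
    return embedded_text[batch_indices, indices]  # (batch_size, num_constants, hidden_size)
```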

entslscheia commented 4 years ago

@matt-gardner Thanks! It looks like there is no requires_grad option for the new Bert usage from pretrained_transformer, but I remember there was such an option for the previous bert-pretrained, right? Previously I could run Bert (with the options "type": "bert-pretrained", "pretrained_model": "bert-base-uncased", "requires_grad": false) on a single 11GB GPU, but now I cannot run it even on 2 GPUs with the options "type": "pretrained_transformer", "model_name": "bert-base-uncased". This seems kind of unexpected to me, since they are just different interfaces to the exact same model from Huggingface, right?

matt-gardner commented 4 years ago

Hmm, I don't know why you would get more memory usage with the new transformers, as it's exactly the same code underneath. If you can verify this increase in memory usage with other models, it's worth opening an issue on the main repo to point this out.

For the new pretrained transformers, we left out that option, as you basically always want it to be True. If you (or someone else) really wants to have it as an option, open a PR to add it.
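
In the meantime, one possible workaround (just a sketch, not an official option) is to freeze the embedder's weights yourself, e.g. in your model's constructor:

```python
from torch.nn import Module

def freeze(module: Module) -> None:
    # Turn off gradient updates for every parameter in the module (e.g. your
    # BERT text-field embedder), mimicking the old "requires_grad": false.
    for parameter in module.parameters():
        parameter.requires_grad_(False)
```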

entslscheia commented 4 years ago

@matt-gardner I found the reason for the memory increase... It's simply that for the old usage I set requires_grad to false, while for the new usage it's always true. It takes more memory to fine-tune BERT's parameters, so that's actually not a problem. Many thanks for your suggestions!