ServiceNow completed its acquisition of Element AI on January 8, 2021. All references to Element AI in the materials that are part of this project should refer to ServiceNow.
This is the official implementation of the following paper:

[Torsten Scholak](https://twitter.com/tscholak), Nathan Schucher, Dzmitry Bahdanau. [PICARD - Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models](https://arxiv.org/abs/2109.05093). *Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP).*

If you use this code, please cite:

```bibtex
@inproceedings{Scholak2021:PICARD,
  author = {Torsten Scholak and Nathan Schucher and Dzmitry Bahdanau},
  title = "{PICARD}: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models",
  booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2021",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.emnlp-main.779",
  pages = "9895--9901",
}
```

## Watch The Video

[![Watch the video](https://img.youtube.com/vi/kTpixsr-37w/maxresdefault.jpg)](https://youtu.be/kTpixsr-37w)

## Overview

This code implements:

* The PICARD algorithm for constrained decoding from language models.
* A text-to-SQL semantic parser based on pre-trained sequence-to-sequence models and PICARD achieving state-of-the-art performance on both the [Spider](https://yale-lily.github.io/spider) and the [CoSQL](https://yale-lily.github.io/cosql) datasets.

## About PICARD

> **TL;DR:** We introduce PICARD -- a new method for simple and effective constrained decoding from large pre-trained language models.
> On the challenging Spider and CoSQL text-to-SQL datasets, PICARD significantly improves the performance of fine-tuned but otherwise unmodified T5 models.
> Using PICARD, our T5-3B models achieved state-of-the-art performance on both Spider and CoSQL.

In text-to-SQL translation, the goal is to translate a natural language question into a SQL query.
There are two main challenges to this task:

1. The generated SQL needs to be semantically correct, that is, correctly reflect the meaning of the question.
2. The SQL also needs to be valid, that is, it must not result in an execution error.

So far, there has been a trade-off between these two goals:
The second problem can be solved by using a special decoder architecture that -- by construction -- always produces valid SQL.
This is the approach taken by most prior work.
Those decoders are called "constrained decoders", and they need to be trained from scratch on the text-to-SQL dataset.
However, this limits the generality of the decoders, which is a problem for the first goal.
A better approach would be to use a pre-trained encoder-decoder model and to constrain its decoder to produce valid SQL after fine-tuning the model on the text-to-SQL task.
This is the approach taken by the PICARD algorithm.

### How is PICARD different from existing constrained decoders?

* It’s an incremental parsing algorithm that integrates with ordinary beam search.
* It doesn’t require any training.
* It doesn’t require modifying the model.
* It works with any model that generates a sequence of tokens (including language models).
* It doesn’t require a special vocabulary.
* It works with character-, sub-word-, and word-level language models.

### How does PICARD work?

The following picture shows how PICARD is integrated with beam search.
Decoding starts from the left and proceeds to the right.
The algorithm begins with a single token (usually the start-of-sequence token `<s>`),
and then keeps expanding the beam with hypotheses generated token-by-token by the decoder.
At each decoding step and for each hypothesis,
PICARD checks whether each of the top-`k` next-token predictions is valid.
In the image above, only 3 token predictions are shown, and `k` is set to 2.
Valid tokens (☑) are added to the beam. Invalid ones (☒) are discarded. The `k+1`-th, `k+2`-th, ... tokens are discarded, too.
Like in normal beam search, the beam is pruned to contain only the top-`n` hypotheses.
`n` is the beam size, and in the image above it is set to 2 as well.
Hypotheses that are terminated with the end-of-sentence token (usually `</s>`) are not expanded further.
The algorithm stops when all hypotheses are terminated
or when the maximum number of tokens has been reached.
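To make the interplay with beam search concrete, here is a minimal Haskell sketch of a single constrained decoding step.
It is an illustration only, not the actual implementation: the names (`Hypothesis`, `decodingStep`, `topK`, `isValidPrefix`) are hypothetical, and handling of terminated hypotheses is omitted.

```haskell
-- Hypothetical names throughout; this is not the actual PICARD implementation.
import Data.List (sortOn)
import Data.Ord (Down (..))

type Token = Int

data Hypothesis = Hypothesis
  { tokens  :: [Token]  -- the partial output sequence
  , logProb :: Double   -- its cumulative log-probability
  }

-- One constrained decoding step: expand every hypothesis with its top-k
-- next-token candidates, keep only those expansions that the parser accepts
-- as a valid prefix, and prune the beam back to the n best hypotheses.
decodingStep
  :: Int                               -- k: candidates checked per hypothesis
  -> Int                               -- n: beam size
  -> (Hypothesis -> [(Token, Double)]) -- model: next tokens with log-probabilities
  -> ([Token] -> Bool)                 -- PICARD: is this token sequence a valid prefix?
  -> [Hypothesis]
  -> [Hypothesis]
decodingStep k n topK isValidPrefix beam =
  take n . sortOn (Down . logProb) $
    [ expanded
    | hyp <- beam
    , (tok, lp) <- take k (topK hyp)
    , let expanded = Hypothesis (tokens hyp ++ [tok]) (logProb hyp + lp)
    , isValidPrefix (tokens expanded)
    ]
```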
### How does PICARD know whether a token is valid?
In PICARD, checking, accepting, and rejecting of tokens and token sequences is achieved through *parsing*.
Parsing means that we attempt to assemble a data structure from the tokens
that are currently in the beam or are about to be added to it.
This data structure (and the parsing rules that are used to build it) encode the constraints we want to enforce.
In the case of SQL, the data structure we parse to is the abstract syntax tree (AST) of the SQL query.
The parsing rules are defined in a computer program called a parser.
Database engines, such as PostgreSQL, MySQL, and SQLite, have their own built-in parser that they use internally to process SQL queries.
For Spider and CoSQL,
we have implemented a parser that supports a subset of the SQLite syntax and that checks additional constraints on the AST.
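To give an idea of what such a data structure looks like, here is a deliberately tiny, hypothetical Haskell sketch of an AST for a small fragment of SQL.
The real AST and parser live in the PICARD code (see the module referenced below) and cover a much larger subset of SQLite.

```haskell
-- Hypothetical, heavily simplified AST covering only queries of the form
-- SELECT <columns> FROM <table> [WHERE <column> = <literal>].
-- The real AST is far richer and also carries the schema information needed
-- to check constraints such as "this column exists in that table".
newtype ColumnName = ColumnName String
newtype TableName  = TableName String

data Condition = Eq ColumnName String

data Query = Query
  { selectColumns :: [ColumnName]
  , fromTable     :: TableName
  , whereClause   :: Maybe Condition
  }
```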
In our implementation,
the parsing rules are made up of simpler rules and primitives that are provided by a third-party parsing library.
PICARD uses a parsing library called [attoparsec](https://hackage.haskell.org/package/attoparsec) that supports incremental input.
This is a special capability that is not available in many other parsing libraries.
You can feed attoparsec a string that represents only part of the expected input to parse.
When parsing reaches the end of an input fragment,
attoparsec will return a [continuation function](https://hackage.haskell.org/package/attoparsec-0.14.1/docs/Data-Attoparsec-Text.html#t:IResult)
that can be used to continue parsing.
Think of the continuation function as a suspended computation that can be resumed later.
Input fragments can be parsed one after the other when they become available until the input is complete.
Herein lies the key to PICARD:
Incremental parsing of input fragments is exactly what we need to check tokens one by one during decoding.
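Here is a small, self-contained example of attoparsec's incremental interface, using a toy parser as a stand-in for the real SpiderSQL parser.
It shows the `parse`/`feed` cycle described above: each call returns either a `Partial` continuation, a `Done` result, or a `Fail`.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text (IResult (..), Parser, feed, parse, skipSpace, string)

-- Toy stand-in for the real SpiderSQL parser: accepts exactly "select * from t1".
toyQuery :: Parser ()
toyQuery =
  string "select" *> skipSpace *> string "*" *> skipSpace
    *> string "from" *> skipSpace *> string "t1" *> pure ()

main :: IO ()
main = do
  -- Initialize with an empty fragment; attoparsec answers with a Partial
  -- result that wraps a continuation waiting for more input.
  let r0 = parse toyQuery ""
  -- Feed the input piece by piece, just as PICARD feeds decoded tokens.
  let r1 = feed r0 "select"
      r2 = feed r1 " * fr" -- a fragment may even end mid-keyword
      r3 = feed r2 "om t1"
  case r3 of
    Done _ _   -> putStrLn "input parsed successfully"
    Partial _  -> putStrLn "parser is still waiting for more input"
    Fail _ _ e -> putStrLn ("parse failure: " ++ e)
```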
In PICARD,
parsing is initialized with an empty string, and attoparsec will return the first continuation function.
We then call that continuation function with all the token predictions we want to check in the first decoding step.
For those tokens that are valid, the continuation function will return a new continuation function
that we can use to continue parsing in the next decoding step.
For those tokens that are invalid, the continuation function will return a failure value which cannot be used to continue parsing.
Such failures are discarded and never end up in the beam.
We repeat the process until the end of the input is reached.
The input is complete once the model predicts the end-of-sentence token.
When that happens, we finalize the parsing by calling the continuation function with an empty string.
If the parsing is successful, it will return the final AST.
If not, it will return a failure value.
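Sketched in Haskell, the per-token check and the finalization step might look as follows.
This is a simplified illustration of the idea, not the actual PICARD code; `checkToken`, `finalize`, and the `Check` type are hypothetical names.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text (IResult (..), Result, feed)
import Data.Text (Text)

-- Outcome of checking one candidate token against the parser state of a hypothesis.
data Check a
  = Keep (Result a)  -- token accepted; carry the new parser state forward
  | Discard          -- token rejected; this expansion never enters the beam

-- Feed one decoded token to the suspended parser of a hypothesis.
checkToken :: Result a -> Text -> Check a
checkToken parserState token =
  case feed parserState token of
    Fail {} -> Discard
    next    -> Keep next  -- a Partial continuation or a Done result both survive

-- Once the model predicts the end-of-sentence token, finalize by feeding an
-- empty fragment; only a Done result yields the final AST.
finalize :: Result a -> Maybe a
finalize parserState =
  case feed parserState "" of
    Done _ ast -> Just ast
    _          -> Nothing
```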
The parsing rules are described at a high level in the [PICARD paper](https://arxiv.org/abs/2109.05093).
For details, see the PICARD code, specifically the [Language.SQL.SpiderSQL.Parse module](https://github.com/ElementAI/picard/blob/main/picard/src/Language/SQL/SpiderSQL/Parse.hs).
### How well does PICARD work?
Let's look at the numbers:
#### On [Spider](https://yale-lily.github.io/spider)
URL | Based on | Exact-set Match Accuracy (Dev) | Exact-set Match Accuracy (Test) | Execution Accuracy (Dev) | Execution Accuracy (Test)
---|---|---|---|---|---
tscholak/cxmefzzi w PICARD | T5-3B | 75.5 % | 71.9 % | 79.3 % | 75.1 %
tscholak/cxmefzzi w/o PICARD | T5-3B | 71.5 % | 68.0 % | 74.4 % | 70.1 %
tscholak/3vnuv1vf w PICARD | t5.1.1.lm100k.large | 74.8 % | — | 79.2 % | —
tscholak/3vnuv1vf w/o PICARD | t5.1.1.lm100k.large | 71.2 % | — | 74.4 % | —
tscholak/1wnr382e w PICARD | T5-Large | 69.1 % | — | 72.9 % | —
tscholak/1wnr382e w/o PICARD | T5-Large | 65.3 % | — | 67.2 % | —
tscholak/1zha5ono w PICARD | t5.1.1.lm100k.base | 66.6 % | — | 68.4 % | —
tscholak/1zha5ono w/o PICARD | t5.1.1.lm100k.base | 59.4 % | — | 60.0 % | —
#### On [CoSQL](https://yale-lily.github.io/cosql)

URL | Based on | Question Match Accuracy (Dev) | Question Match Accuracy (Test) | Interaction Match Accuracy (Dev) | Interaction Match Accuracy (Test)
---|---|---|---|---|---
tscholak/2e826ioa w PICARD | T5-3B | 56.9 % | 54.6 % | 24.2 % | 23.7 %
tscholak/2e826ioa w/o PICARD | T5-3B | 53.8 % | 51.4 % | 21.8 % | 21.7 %
tscholak/2jrayxos w PICARD | t5.1.1.lm100k.large | 54.2 % | — | — | —
tscholak/2jrayxos w/o PICARD | t5.1.1.lm100k.large | 52.5 % | — | — | —