
Constrained-Text-Generation-Studio


Table of Contents

- Introduction
- Features
- Install Instructions
- Usage Instructions

Introduction

"Constrained Text Generation Studio" (CTGS) is an AI writing assistant for recreational linguists, poets, creative writers, and/or researchers to use and study the ability of large-scale language models to generate constrained text.

CTGS allows users to generate text, or choose from suggested continuations, under any combination of a wide variety of constraints, such as banning a particular letter, forcing the generated words to have a certain number of syllables, and/or forcing the words to be partial anagrams of another word. A partial list of these sorts of constraints can be found here

CTGS uses an extremely simple and intuitive algorithm. At each generation step, a language model samples from a probability distribution over its entire vocabulary (usually sub-word tokens). Why not simply ban the tokens in the vocabulary that violate the chosen constraints before the sampling step? This has two advantages over fine-tuning. The first advantage is that the model can never violate the imposed constraint, which is unfortunately impossible to guarantee with a fine-tuned model alone. The second advantage is that on constrained-writing datasets, this technique yields strictly better perplexity than fine-tuning alone (which makes sense, because we are literally banning errors).
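
A minimal sketch of that idea, assuming the Hugging Face transformers API (the lipogram constraint and the helper `violates_constraint` below are illustrative, not CTGS's internals):

```python
# Ban constraint-violating tokens from the vocabulary *before* sampling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

def violates_constraint(token_text: str) -> bool:
    # Example constraint: a lipogram banning the letter "e".
    return "e" in token_text.lower()

# Precompute the vocabulary ids that violate the constraint.
banned_ids = [
    tok_id for tok_id in range(tokenizer.vocab_size)
    if violates_constraint(tokenizer.decode([tok_id]))
]

input_ids = tokenizer("Once upon a", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]  # next-token logits

logits[banned_ids] = float("-inf")  # banned tokens can never be sampled
probs = torch.softmax(logits, dim=-1)
next_id = torch.multinomial(probs, num_samples=1)
print(tokenizer.decode(next_id))
```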

CTGS, along with its related datasets and a Hugging Face "space" web app called Gadsby, is presented as part of our paper "Most Language Models can be Poets too: An AI Writing Assistant and Constrained Text Generation Studio", to appear at The Second Workshop on When Creative AI Meets Conversational AI (CAI2), jointly held at The 29th International Conference on Computational Linguistics (COLING 2022).

Features

CTGS consists of three main components: the model, the filters, and the text transforms.

HF Integration

CTGS supports any causal language model available on Hugging Face. Future updates will add support for masked language models and for text-to-text models (which are already supported by Gadsby).
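
For example, swapping in a different causal LM should only require changing the checkpoint name (the name below is just an example):

```python
# Any causal LM checkpoint from the Hugging Face Hub should work;
# "EleutherAI/gpt-neo-125m" is only an example name.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
```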

Filters

CTGS currently has 21 filters. These filters are applied to every token in the LM's vocabulary after any text transforms have been applied. Any combination of these filters can be applied, as they are naturally composable (see the sketch below).
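
A rough sketch of that composability (the filter helpers below are hypothetical, not CTGS's actual implementations): a token survives only if it passes every enabled filter.

```python
# Filters are just predicates over a candidate token, so any set of
# them composes with a single all(). (Illustrative helpers only.)
def banned_letter(letter):
    return lambda token: letter not in token.lower()

def forced_letter(letter):
    return lambda token: letter in token.lower()

enabled_filters = [banned_letter("e"), forced_letter("a")]

def passes_all_filters(token: str) -> bool:
    return all(f(token) for f in enabled_filters)

print(passes_all_filters("again"))   # True: contains "a", no "e"
print(passes_all_filters("letter"))  # False: contains "e"
```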

The filters are as follows:

Text Transforms

Not all language models have the same kind of vocabulary. Most vocabularies include a wide variety of sub-words, full words, punctuation, spaces, and miscellaneous combinations of the above. Many of the filters are more effective when text normalization is run first. To that end, we also provide text transforms that operate on the vocabulary before the filtering step. There are 12 of them, and they are as follows:

Future Features

CTGS would massively benefit from the addition of several other features, which I am trying to add as time allows, though professional obligations make it difficult to do this as quickly as I'd like. For now, enumerating them here will hopefully pique a motivated person's interest in helping to knock these out and improve CTGS if I can't get to them in time.

Install Instructions

  1. Clone the repo
  2. cd into the repo directory (you may get font errors if you don't)
  3. pip install -r requirements.txt
  4. python3 Constrained-Text-Generation-Studio.py

Usage Instructions

The first time you run this, it may take a few minutes to be ready because distilgpt2 and fasttext are being downloaded from Hugging Face. Wait until you see a message in the Model Settings window saying the model loaded successfully before trying to use CTGS.

Right click anywhere within the text box to bring up a list of continuations that satisfy the enabled filters. Here the letter "e" is banned and the letter "a" is forced to appear.

The F1 key generates new tokens given the context and filters (it populates the right-click continuations box), and is equivalent to the Predict New Tokens button.
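
Under the hood, this amounts to ranking the tokens that survive the filters by probability. Continuing the sketch from the Introduction (reusing `probs` and `tokenizer`; an approximation, not CTGS's actual code):

```python
# Approximate what "Predict New Tokens" populates: the most probable
# tokens that survive the filters.
top = torch.topk(probs, k=10)
for p, tok_id in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.decode([tok_id])!r}: {p:.3f}")
```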

The F2 key directly inserts the next token into the text box using the model's decoding settings (top_p, top_k, and temperature). It is equivalent to the AI generate some tokens button. We can see an example of doing this with the default settings, with the letter "e" banned and the letter "a" forced to appear:
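
Conceptually, F2 performs one filtered sampling step with those decoder settings. A rough approximation using Hugging Face `generate`, again reusing names from the earlier sketch (CTGS applies its own filtering; `bad_words_ids` is just one way to express the ban):

```python
# Sample one token with top_k/top_p/temperature while the banned ids
# stay excluded. Reuses `model`, `tokenizer`, `input_ids`, and
# `banned_ids` from the earlier sketch; the settings are examples.
output = model.generate(
    input_ids,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    max_new_tokens=1,
    bad_words_ids=[[i] for i in banned_ids],
    pad_token_id=tokenizer.eos_token_id,  # avoid GPT-2's missing-pad warning
)
print(tokenizer.decode(output[0, -1:]))
```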

If you're not seeing continuations after pressing F2 or the AI generate some tokens button, make sure the model isn't generating spaces, line returns, or other blank characters.

You can enable filters, and see which ones are active, in the Filters window. In this example, we have banned the letter "e" and forced the letter "a" to appear.

Use the text transforms list to apply transforms to the vocabulary before the constraints are applied. To mitigate the problem of the LM generating spaces, you could, for example, use the filter blank outputs transform.
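
Conceptually, that transform acts as a pre-filter that drops whitespace-only vocabulary entries; a minimal sketch (the function name is mine, not CTGS's implementation):

```python
# A "filter blank outputs"-style transform: drop vocabulary tokens
# that are empty after stripping whitespace, so the sampler cannot
# keep emitting spaces or newlines. (Illustrative, not CTGS code.)
def is_blank(token: str) -> bool:
    return token.strip() == ""

vocab = [" ", "\n", " apple", "tree", "  "]
usable = [tok for tok in vocab if not is_blank(tok)]
print(usable)  # [' apple', 'tree']
```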

After typing or copying/pasting text into the text box, use the Predict New Tokens button or F1 to get new continuations (what you see when you right click) given your context.

This utility is written with the DearPyGUI GUI library and has the tiling mode enabled. You can move the windows around and tile them with each other to your heart's desire. I think a tool like this is a natural fit for a tiling-window-manager-style layout.

Hovering over a green question mark will pop up a tooltip with context/help.