NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

On punctuation and capitalization #3761

Closed · 1-800-BAD-CODE closed 2 years ago

1-800-BAD-CODE commented 2 years ago

Here are a few feature requests and bugs related to the punctuation and capitalization model.

Punctuation issues

Inverted punctuation

For languages like Spanish, we need two predictions per token to account for the possibility of inverted punctuation tokens preceding a word.
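
A minimal sketch of what two predictions per token could look like, assuming a standard PyTorch token-classification setup; the class and parameter names here are hypothetical, not NeMo's existing API:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: one classifier for punctuation preceding each token
# (e.g. Spanish '¿') and one for punctuation following it.
class PrePostPunctuationHead(nn.Module):
    def __init__(self, hidden_size: int, num_pre_labels: int, num_post_labels: int):
        super().__init__()
        self.pre_classifier = nn.Linear(hidden_size, num_pre_labels)
        self.post_classifier = nn.Linear(hidden_size, num_post_labels)

    def forward(self, encoded: torch.Tensor):
        # encoded: [batch, seq_len, hidden_size] from the encoder
        pre_logits = self.pre_classifier(encoded)    # punctuation before each token
        post_logits = self.post_classifier(encoded)  # punctuation after each token
        return pre_logits, post_logits
```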

Subword masking

Because subtoken masks are always applied, continuous-script languages (e.g., Chinese) cannot be punctuated without some sort of pre-processing (word segmentation). It would be useful if the model could process text in its native script.
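
To make the issue concrete, here is a rough illustration (not NeMo code) of how a first-subtoken mask is built from whitespace-split words, assuming a HuggingFace-style tokenizer with a tokenize method; without whitespace in the input, almost every position is masked out unless the text is pre-segmented into words:

```python
# Illustrative only: predict at the first subtoken of each whitespace-split word.
def first_subtoken_mask(words, tokenizer):
    mask = []
    for word in words:
        subtokens = tokenizer.tokenize(word)
        # keep the first subtoken's position, mask the rest of the word
        mask.extend([True] + [False] * (len(subtokens) - 1))
    return mask

# "hello world" yields two predictable positions, but an unsegmented Chinese
# string is treated as a single "word" and yields only one.
```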

Arbitrary punctuation tokens

The text-based data set does not allow punctuating languages such as Thai, where the space character is itself a punctuation token. These languages could be handled with a token-based data set and the removal of subword masks (essentially, resolving the other issues resolves this one).

Capitalization issues

Acronyms

The capitalization prediction is simply whether a word starts with a capital letter, so acronyms like 'amc' will not be correctly capitalized to 'AMC'.

Names that begin with a particle

Similar to the acronym issue, names that begin with a particle, e.g., 'mcdonald', cannot be properly capitalized to 'McDonald', since only the first letter can be upper-cased.

Capitalization is independent of punctuation

Currently, the two heads are conditioned only on the encoder's output and are independent of each other, but in many cases capitalization depends on punctuation.

For example, we may end up with "Hello, world, What's up?" because the capitalization head expected a period after 'world' while the punctuation head emitted a comma. Essentially, the capitalization head has to predict what the punctuation head will do.

In practice I have found this problem to manifest rarely, but to be correct, capitalization should take the punctuator's output into account. Implicitly, we are forcing the capitalization head to learn punctuation (and to predict the punctuation head's output).
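
One possible wiring, sketched below with hypothetical names, is to feed the punctuation head's output distribution (shifted so each word sees the punctuation predicted before it) into the capitalization head alongside the encoder states; this only illustrates the dependency and is not the existing model:

```python
import torch
import torch.nn as nn

class PunctuationConditionedCapitalization(nn.Module):
    def __init__(self, hidden_size: int, num_punct_labels: int, num_cap_labels: int):
        super().__init__()
        self.punct_head = nn.Linear(hidden_size, num_punct_labels)
        # the capitalization head also sees the punctuation distribution
        self.cap_head = nn.Linear(hidden_size + num_punct_labels, num_cap_labels)

    def forward(self, encoded: torch.Tensor):
        # encoded: [batch, seq_len, hidden_size]
        punct_logits = self.punct_head(encoded)
        punct_probs = punct_logits.softmax(dim=-1)
        # A word's case mostly depends on the punctuation of the *previous*
        # token, so shift the distribution right by one position.
        pad = punct_probs.new_zeros(punct_probs[:, :1, :].shape)
        prev_punct = torch.cat([pad, punct_probs[:, :-1, :]], dim=1)
        cap_logits = self.cap_head(torch.cat([encoded, prev_punct], dim=-1))
        return punct_logits, cap_logits
```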

Data set issues

Dataset as text

First (this may be a personal preference), it is unnatural to have a preprocessed data set in text format rather than token IDs. More importantly, text data sets are incompatible with fixes for the other issues raised in this thread (subword masking, space as punctuation).

Supported data set classes

The dataset is fixed to one class, but it would be more convenient to simply expect an abstract base class, let the user specify a _target_ in the config, and use hydra.utils.instantiate, as some other models do, e.g., https://github.com/NVIDIA/NeMo/blob/bc6215f166e69502fd7784fc73a5c2c39b465819/nemo/collections/tts/models/melgan.py#L298

For example, a user may wish to implement a different dataset that generates examples on-the-fly, or use a ConcatDataset with multiple languages and temperature sampling, etc.
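
A rough sketch of what that could look like, assuming hydra-core and OmegaConf; the dataset class path below is hypothetical:

```python
import hydra.utils
from omegaconf import OmegaConf

# The config names the concrete dataset class; any class implementing the
# expected abstract base could be swapped in without code changes.
cfg = OmegaConf.create(
    {
        "_target_": "my_project.data.OnTheFlyPunctCapDataset",  # hypothetical class
        "tokenizer_name": "bert-base-multilingual-cased",
        "max_seq_length": 128,
    }
)

# hydra.utils.instantiate imports the _target_ class and passes the remaining
# keys as constructor arguments.
dataset = hydra.utils.instantiate(cfg)
```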

Paragraph segmentation

A primary benefit of this model is to improve NMT results on unpunctuated and uncased ASR output. However, running MT on arbitrarily long inputs will inevitably end poorly.

For this model to be complete, I would argue it needs to implement a third token classification analytic: paragraph segmentation (splitting a paragraph into its constituent sentences). Translating sentences as separate units would improve results in many cases. Furthermore, a Transformer's runtime complexity is O(N^2) in the sequence length, so shorter segments are also cheaper to process.
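
As a rough sketch of how such a third head's output could be consumed downstream (names and label format are hypothetical), per-token sentence-boundary predictions could be used to split the restored text before translation:

```python
# Hypothetical: boundary_labels[i] is True if a sentence ends at token i.
def split_into_sentences(tokens, boundary_labels):
    sentences, current = [], []
    for token, is_boundary in zip(tokens, boundary_labels):
        current.append(token)
        if is_boundary:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing tokens with no predicted boundary
        sentences.append(" ".join(current))
    return sentences

# Each sentence can then be translated as a separate unit, which also keeps
# the quadratic attention cost per segment small.
```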

1-800-BAD-CODE commented 2 years ago

I'm willing to implement and contribute the following solution if there is sufficient interest.

This solution should generalize to any language and solve all problems mentioned above. Given that it is vastly different from the current punctuation + capitalization model, I would propose a separate model (e.g. TextStructuringModel) rather than trying to change the existing model.

Here's a little bit of reasoning on which this solution is founded:

The model's components are shown here, along with the training algorithm:

[diagram: text_structurizer_train]


At inference time, it would run like this:

[diagram: text_structurizer_infer]

ekmb commented 2 years ago

Hi @1-800-BAD-CODE, thank you for bringing up these valid issues and your willingness to contribute. It would be great if you could add a model that doesn't rely on subword masking, with a punctuation head that predicts pre- and post-punctuation marks. I agree that having a separate model would make more sense, and we can then refactor/deprecate the current model.

A few questions/comments:

  1. If a training example for the punctuation model represents more than one sentence, the model learns to split them into separate sentences, i.e., to place end-of-sentence punctuation marks between them, so paragraph segmentation seems redundant. Please also see this method that removes 'margin' probabilities when combining punctuation predictions from multiple segments (this was specifically added to segment long input texts for the NMT task); a generic sketch of the idea follows this list.
  2. I agree that capitalization should depend on punctuation. As for char-based tokenization, using a single LM would be preferable. The vocabularies of most pre-trained transformer models include single chars (see BERT's vocab.txt). However, the chars don't carry any semantic meaning, and the input sequence gets longer (this will affect the segmentation head if you still want it). An alternative solution could be to keep the same subword tokenization, add the punctuation head's output as an input to the capitalization head, similarly to [this](https://assets.amazon.science/55/3f/8c27b4014bdd983087fdb1d73412/robust-prediction-of-punctuation-and-truecasing-for-medical-asr.pdf), and use capitalization tags: {first_letter_upper, all_caps, all_lower, mixed (for McDonald cases)}; a small sketch of applying these tags also follows this list.
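
Regarding the margin method in point 1, here is a generic sketch of the idea (not the NeMo implementation): predictions near a window's edges see little context, so when a long input is split into overlapping windows, the edge predictions are dropped and the neighbouring window's predictions are used instead.

```python
# Generic illustration: assumes consecutive windows overlap by 2*margin tokens,
# so the trimmed pieces tile the full input exactly.
def merge_window_predictions(window_preds, margin):
    merged = []
    for i, preds in enumerate(window_preds):
        start = 0 if i == 0 else margin                  # keep the very first tokens
        end = len(preds) if i == len(window_preds) - 1 else len(preds) - margin
        merged.extend(preds[start:end])
    return merged
```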
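
And a small, hypothetical sketch of applying the capitalization tag set from point 2; the 'mixed' case cannot be reconstructed from the tag alone, so a lookup table stands in here as one simple fallback:

```python
# Hypothetical lookup for mixed-case words; a real system would need a better strategy.
MIXED_CASE_LOOKUP = {"mcdonald": "McDonald", "iphone": "iPhone"}

def apply_cap_tag(word: str, tag: str) -> str:
    if tag == "all_lower":
        return word.lower()
    if tag == "all_caps":
        return word.upper()                      # e.g. 'amc' -> 'AMC'
    if tag == "first_letter_upper":
        return word[:1].upper() + word[1:]
    if tag == "mixed":
        return MIXED_CASE_LOOKUP.get(word.lower(), word)
    return word
```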