I'm willing to implement and contribute the following solution if there is sufficient interest.
This solution should generalize to any language and solve all problems mentioned above. Given that it is vastly different from the current punctuation + capitalization model, I would propose a separate model (e.g. TextStructuringModel) rather than trying to change the existing model.
Here's a little bit of reasoning on which this solution is founded:
The model's components are shown here, along with the training algorithm:
At inference time, it would run like this:
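In code form, a rough sketch of the components and a single forward pass might look like the following (layer sizes, label sets, and the exact head layout are placeholders rather than a final design):

```python
# Rough sketch of a TextStructuringModel: an encoder over native-script token IDs
# (no subword masking) with per-token pre-/post-punctuation, capitalization, and
# sentence-boundary heads. All sizes and label sets below are placeholders.
import torch
import torch.nn as nn


class TextStructuringModel(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 256,
                 pre_punct_labels: int = 2,   # e.g. <null>, '¿'
                 post_punct_labels: int = 5,  # e.g. <null>, '.', ',', '?', ' '
                 cap_labels: int = 2,         # lower / upper per position
                 seg_labels: int = 2):        # sentence boundary yes / no
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.pre_punct_head = nn.Linear(hidden, pre_punct_labels)
        self.post_punct_head = nn.Linear(hidden, post_punct_labels)
        self.seg_head = nn.Linear(hidden, seg_labels)
        # Capitalization sees the punctuation logits as well, so it can condition
        # on predicted sentence boundaries instead of re-learning punctuation.
        self.cap_head = nn.Linear(hidden + post_punct_labels, cap_labels)

    def forward(self, input_ids: torch.Tensor) -> dict:
        encoded = self.encoder(self.embedding(input_ids))      # [B, T, H]
        pre_punct = self.pre_punct_head(encoded)               # [B, T, pre]
        post_punct = self.post_punct_head(encoded)             # [B, T, post]
        seg = self.seg_head(encoded)                           # [B, T, 2]
        cap = self.cap_head(torch.cat([encoded, post_punct], dim=-1))
        return {"pre_punct": pre_punct, "post_punct": post_punct,
                "cap": cap, "seg": seg}


# Toy usage: batch of 1, sequence of 8 token IDs.
model = TextStructuringModel(vocab_size=1000)
outputs = model(torch.randint(0, 1000, (1, 8)))
print({k: v.shape for k, v in outputs.items()})
```

The point of the sketch is the head layout, not the encoder: each token gets a pre-punctuation, post-punctuation, capitalization, and sentence-boundary prediction, which addresses the issues listed below.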
Hi @1-800-BAD-CODE, thank you for bringing up these valid issues and for your willingness to contribute. It would be great if you could add a model that doesn't rely on subword masking and whose punctuation head predicts pre- and post-punctuation marks. I agree that having a separate model would make more sense, and we can then refactor/deprecate the current model.
A few questions/comments:
Here are a few feature requests + bugs related to the punctuation and capitalization model.
Punctuation issues
Inverted punctuation
For languages like Spanish, we need two predictions per token to account for the possibility of inverted punctuation tokens preceding a word.
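For illustration, the per-token targets for restoring "¿Cómo estás?" from "cómo estás" might look like this (the label names are made up):

```python
# Hypothetical per-token targets: each token gets a pre-punctuation label
# and a post-punctuation label, so inverted marks can attach before a word.
tokens      = ["cómo",   "estás"]
pre_labels  = ["¿",      "<null>"]   # inverted mark precedes the first token
post_labels = ["<null>", "?"]        # closing mark follows the last token
```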
Subword masking
Because subtoken masks are always applied, continuous-script languages (e.g., Chinese) cannot be punctuated without some sort of pre-processing. It would be useful if the model processed text in its native script.
Arbitrary punctuation tokens
The text-based dataset does not allow punctuating languages such as Thai, where the space character is itself a punctuation token. These languages could work with a token-based dataset and no subword masks (essentially, resolving the other issues resolves this one).
Capitalization issues
Acronyms
The capitalization prediction is simply whether a word starts with a capital letter, so acronyms like 'amc' cannot be correctly capitalized to 'AMC'.
Names that begin with a particle
Similar to the acronym issue, words that begin with a particle, e.g., 'mcdonald', cannot be properly capitalized to 'McDonald'.
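One way to cover both the acronym and the particle cases would be character-level capitalization targets. A toy illustration (the label format here is an assumption, not the current model's behavior):

```python
# Hypothetical per-character capitalization targets (1 = upper-case this character).
# A single word-initial flag cannot produce "AMC" or "McDonald"; per-character
# labels can.
examples = {
    "amc":      [1, 1, 1],                 # -> "AMC"
    "mcdonald": [1, 0, 1, 0, 0, 0, 0, 0],  # -> "McDonald"
    "hello":    [1, 0, 0, 0, 0],           # -> "Hello"
}

def apply_caps(word: str, labels: list) -> str:
    return "".join(c.upper() if flag else c for c, flag in zip(word, labels))

for word, labels in examples.items():
    print(word, "->", apply_caps(word, labels))
```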
Capitalization is independent of punctuation
Currently, the two heads are conditioned only on the encoder's output and independent of each other. But capitalization is dependent on punctuation in many cases.
For example, we may end up with "Hello, world, What's up?" because the capitalization head expected a period after 'world'. Essentially, the capitalization head is predicting what the punctuation head will do.
In practice I have found this problem manifests only rarely, but to be correct, capitalization should take the punctuator's output into account. Implicitly, we are forcing the capitalization head to learn punctuation (and predict the punctuation head's output).
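A minimal sketch of one way to do this: embed the predicted punctuation labels and feed them to the capitalization head alongside the encoder states (names and sizes below are illustrative, not the existing model's API):

```python
# Condition capitalization on the punctuator's output: at inference time the
# argmax-ed punctuation labels are embedded and concatenated with encoder states
# before the capitalization head (at training time they could be teacher-forced).
import torch
import torch.nn as nn

hidden, num_punct, num_cap = 256, 5, 2
punct_label_emb = nn.Embedding(num_punct, 32)
cap_head = nn.Linear(hidden + 32, num_cap)

encoded = torch.randn(1, 8, hidden)           # encoder output [B, T, H]
punct_logits = torch.randn(1, 8, num_punct)   # punctuation head output
punct_ids = punct_logits.argmax(dim=-1)       # predicted punctuation per token
cap_logits = cap_head(torch.cat([encoded, punct_label_emb(punct_ids)], dim=-1))
print(cap_logits.shape)                       # [1, 8, 2]
```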
Data set issues
Dataset as text
First (this may be a personal preference), it is unnatural to have a preprocessed dataset in text format rather than token IDs. More importantly, text datasets are incompatible with the fixes for the other issues mentioned in this thread (subword masking, space as punctuation).
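For illustration, a token-ID-based example might look something like this (field names and label sets are hypothetical):

```python
# Hypothetical example stored as token IDs and per-position labels rather than
# raw text. Storing IDs sidesteps subword masking and makes "space as
# punctuation" representable as an ordinary post-punctuation label.
example = {
    "input_ids":  [1284, 2290, 1031, 776],  # unpunctuated, lower-cased tokens
    "pre_punct":  [0, 0, 0, 0],             # pre-token punctuation label per token
    "post_punct": [0, 2, 0, 1],             # post-token punctuation label per token
    "cap":        [1, 0, 1, 0],             # capitalization label per position
}
```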
Supported data set classes
The dataset is fixed to one class, but it would be more convenient to simply expect an abstract base class and let the user specify a _target_ in the config, to be instantiated with hydra.utils.instantiate as in some other models, e.g., https://github.com/NVIDIA/NeMo/blob/bc6215f166e69502fd7784fc73a5c2c39b465819/nemo/collections/tts/models/melgan.py#L298
For example, a user may wish to implement a different dataset that generates examples on-the-fly, use a ConcatDataset with multiple languages and temperature sampling, etc.
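A _target_-based dataset config could then be resolved like this (the dataset class below is a stand-in defined inline for the example; in practice _target_ would point at any class implementing the expected dataset interface):

```python
# Illustrative only: instantiating a dataset from a _target_ config with
# hydra.utils.instantiate. OnTheFlyPunctCapDataset is a hypothetical class.
from torch.utils.data import Dataset
from omegaconf import OmegaConf
import hydra


class OnTheFlyPunctCapDataset(Dataset):
    """Hypothetical dataset that generates punctuation/capitalization examples on the fly."""

    def __init__(self, tokenizer_name: str, max_length: int = 128):
        self.tokenizer_name = tokenizer_name
        self.max_length = max_length

    def __len__(self):
        return 0  # placeholder

    def __getitem__(self, idx):
        raise IndexError


cfg = OmegaConf.create({
    "dataset": {
        "_target_": "__main__.OnTheFlyPunctCapDataset",
        "tokenizer_name": "bert-base-multilingual-cased",
        "max_length": 128,
    }
})
dataset = hydra.utils.instantiate(cfg.dataset)
print(type(dataset).__name__, dataset.max_length)
```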
Paragraph segmentation
A primary benefit of this model is to improve NMT results on unpunctuated and uncased ASR output. However, running MT on arbitrarily long inputs will inevitably end poorly.
For this model to be complete, I would argue it needs to implement a third token classification task: paragraph segmentation (splitting a paragraph into its constituent sentences). Translating sentences as separate units would improve results in many cases. Furthermore, a Transformer's runtime complexity is O(N^2) in the sequence length.
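As a sketch, per-token sentence-boundary predictions could be used to split the restored text into sentences before translation (the label format is an assumption for illustration):

```python
# Split restored text into sentences using per-token boundary predictions
# (1 = last token of a sentence), so each sentence can be translated separately.
tokens = ["hello", "world.", "how", "are", "you?", "i'm", "fine."]
boundaries = [0, 1, 0, 0, 1, 0, 1]

sentences, current = [], []
for tok, is_end in zip(tokens, boundaries):
    current.append(tok)
    if is_end:
        sentences.append(" ".join(current))
        current = []
if current:                      # flush any trailing, unterminated sentence
    sentences.append(" ".join(current))

print(sentences)
# ['hello world.', 'how are you?', "i'm fine."]
```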