Linearization and tokenization

Adds linearization of Penman trees, based on a depth-first traversal of the tree. Linearization and delinearization are kept simple by relying on string matching or regular expressions where applicable.
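
To illustrate the idea, here is a minimal sketch of a depth-first linearization and its inverse. It is not the code in this PR: the `(variable, concept, edges)` tuple layout and the `<rel>` / `</rel>` markers are assumptions made for the example, and the actual special tokens may differ.

```python
import re

# Hypothetical structural markers; the PR's real special tokens may differ.
OPEN, CLOSE = "<rel>", "</rel>"


def linearize(node):
    """Flatten a (variable, concept, edges) tuple depth-first into one string."""
    var, concept, edges = node
    parts = [OPEN, var, "/", concept]
    for role, target in edges:
        parts.append(role)
        # A tuple target is a nested node; anything else is a constant or a
        # re-used variable and is emitted as-is.
        parts.append(linearize(target) if isinstance(target, tuple) else target)
    parts.append(CLOSE)
    return " ".join(parts)


def delinearize(text):
    """Rebuild the nested tuple from a string produced by linearize()."""
    # Keep double-quoted constants such as "New York" as a single token.
    tokens = re.findall(r'"[^"]*"|\S+', text)
    pos = 0

    def parse():
        nonlocal pos
        if tokens[pos] != OPEN or tokens[pos + 2] != "/":
            raise ValueError(f"malformed node at token {pos}")
        var, concept = tokens[pos + 1], tokens[pos + 3]
        pos += 4
        edges = []
        while tokens[pos] != CLOSE:
            role = tokens[pos]
            pos += 1
            if tokens[pos] == OPEN:
                target = parse()
            else:
                target = tokens[pos]
                pos += 1
            edges.append((role, target))
        pos += 1  # consume the closing marker
        return (var, concept, edges)

    return parse()


# "The boy wants to go", with variable b re-used under go-02.
tree = ("w", "want-01", [
    (":ARG0", ("b", "boy", [])),
    (":ARG1", ("g", "go-02", [(":ARG0", "b")])),
])
assert delinearize(linearize(tree)) == tree
```

The regex in `delinearize` is what keeps quoted constants with internal spaces intact; a plain `str.split()` would break them apart.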

Adds tokenization by modifying the MBartTokenizer. We add custom tokens and update the encode/decode methods to match our linearization format. Fixing the spaces in the decoded output is especially important here, because it ensures that a new tree can be parsed correctly from the delinearized output.
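
As a rough illustration of why space fixing matters, the sketch below normalizes a decoded string in which a subword tokenizer has split or glued pieces. The function name `fix_spaces`, the `<rel>` / `</rel>` markers, and the specific rules are assumptions for this example, not the rules implemented in this PR.

```python
import re


def fix_spaces(decoded):
    """Normalize whitespace in a decoded linearization (illustrative rules only)."""
    # Re-attach role names to their colon: ": ARG0" -> ":ARG0".
    text = re.sub(r":\s+(?=[A-Za-z])", ":", decoded)
    # Re-join two-digit sense ids to their frame: "want - 01" -> "want-01".
    text = re.sub(r"(\w)\s*-\s*(\d\d)\b", r"\1-\2", text)
    # Put exactly one space on each side of the (hypothetical) markers.
    text = re.sub(r"\s*(</?rel>)\s*", r" \1 ", text)
    # Collapse any remaining runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


raw = "<rel>w / want - 01 : ARG0<rel> b / boy</rel></rel>"
cleaned = fix_spaces(raw)
```

These rules are deliberately naive: they would also rewrite colons or hyphens inside quoted constants, so a real implementation needs to be more careful about where it applies them.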

Tests have been run on the whole AMR3.0 corpus to make sure that we can indeed:

- linearize a tree, delinearize it, and get back exactly the initial tree
- tokenize a linearized tree with the modified MBartTokenizer, decode the resulting token IDs, fix the spaces, delinearize the result, and get a final tree that is the same as the original tree
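
The round-trip checks listed above can be sketched as a small generic harness. The name `check_round_trip` and its arguments are hypothetical; in practice `encode` and `decode` would wrap the linearizer and tokenizer from this PR, and `items` would be the trees of the corpus.

```python
def check_round_trip(items, encode, decode):
    """Return (index, reason) for every item that does not survive encode -> decode."""
    failures = []
    for idx, item in enumerate(items):
        try:
            restored = decode(encode(item))
        except Exception as exc:  # record the error and keep going
            failures.append((idx, repr(exc)))
            continue
        if restored != item:
            failures.append((idx, "mismatch"))
    return failures


# Toy stand-ins: joining on spaces loses information when a token itself
# contains a space, which is the kind of problem the space fixing addresses.
def join_tokens(tokens):
    return " ".join(tokens)


def split_tokens(text):
    return text.split()


failures = check_round_trip([["a", "b"], ["a b"]], join_tokens, split_tokens)
```

Collecting failures instead of stopping at the first one makes it easier to see which trees in a large corpus are affected.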

BramVanroy / multilingual-text-to-amr

Linearization and tokenization #1