Adds linearization of penman trees, based on a depth-first traversal of the tree. Linearization/delinearization is kept simple and effective by using string matching or regex where applicable.
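As a rough illustration of the depth-first approach (a minimal sketch, not the code in this PR), the example below linearizes a nested `(variable, branches)` structure into a flat token list and rebuilds it. The tuple layout, the function names `linearize` and `delinearize`, and the use of plain parentheses as boundary tokens are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical tree shape for this sketch: (variable, [(role, target), ...]),
# where target is either an atom (str) or another node of the same shape.
Node = Tuple[str, list]


def linearize(node: Node) -> List[str]:
    """Depth-first traversal that emits one flat list of tokens."""
    var, branches = node
    tokens = ["(", var]
    for role, target in branches:
        tokens.append(role)
        if isinstance(target, tuple):
            tokens.extend(linearize(target))  # descend into the child node
        else:
            tokens.append(target)  # atom: concept, constant, or re-entrant variable
    tokens.append(")")
    return tokens


def _parse_node(tokens: List[str], pos: int) -> Tuple[Node, int]:
    """Rebuild one node starting at tokens[pos]; return it and the next position."""
    if tokens[pos] != "(":
        raise ValueError(f"expected '(' at position {pos}")
    var = tokens[pos + 1]
    pos += 2
    branches = []
    while tokens[pos] != ")":
        role = tokens[pos]
        if tokens[pos + 1] == "(":
            target, pos = _parse_node(tokens, pos + 1)
        else:
            target = tokens[pos + 1]
            pos += 2
        branches.append((role, target))
    return (var, branches), pos + 1


def delinearize(tokens: List[str]) -> Node:
    """Inverse of linearize for token lists produced by it."""
    node, pos = _parse_node(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing tokens after the root node")
    return node


tree = ("w", [("/", "want-01"),
              (":ARG0", ("b", [("/", "boy")])),
              (":ARG1", ("g", [("/", "go-02"), (":ARG0", "b")]))])
tokens = linearize(tree)
assert delinearize(tokens) == tree  # round trip preserves the tree
```

This sketch works on a token list, so it sidesteps spacing entirely. Once the linearization is joined into a string and passed through a tokenizer, spacing has to be restored before the tree can be rebuilt.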
Adds tokenization by modifying the MBartTokenizer. We add some custom tokens and update the encode/decode methods to take full advantage of our linearization process. The space-fixing step applied after decoding is especially important: without it, a new tree cannot be parsed correctly from the delinearized output.
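To give an idea of what fixing spaces can involve, here is a minimal regex-based sketch. It is not the rule set used in this PR: the function name `fix_spaces` and the three rules are assumptions, chosen to cover decoded text in which extra spaces appear around parentheses and after the colon of a role. It would also alter such sequences inside quoted string literals, which a real implementation has to protect.

```python
import re


def fix_spaces(text: str) -> str:
    """Hypothetical cleanup of decoded text before it is parsed as a tree."""
    text = re.sub(r"\s+", " ", text).strip()       # collapse runs of whitespace
    text = re.sub(r"\(\s+", "(", text)             # "( w" -> "(w"
    text = re.sub(r"\s+\)", ")", text)             # "boy )" -> "boy)"
    text = re.sub(r":\s+(?=[A-Za-z])", ":", text)  # ": ARG0" -> ":ARG0"
    return text


decoded = "( w / want-01 : ARG0 (  b / boy ) )"
fixed = fix_spaces(decoded)
```

With the input above, `fixed` is `(w / want-01 :ARG0 (b / boy))`.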
Tests have been run on the whole AMR3.0 corpus to make sure that we can indeed:
linearize a tree, delinearize it, and have the exact same tree as the initial tree
tokenize a linearized tree with the modified MBartTokenizer, decode the resulting token IDs, fix spaces, delinearize the result, and again have the exact same tree as the initial tree