BramVanroy / multilingual-text-to-amr

GNU General Public License v3.0

Linearization and tokenization #1

Closed BramVanroy closed 2 years ago

BramVanroy commented 2 years ago

Adds linearization of penman trees, based on a depth-first traversal of the tree. Linearization and delinearization are kept simple and effective by relying on string matching or regular expressions where applicable.
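To make the idea concrete, here is a minimal sketch of a depth-first linearization and its inverse. It is not the code in this PR: the tree shape `(variable, concept, edges)` and the function names `linearize` and `delinearize` are assumptions chosen for illustration, and the real implementation works on penman trees and its own token format.

```python
# Hypothetical tree shape (an assumption, not the penman library's own type):
#   (variable, concept, [(role, child), ...])
# where child is either another node tuple or a plain string
# (a constant or a re-entrant variable).

def linearize(node):
    """Flatten a tree into a token list with a depth-first, pre-order walk."""
    var, concept, edges = node
    tokens = ["(", var, "/", concept]
    for role, child in edges:
        tokens.append(role)
        if isinstance(child, tuple):
            tokens.extend(linearize(child))  # descend before visiting siblings
        else:
            tokens.append(child)
    tokens.append(")")
    return tokens


def delinearize(tokens):
    """Rebuild the nested tree from the flat token list."""
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        var, slash, concept = tokens[pos + 1:pos + 4]
        assert slash == "/", "expected '/' between variable and concept"
        pos += 4
        edges = []
        while tokens[pos] != ")":
            role = tokens[pos]
            pos += 1
            if tokens[pos] == "(":
                child = parse_node()
            else:
                child = tokens[pos]
                pos += 1
            edges.append((role, child))
        pos += 1  # consume the closing ")"
        return (var, concept, edges)

    return parse_node()


# "The boy wants to go", with the variable b re-used under go-02.
tree = ("w", "want-01", [
    (":ARG0", ("b", "boy", [])),
    (":ARG1", ("g", "go-02", [(":ARG0", "b")])),
])
linearized = " ".join(linearize(tree))
```

Because the walk emits an explicit opening and closing token for every node, the flat sequence carries enough information to restore the nesting, which is what makes a round trip possible.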

Adds tokenization by modifying the MBartTokenizer: we add some custom tokens and update the encode/decode methods to take full advantage of our linearization process. How spaces are fixed is especially important here, because it ensures that a new tree can be parsed correctly from the delinearized output.
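As an illustration of why space handling matters, the sketch below normalizes whitespace in a decoded string with a few regular expressions so that parentheses and roles are spaced consistently. The function name `fix_spaces` and the specific rules are assumptions for this example, not the rules used in this PR, and it deliberately does not protect quoted string literals that contain parentheses.

```python
import re


def fix_spaces(text):
    """Normalize spacing in a decoded graph string (hypothetical rules)."""
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    text = re.sub(r"\(\s+", "(", text)          # no space right after "("
    text = re.sub(r"\s+\)", ")", text)          # no space right before ")"
    text = re.sub(r"(?<=[^\s(])\(", " (", text) # one space before a nested "("
    return text


decoded = "(  w / want-01   :ARG0( b / boy ) )"
cleaned = fix_spaces(decoded)
```

A tokenizer that merges or splits on whitespace can easily emit strings like `decoded` above, where a role is glued to the following parenthesis. A normalization pass of this kind runs before the string is handed to the parser.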

Tests have been run on the whole AMR3.0 corpus to make sure that we can indeed: