Tokenization bug: labels do not include language AMR code

Currently, encode_penmanstrs calls the regular __call__ method of the tokenizer. The tokenizer does not know that it is supposed to tokenize the "target" (labels), so it formats the string as an input string, i.e. with the source language code instead of the target language code. In MBART, the input/label format is as follows:

Source: x_1 ... x_n [eos] [src_lang_code]
Target: x_1 ... x_n [eos] [tgt_lang_code]

(The target (labels) will later be shifted right to serve as input to the decoder. See MBART.)
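To illustrate the right shift mentioned above, here is a minimal pure-Python sketch of what it does to a single label sequence. It is an illustration only, not the library's implementation: the function name mirrors the helper used for MBART, but this version works on plain lists, assumes right padding, and all token ids are made up.

```python
from typing import List


def shift_tokens_right(labels: List[int], pad_token_id: int) -> List[int]:
    """Sketch: wrap the last non-padding token (the language code) to the front.

    Assumes right padding, so the last non-padding token sits at
    index (number of non-padding tokens - 1).
    """
    num_non_pad = sum(1 for token in labels if token != pad_token_id)
    if num_non_pad == 0:
        raise ValueError("labels contain only padding")
    lang_code = labels[num_non_pad - 1]
    # Every token moves one position to the right; the last position is dropped.
    return [lang_code] + labels[:-1]


# Hypothetical ids: 2 = [eos], 1 = [pad], 99 = [tgt_lang_code]
labels = [10, 11, 2, 99]
decoder_input = shift_tokens_right(labels, pad_token_id=1)
```

With labels of the form x_1 ... x_n [eos] [tgt_lang_code], the decoder input therefore starts with the target language code, which is why the last label token matters.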

A solution would be to pass the string through the tokenizer's text_target argument so that the target language code is used. However, it is not straightforward to add a special language code to the MBART tokenizer. We can add regular tokens, but there does not seem to be a simple way to register new language tokens. In other words, the added token amr_XX is not considered a special language token.

So the best way forward seems to be to:

replace the final token after tokenization with amr_XX;
delete the first token of the decoder's output before decoding, since the tokenizer does not know that amr_XX is a special token and skip_special_tokens therefore cannot filter it out.
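The two steps above could be sketched as follows. This is a rough illustration on plain lists of token ids rather than the actual fix: the helper names replace_final_token and drop_first_token are hypothetical, the ids are made up, and the sequences are assumed to be unpadded (i.e. the replacement happens before any padding is applied).

```python
from typing import List


def replace_final_token(label_ids: List[int], amr_token_id: int) -> List[int]:
    """Swap the trailing language code for the id of amr_XX (hypothetical helper)."""
    if not label_ids:
        raise ValueError("cannot replace the final token of an empty sequence")
    return label_ids[:-1] + [amr_token_id]


def drop_first_token(generated_ids: List[int]) -> List[int]:
    """Remove the leading amr_XX id from a generated sequence before decoding."""
    return generated_ids[1:]


# Hypothetical ids: 2 = [eos], 250004 = a language code, 250054 = amr_XX
AMR_ID = 250054
tokenized = [10, 11, 2, 250004]      # x_1 ... x_n [eos] [lang_code]
labels = replace_final_token(tokenized, AMR_ID)

generated = [AMR_ID, 10, 11, 2]      # decoder output starting with amr_XX
to_decode = drop_first_token(generated)
```

The remaining ids in to_decode can then be passed to the tokenizer's decoding method with skip_special_tokens enabled, which takes care of the regular special tokens such as [eos].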

BramVanroy / multilingual-text-to-amr
