BramVanroy / multilingual-text-to-amr

GNU General Public License v3.0
Tokenization bug: labels do not include language AMR code #3

Closed BramVanroy closed 2 years ago

BramVanroy commented 2 years ago

Currently, encode_penmanstrs calls the regular __call__ method of the tokenizer. The tokenizer does not know that this is supposed to tokenize the "target" (labels) and therefore incorrectly formats the string as an input string. In MBART, the input/label format is as follows:

Source: x_1 ... x_n [eos] [src_lang_code]
Target: x_1 ... x_n [eos] [tgt_lang_code]

(The target (labels) will get right shifted later on to serve as input into the decoder. See MBART.)
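
To make the format and the right shift concrete, here is a minimal pure-Python sketch. It uses string tokens instead of ids, and the helper names `format_sequence` and `shift_right` are hypothetical, not part of this repository or of transformers. The shift follows my understanding of how MBART builds decoder inputs without padding (the trailing language code wraps around to the front), so treat it as an illustration rather than the library's implementation.

```python
from typing import List

EOS = "</s>"  # assumed end-of-sequence token string


def format_sequence(tokens: List[str], lang_code: str) -> List[str]:
    """Append [eos] and the language code, as in the MBART format above."""
    return list(tokens) + [EOS, lang_code]


def shift_right(labels: List[str]) -> List[str]:
    """Unpadded case: move the trailing language code to the front.

    The result is what the decoder receives as input during training.
    """
    return [labels[-1]] + labels[:-1]


penman_tokens = ["(", "w", "/", "want-01", ")"]

# Correct labels end in the target (AMR) language code.
labels = format_sequence(penman_tokens, "amr_XX")

# The bug described in this issue: the labels are formatted like an
# input string, so they end in the source language code instead.
buggy_labels = format_sequence(penman_tokens, "en_XX")

decoder_input = shift_right(labels)
print(labels)
print(decoder_input)
```

With the buggy labels, the decoder would be started with `en_XX` rather than `amr_XX`, which is why the label format matters here.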

A solution would be to pass the labels through the tokenizer's text_target argument so that the target language code is used. However, adding a special language code to the MBART tokenizer is not straightforward. We can add regular tokens, but adding language tokens does not seem to be supported in the same way. In other words, the added token amr_XX is not treated as a special language token.

So the best way forward seems to: