facebookresearch / Mask-Predict

A masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a partially masked target translation.

Why are there a lot of @@ in the data? #12

Closed yeliu918 closed 4 years ago

yeliu918 commented 4 years ago

Hi,

I notice that there are a lot of @@ in the data. For example, "Gut@@ ach : Incre@@ ased safety for pedestri@@ ans". It seems that "Incre@@ ased" means "Increased". Should we revise the files by deleting the @@ and merging the two tokens into one? I think preprocess.py ignores this problem and creates a dictionary that contains a lot of words with "@@".

Best, Ye
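
For reference, the merge described in the question can be sketched in a few lines of Python. This assumes the trailing `@@` is a subword continuation marker of the kind written by subword-nmt-style BPE segmentation, meaning "this token continues into the next one". The helper name `merge_subwords` and its `marker` parameter are hypothetical and are not part of this repository or of preprocess.py.

```python
import re


def merge_subwords(line: str, marker: str = "@@") -> str:
    """Join every token ending in `marker` with the token that follows it.

    Hypothetical helper for illustration; assumes `marker` is a subword
    continuation marker and never part of a real word.
    """
    # Delete "marker + whitespace" so "Incre@@ ased" becomes "Increased".
    merged = re.sub(re.escape(marker) + r"\s+", "", line)
    # Drop a dangling marker left at the very end of the line, if any.
    if merged.endswith(marker):
        merged = merged[: -len(marker)]
    return merged


example = "Gut@@ ach : Incre@@ ased safety for pedestri@@ ans"
print(merge_subwords(example))  # Gutach : Increased safety for pedestrians
```

A merge like this is mainly useful for reading the data or model output as plain words. Whether the training files themselves should be changed depends on whether the model is meant to operate on subword units, which this sketch does not decide.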