facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

After running translate.py, there are many '@' in result file #269

Open 645709712 opened 4 years ago

645709712 commented 4 years ago

I am doing th-en translation, and after running translate.py I got results with many '@' in the translated sentences (screenshot omitted). I think this is largely related to the BPE algorithm (the valid/test results contain no '@'). What should I do to solve or improve this problem? Thank you.

645709712 commented 4 years ago

W@@ ait for more money and then fill it up . Not sure . Un@@ comfortable . F@@ ail . Wr@@ ong push . It 's been up@@ load .

... It looks like the words were not rejoined after being split, so what is going wrong?

Raldir commented 4 years ago

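Simply run (s + ' ').replace('@@ ', '').rstrip() on the output string s. Note the space after '@@': the marker and the space that follows it must be removed together so the pieces join into one word.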

skifvideo commented 4 years ago

I think replace('@@ ', '') is the correct way. After all, ' .' at the end looks ugly.

RachitBansal commented 4 years ago

What is the conclusion here, @645709712?

Jeevesh8 commented 4 years ago

Use this function, like so:

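import os
import subprocess

def restore_segmentation(path):
    """
    Take a file segmented with BPE and restore it to its original segmentation.
    """
    assert os.path.isfile(path)
    # Remove "@@ " joins and any trailing "@@" at end of line (GNU sed syntax, edits in place).
    restore_cmd = "sed -i -r 's/(@@ )|(@@ ?$)//g' %s"
    # Escape spaces in the path so the shell treats it as a single argument.
    subprocess.Popen(restore_cmd % path.replace(' ', r'\ '), shell=True).wait()

# output_path is assumed to be the directory holding the translate.py output files.
for f in os.listdir(output_path):
    restore_segmentation(os.path.join(output_path, f))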
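
If GNU sed is not available (the -i and -r flags as written are GNU syntax), the same substitution can probably be done in pure Python. The following is only a sketch: it reuses the regular expression from the sed command above, and the names restore_bpe_line and restore_bpe_file are hypothetical helpers, not part of XLM.

```python
import re
from pathlib import Path

# Same pattern as the sed command: a "@@ " join anywhere in the line,
# or a dangling "@@" (optionally followed by one space) at the end of the line.
_BPE_JOIN = re.compile(r"(@@ )|(@@ ?$)")


def restore_bpe_line(line: str) -> str:
    """Merge the BPE pieces of one line back into whole words."""
    return _BPE_JOIN.sub("", line)


def restore_bpe_file(path: str) -> None:
    """Rewrite a file in place with BPE markers removed (hypothetical helper)."""
    p = Path(path)
    lines = p.read_text(encoding="utf-8").splitlines()
    restored = "".join(restore_bpe_line(line) + "\n" for line in lines)
    p.write_text(restored, encoding="utf-8")


# Sample taken from the output posted earlier in this issue.
sample = "W@@ ait for more money and then fill it up . Un@@ comfortable . F@@ ail ."
print(restore_bpe_line(sample))
# prints: Wait for more money and then fill it up . Uncomfortable . Fail .
```

Only the BPE markers are removed here. The space before the final period comes from tokenization, so a detokenizer would still be needed to get fully natural text.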