Amazing-J / structural-transformer

Code corresponding to our paper "Modeling Graph Structure in Transformer for Better AMR-to-Text Generation" in EMNLP-IJCNLP-2019

Preprocessing #3

Open QAQ-v opened 5 years ago

QAQ-v commented 5 years ago

Hi,

Could you please release the preprocessing codes for generating the structural sequence and the commands for applying bpe? i.e., how to get the files in corpus_sample/all_path_corpus and corpus_sample/five_path_corpus.

Thanks.

Amazing-J commented 5 years ago

Python has an [anytree](https://pypi.org/project/anytree/2.1.4/) package. You can try it.
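
For what it's worth, a minimal sketch of how anytree's `Walker` can read off a path between two concept nodes; the toy AMR, the `edge` attribute, and the output format below are my own illustration, not the repository's actual preprocessing:

```python
# Toy example: (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))
# The edge labels are stored as a custom "edge" attribute on each child node.
from anytree import Node, Walker

want = Node("want-01")
boy = Node("boy", parent=want, edge=":ARG0")
go = Node("go-01", parent=want, edge=":ARG1")

def relation_path(a, b):
    """Join the edge labels seen when walking from node a up to the common
    ancestor (reversed edges) and back down to node b."""
    up, top, down = Walker().walk(a, b)
    labels = [n.edge + "-of" for n in up]   # edges traversed upwards, reversed
    labels += [n.edge for n in down]        # edges traversed downwards
    return " ".join(labels)

print(relation_path(boy, go))  # -> ":ARG0-of :ARG1"
```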

QAQ-v commented 5 years ago

> Python has an [anytree](https://pypi.org/project/anytree/2.1.4/) package. You can try it.

Thanks for your reply! I am still confused about how to get the structural sequence; releasing the preprocessing code or the preprocessed data would be a better way to help people run your model.

Meanwhile, there is another question. I trained the Transformer baseline implemented in OpenNMT with the same hyperparameter settings as yours on LDC2015E86. When I compute BLEU on the BPE-segmented predictions I get a result comparable to Table 3 of your paper (25.5), but after I remove the "@@" from the predictions the BLEU drops a lot. So I am wondering: were the BLEU results reported in Table 3 computed on the BPE-segmented predictions? Did you remove the "@@" from the final predictions of the model?

Amazing-J commented 5 years ago

After deleting "@@ ", the BLEU score should not drop; it should rise a lot. Are you sure you are doing the BPE post-processing correctly? Note that not only "@@" but also the space after it must be deleted (i.e., "@@ "). The target side needs nothing but tokenization (use the PTB tokenizer).
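
As a side note, a small Python illustration of exactly that post-processing, equivalent to the sed command quoted below; this is standard subword-nmt cleanup, not code from this repository:

```python
import re

def undo_bpe(line: str) -> str:
    """Remove BPE continuation markers: the '@@ ' pair (marker plus its
    trailing space) anywhere in the line, and a bare '@@' at line end."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)

print(undo_bpe("struc@@ tural trans@@ former"))  # -> "structural transformer"
```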

QAQ-v commented 5 years ago

> After deleting "@@ ", the BLEU score should not drop; it should rise a lot. Are you sure you are doing the BPE post-processing correctly? Note that not only "@@" but also the space after it must be deleted (i.e., "@@ "). The target side needs nothing but tokenization (use the PTB tokenizer).

Thanks for your reply!

I followed the subword-nmt author's instructions to delete "@@ " (sed -r 's/(@@ )|(@@ ?$)//g'), so there shouldn't be any mistake there. So you mean you only apply BPE on the source side, and do not apply BPE on the target side? But then the source and target sides do not share the same subword vocabulary; do you still share the vocabulary in the model? Could you please release your BPE commands? That would be more efficient and clearer.

Amazing-J commented 5 years ago

What I mean is that both the source and target sides need BPE during training, while the target side does not need BPE during testing.
BPE is a commonly used method in machine translation; there is no special code for it.

QAQ-v commented 5 years ago

> What I mean is that both the source and target sides need BPE during training, while the target side does not need BPE during testing. BPE is a commonly used method in machine translation; there is no special code for it.

Thanks for your patient reply!

I am still a little confused. So you only apply BPE on the training set and do not apply BPE on the test set at all, is that right? Or do you apply BPE on the source side of the test set but not on the target side of the test set?

Amazing-J commented 5 years ago

Yes. During testing, only the source side needs BPE; BLEU is then computed after deleting "@@ ".
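
To make that flow concrete, here is a sketch of the test-time steps as I understand them, using subword-nmt's Python API; the file names are placeholders and this is not the authors' actual script:

```python
import re
from subword_nmt.apply_bpe import BPE

# Load the BPE codes learned on the training data (placeholder file name).
with open("bpe.codes") as codes:
    bpe = BPE(codes)

# 1) Apply BPE to the test source side only.
with open("test.source") as src_in, open("test.source.bpe", "w") as src_out:
    for line in src_in:
        src_out.write(bpe.process_line(line))

# 2) ... run the trained model on test.source.bpe to get pred.bpe.txt ...

# 3) Delete the "@@ " markers from the model output before scoring.
with open("pred.bpe.txt") as pred_in, open("pred.txt", "w") as pred_out:
    for line in pred_in:
        pred_out.write(re.sub(r"(@@ )|(@@ ?$)", "", line))

# 4) Compute BLEU between pred.txt and the plain PTB-tokenized references.
```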

QAQ-v commented 5 years ago

> Yes. During testing, only the source side needs BPE; BLEU is then computed after deleting "@@ ".

Got it :). I will give it a try, thanks!

QAQ-v commented 5 years ago

> Yes. During testing, only the source side needs BPE; BLEU is then computed after deleting "@@ ".

Sorry for bothering you again: what should {num_operations} be set to in the following command? The default value of 10000?

subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}

Amazing-J commented 5 years ago

On LDC2015E86: 10000. On LDC2017T10: 20000. train_file: the concatenation of train_source and train_target (cat train_source train_target).
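
My reading of this, as a sketch with placeholder file names, using subword-nmt's Python API rather than the authors' actual commands:

```python
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Concatenate the source and target training files (placeholder names),
# mirroring "cat train_source train_target > train_joint".
with open("train_joint", "w") as joint:
    for name in ("train.source", "train.target"):
        with open(name) as part:
            joint.write(part.read())

# Learn the merge operations on the joint file:
# 10000 for LDC2015E86, 20000 for LDC2017T10.
with open("train_joint") as infile, open("bpe.codes", "w") as outfile:
    learn_bpe(infile, outfile, 10000)

# Apply the shared codes to both training sides.
with open("bpe.codes") as codes:
    bpe = BPE(codes)
for name in ("train.source", "train.target"):
    with open(name) as fin, open(name + ".bpe", "w") as fout:
        for line in fin:
            fout.write(bpe.process_line(line))
```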

QAQ-v commented 5 years ago

> train_source + train_target

So you followed the instructions in BEST PRACTICE ADVICE FOR BYTE PAIR ENCODING IN NMT (in the subword-nmt README), right?

If so, did you keep --vocabulary-threshold at 50?

Amazing-J commented 5 years ago

You only need to use these two commands:

subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}

subword-nmt apply-bpe -c {codes_file} < {test_file} > {out_file}


Bobby-Hua commented 2 years ago

@Amazing-J Hi! I have the same question regarding generating the structural sequences. Can you provide more insight on how to use [anytree](https://pypi.org/project/anytree/2.1.4/) to get corpus_sample/all_path_corpus and corpus_sample/five_path_corpus? Any example preprocessing code would be much appreciated!