e-bug / pascal

[ACL 2020] Code and data for our paper "Enhancing Machine Translation with Dependency-Aware Self-Attention"
https://www.aclweb.org/anthology/2020.acl-main.147/
MIT License

about bpe_tags_mean.py #2

Closed duterscmy closed 4 years ago

duterscmy commented 4 years ago

When an English sentence contains "-", word segmentation and syntactic analysis treat the "-" differently, so the two resulting lists have different lengths. I see that you call "exit(0)" in that case, but then no correct parse result is produced. Is there a good way to deal with it? Should these sentences just be ignored?

duterscmy commented 4 years ago

Besides "-", there seem to be other cases that lead to a mismatch between the sentence length and the parse length. How do you deal with them? thx!

e-bug commented 4 years ago

It's been a while since I prepared the data, but what I did was try to fix the instances that failed by adding specific rules.
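As an illustration of what such a rule could look like (this is a sketch, not the rules actually used for the paper's data), one option is to regroup the parser's tokens so that each group spells one whitespace-delimited word. The function name `align_by_characters` and the example tokens are hypothetical, and the sketch assumes the parser only splits words into pieces without changing any characters.

```python
def align_by_characters(words, parser_tokens):
    """Group parser token indices so that each group spells one word.

    Returns a list of index lists (one per word), or None when the two
    sequences do not spell the same character string.
    """
    alignment = []
    j = 0  # position of the next unused parser token
    for word in words:
        covered = ""
        indices = []
        # Consume parser tokens until they are at least as long as the word.
        while len(covered) < len(word) and j < len(parser_tokens):
            covered += parser_tokens[j]
            indices.append(j)
            j += 1
        if covered != word:
            return None  # the parser changed characters, not just boundaries
        alignment.append(indices)
    if j != len(parser_tokens):
        return None  # leftover parser tokens that match no word
    return alignment


# Hypothetical example: the parser splits "well-known" around the hyphen.
words = ["a", "well-known", "fact"]
parser_tokens = ["a", "well", "-", "known", "fact"]
groups = align_by_characters(words, parser_tokens)
```

How the tags inside one group are merged into a single tag per word (first token, last token, or something else) is a separate choice that depends on the annotation being used.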

If you don't want to do this, you could:

  1. check that not many sentences are dropped
  2. train the Transformer baseline on the preserved sentences only
  3. train the Transformer baseline on the full corpus
  4. ensure that results from 2. and 3. are consistent
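
For step 1, a minimal sketch of the check could look like the following. It assumes the corpus is already loaded as two parallel lists (token lists and per-token parse tags); `filter_consistent` and the toy data are hypothetical names for illustration, not part of this repository.

```python
def filter_consistent(sentences, parses):
    """Keep only pairs whose token count equals the parse tag count.

    Returns the kept (tokens, tags) pairs and the fraction of dropped pairs.
    """
    kept = []
    dropped = 0
    for tokens, tags in zip(sentences, parses):
        if len(tokens) == len(tags):
            kept.append((tokens, tags))
        else:
            dropped += 1  # skip the sentence instead of exiting
    total = len(kept) + dropped
    drop_rate = dropped / total if total else 0.0
    return kept, drop_rate


# Toy data: the second sentence has one token but two parse tags.
sentences = [["a", "b"], ["c"]]
parses = [[1, 0], [0, 1]]
kept, drop_rate = filter_consistent(sentences, parses)
print(f"kept {len(kept)} sentences, dropped {drop_rate:.1%}")
```

If the reported fraction is small, the kept pairs can be used for the baseline comparison in steps 2 to 4.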

duterscmy commented 4 years ago

thanks a lot~