dreasysnail / POINTER

MIT License

Hi there, I have some questions and suggestions. #17

Open wdyxwzyh opened 3 years ago

wdyxwzyh commented 3 years ago

Thanks for your code =.=

  1. Q: The model suffers from error accumulation. I suggest adding a '[DEL]' token so the model learns when to delete which words, similar to the NAT method Levenshtein Transformer. The difference is that Levenshtein performs [where to insert / insert / delete] in separate steps, while yours could do all of them in one step. One step, however, is unstable: you can get a result where an insertion and a deletion both happen in the same step. In my tests, adding '[DEL]' decreased PPL by about 10 points.
  2. Q: Lack of knowledge. This shows up when I constrain the training data to metaphor or parallelism data: compared to GPT-2, the output lacks strong internal logic and tends to generate words like [no, can't, doesn't, etc.] that completely change the meaning of the sentence, so different parts of a sentence end up with conflicting meanings. I don't know how to fix it. Maybe something like knowledge-bert? Or maybe this is a disadvantage of NAT methods compared to AR models like GPT-2 that can't be solved, because of the unstable generation pattern?
  3. Your inference code in greedy_search may be too slow when running on a batch of data. I suggest a torch-mask version: (1) get the indices to mask, and (2) use torch.scatter or torch.masked_fill etc. to run inference on a whole batch. In a few days I'll open a pull request; please check it.
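The batched update proposed in point 3 could look like the following sketch (PyTorch). Note that `batched_greedy_step` and its tensor shapes are hypothetical illustrations, not the repo's actual `greedy_search` API: the idea is simply to replace a Python loop over examples with one `argmax` plus one `scatter` call per generation step.

```python
import torch

def batched_greedy_step(tokens, slot_scores, predicted_ids):
    """One greedy insertion step for a whole batch at once (hypothetical sketch).

    tokens:        (B, T) current partial sequences
    slot_scores:   (B, T) model score for writing into each slot
    predicted_ids: (B,)   token id the model would insert, per example
    """
    # Step 1: per example, pick the slot with the highest score.
    positions = slot_scores.argmax(dim=1, keepdim=True)          # (B, 1)
    # Step 2: write all predicted ids in a single scatter call,
    # instead of looping over the batch in Python.
    return tokens.scatter(1, positions, predicted_ids.unsqueeze(1))

# Toy demo with a batch of two sequences:
tokens = torch.tensor([[1, 0, 2], [3, 4, 0]])
scores = torch.tensor([[0.1, 0.9, 0.0], [0.0, 0.1, 0.8]])
preds = torch.tensor([7, 8])
out = batched_greedy_step(tokens, scores, preds)
# out fills slot 1 of the first example and slot 2 of the second.
```

The same pattern extends to a '[DEL]' token: build a boolean mask of positions predicted as deletions and clear them in one `masked_fill` call.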
dreasysnail commented 3 years ago

Hi @wdyxwzyh, thanks for your questions and suggestions! They are really helpful.

  1. Great suggestion. For this paper we are testing the insertion-only scenario, which naturally fits the insertion transformer framework. However, I agree with you that incorporating '[DEL]' in the token set would be helpful; we have tried similar things, but they did not lead to significant performance improvements. It's great that you obtained a PPL drop by adding the '[DEL]' token!
  2. Another great point, thanks for your insights on this. We believe what you mentioned is true, and this direction still has many challenges. The reason might be that NAT methods typically have a weaker dependency structure compared to their autoregressive counterparts. We are trying out solutions to make the model more knowledge-grounded and semantically consistent. Would love to learn more if you have any additional thoughts.
  3. We realize the code implementation is not optimized; we haven't been able to address that yet, and we would be really grateful if you want to contribute and accelerate the generation code. Once you open a PR I will check it out and make sure the credit for this major update goes to you.