--replace-unk causes bugs with fairseq-interactive

jm-glowienke commented 3 years ago

🐛 Bug

When using fairseq-interactive to generate translations, the --replace-unk argument causes several bugs:

1. The alignments are given as tuples, but the function apparently expects just a list of indices of the aligned source tokens.
2. When no alignment file is given, the default value '@@ ' of the argument causes the alignment file loader to break.
3. Finally, when the out-of-vocabulary (OOV) word in the hypothesis is also OOV in the source dictionary, you still get an <unk> in your translation. I think that in this case the original input word should be used to replace the <unk> in the translation.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

python fairseq-interactive.py fairseq-data-bin-10752
--path models/transformer_iwslt_de_en_10752-align/checkpoint_best.pt
--beam 5 --source-lang nl
--target-lang ql
--print-alignment --replace-unk
--tokenizer moses

input text: legal name of allianz

Allianz is an OOV word for my task.

For 1.:

Traceback (most recent call last):
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 318, in <module>
    cli_main()
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 314, in cli_main
    distributed_utils.call_main(convert_namespace_to_omegaconf(args), main)
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 267, in main
    hypo_tokens, hypo_str, alignment = utils.post_process_prediction(
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/utils.py", line 246, in post_process_prediction
    hypo_str = replace_unk(hypo_str, src_str, alignment, align_dict,
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/utils.py", line 222, in replace_unk
    src_token = src_tokens[alignment[i]]
TypeError: list indices must be integers or slices, not tuple
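
The TypeError above is consistent with each alignment entry being a pair, while replace_unk indexes the source tokens with alignment[i] directly. A minimal sketch of one possible conversion, assuming the pairs are ordered (source index, target index); the helper name alignment_to_src_indices is hypothetical and not part of fairseq:

```python
from typing import List, Sequence, Tuple


def alignment_to_src_indices(
    alignment: Sequence[Tuple[int, int]], tgt_len: int, default: int = -1
) -> List[int]:
    """Turn (src_idx, tgt_idx) pairs into one source index per target position.

    Hypothetical helper, not part of fairseq. Target positions without an
    alignment keep `default`; if a target position appears more than once,
    the first pair wins.
    """
    src_for_tgt = [default] * tgt_len
    for src_idx, tgt_idx in alignment:
        if 0 <= tgt_idx < tgt_len and src_for_tgt[tgt_idx] == default:
            src_for_tgt[tgt_idx] = src_idx
    return src_for_tgt


# Assumed pair order: (source index, target index).
pairs = [(0, 0), (1, 1), (3, 2), (2, 3)]
indices = alignment_to_src_indices(pairs, tgt_len=5)
```

With a list like this, an expression such as src_tokens[indices[i]] receives an integer, provided the caller first checks that the index is not the default value.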

For 2.:

Traceback (most recent call last):
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 318, in <module>
    cli_main()
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 314, in cli_main
    distributed_utils.call_main(convert_namespace_to_omegaconf(args), main)
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq_cli/interactive.py", line 191, in main
    align_dict = utils.load_align_dict(cfg.generation.replace_unk)
  File "/Users/jan_marcglowienke/Documents/University/Master_Courses/Thesis/10_fairseq/fairseq/utils.py", line 164, in load_align_dict
    with open(replace_unk, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '@@ '
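
The second traceback shows the value '@@ ' being passed to open() as if it were a path. One conceivable guard is sketched below, under the assumption that an empty dictionary is an acceptable stand-in when no alignment file exists. The function name load_align_dict_or_empty is hypothetical, and the file format (one whitespace-separated source/target pair per line) is also an assumption:

```python
import os
from typing import Dict, Optional


def load_align_dict_or_empty(replace_unk: Optional[str]) -> Optional[Dict[str, str]]:
    """Hypothetical guard; not the fairseq implementation.

    Returns None when the option is unset, a word-to-word dictionary when the
    value names an existing file, and an empty dictionary otherwise (for
    example when the value is a placeholder such as '@@ ').
    """
    if replace_unk is None:
        return None
    align_dict: Dict[str, str] = {}
    if os.path.isfile(replace_unk):
        with open(replace_unk, "r", encoding="utf-8") as f:
            for line in f:
                cols = line.split()
                if len(cols) >= 2:  # skip blank or malformed lines
                    align_dict[cols[0]] = cols[1]
    return align_dict


# '@@ ' is not an existing file here, so the result is an empty dictionary.
no_file = load_align_dict_or_empty("@@ ")
```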

Expected behavior

Replace the <unk> in the hypothesis with the corresponding word in the input according to the alignments. This should also be possible without an alignment dictionary.
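
The expected behavior could look roughly like the sketch below. It assumes the raw (not dictionary-encoded) source tokens are available and that there is one source index per hypothesis token. The function name replace_unk_with_source is hypothetical, and the hypothesis tokens are made up for illustration:

```python
from typing import Dict, List, Optional


def replace_unk_with_source(
    hypo_tokens: List[str],
    raw_src_tokens: List[str],
    src_indices: List[int],
    align_dict: Optional[Dict[str, str]] = None,
    unk: str = "<unk>",
) -> List[str]:
    """Hypothetical sketch; not the fairseq implementation."""
    align_dict = align_dict or {}
    out = []
    for i, tok in enumerate(hypo_tokens):
        src_idx = src_indices[i] if i < len(src_indices) else -1
        if tok == unk and 0 <= src_idx < len(raw_src_tokens):
            src_tok = raw_src_tokens[src_idx]
            # Use the dictionary entry if there is one, else copy the source word.
            tok = align_dict.get(src_tok, src_tok)
        out.append(tok)
    return out


src = "legal name of allianz".split()
hypo = ["legal", "name", "of", "<unk>"]  # made-up hypothesis for illustration
result = replace_unk_with_source(hypo, src, [0, 1, 2, 3])
```

Because the lookup falls back to the source word itself, the same function works with or without an alignment dictionary.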

I made a fix for 1., found a workaround for 2., and added some code to implement the feature described in 3.

I can provide a PR if desired.

Environment

fairseq Version (e.g., 1.0 or master): master, '1.0.0a0+2429317'
PyTorch Version (e.g., 1.0): 1.8.1
OS (e.g., Linux): MacOS 11.2.3
How you installed fairseq (pip, source): CFLAGS="-stdlib=libc++" pip install --editable ./
Build command you used (if compiling from source):
Python version: 3.8.8
CUDA/cuDNN version: -
GPU models and configuration: -
Any other relevant information:

Additional context

xihajun commented 2 years ago

the original input is used to replace the <unk> in the translation.

Hi @jm-glowienke, may I ask whether you fixed the issue? It still does not seem to work for the case where the original input should be used to replace the <unk> in the translation.

arkanto99 commented 2 years ago

Hi @jm-glowienke I would also like to know if there is any solution to this issue

xihajun commented 2 years ago

Hi @jm-glowienke I would also like to know if there is any solution to this issue

Are you also applying this to a transformer model?

This forum thread explains a bit about why their -replace-unk option does not work for the transformer model: https://forum.opennmt.net/t/translate-py-with-replace-unk-option-and-the-transformer-model/2646

It might be helpful.

[Update on Dec 04, 2022] My task was spelling correction, and I was trying to map all the special characters to <unk> so that they would be skipped. I used an alternative way to achieve that:

1. Replace all the special characters (e.g., 0-9) with <unk> in the paired data (this may also work for names and other words).
2. Train the model.
3. Replace them back in order.
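
The masking and restoring steps above can be sketched as follows. This assumes the placeholder is the literal string <unk> and that only the digits 0-9 are masked; the function names mask_special and restore_special are hypothetical:

```python
import re
from typing import List, Tuple

PLACEHOLDER = "<unk>"           # assumption: the placeholder token
SPECIAL = re.compile(r"[0-9]")  # assumption: only digits are masked


def mask_special(text: str) -> Tuple[str, List[str]]:
    """Replace each special character with the placeholder and remember the originals."""
    originals = SPECIAL.findall(text)
    return SPECIAL.sub(PLACEHOLDER, text), originals


def restore_special(text: str, originals: List[str]) -> str:
    """Put the remembered characters back, left to right.

    Surplus placeholders (more than were masked) are left untouched.
    """
    remaining = iter(originals)
    return re.sub(
        re.escape(PLACEHOLDER),
        lambda m: next(remaining, m.group(0)),
        text,
    )


masked, saved = mask_special("room 42 at 9")
restored = restore_special(masked, saved)
```

Restoring in order only gives the right result if the model keeps the placeholders in the same order as in the input, which seems plausible for spelling correction but is not guaranteed in general.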

xihajun commented 2 years ago

In this commit, they tried to add the --replace-unk feature, but I am not sure whether we have to go back to that version: https://github.com/facebookresearch/fairseq/commit/4815ed4d5e2b50fe85573a045fdf486ff8e64a58

jm-glowienke commented 1 year ago

Hi, I found a solution for the problems described in the issue. It can be found on my personal fork of fairseq: https://github.com/jm-glowienke/fairseq. Unfortunately, I cannot help you any further, as I only worked on this for my thesis almost 2 years ago.
