AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

Distillation Joint Translate Bug #76

Open harshyadav17 opened 3 months ago

harshyadav17 commented 3 months ago

hey @PranjalChitale

It would be really great if you could add a README for the Distillation branch.

I have set up the environment and installed the dependencies using the README from the previous commit of the Distillation branch.

When I try to run the join_translate.sh file, I get the following error:


Fri May 24 07:57:25 UTC 2024
Applying normalization and script conversion
Input size: 1012
Traceback (most recent call last):
  File "/workspace/research/IndicTrans2/DISTIL/scripts/preprocess_translate.py", line 14, in <module>
    loader.load()
  File "/opt/conda/envs/itdv2/lib/python3.11/site-packages/indicnlp/loader.py", line 28, in load
    indic_scripts.init()
  File "/opt/conda/envs/itdv2/lib/python3.11/site-packages/indicnlp/script/indic_scripts.py", line 105, in init
    ALL_PHONETIC_DATA = pd.read_csv(
                        ^^^^^^^^^^^^
  File "/opt/conda/envs/itdv2/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/itdv2/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/itdv2/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/itdv2/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
    self.handles = get_handle(
                   ^^^^^^^^^^^
  File "/opt/conda/envs/itdv2/lib/python3.11/site-packages/pandas/io/common.py", line 882, in get_handle
    handle = open(handle, ioargs.mode)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/envs/itdv2/lib/python3.11/site-packages/RESOURCES/script/all_script_phonetic_data.csv'

harshyadav17 commented 3 months ago

@PranjalChitale @VarunGumma Also, even if I bypass this issue, fairseq-interactive isn't reading the input file.

Following is the input command:

fairseq-interactive ${ckpt_dir}/final_bin \
    --distributed-world-size 1 --memory-efficient-fp16 \
    --path ${ckpt_dir}/models/checkpoint_best.pt \
    --task translation \
    --source-lang SRC --target-lang TGT \
    --batch-size 256 --buffer-size 2500 --beam 5 \
    --num-workers 24 \
    --skip-invalid-size-inputs-valid-test \
    --input $outfname.bpe > $outfname.log 2>&1

In the command above, --input is given a valid outfname.bpe file, but the logs do not show that file as the input. Below is the cfg.interactive section as logged by the script; its input value should not be '-'.

"interactive":{ "_name":"None", "buffer_size":2500, "input":"-", "force_override_max_positions":"None"}

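For context, "-" is the value fairseq-interactive uses for --input when it reads from standard input, so the logged config suggests the --input argument did not reach the parser. A minimal sketch for checking this from a log, assuming the fragment becomes valid JSON once wrapped in braces, as it does in the form pasted above. The helper name interactive_input is hypothetical.

```python
import json

def interactive_input(cfg_fragment):
    """Return the 'input' value from a logged '"interactive":{...}' fragment."""
    # Assumption: the fragment is JSON-compatible once wrapped in braces.
    cfg = json.loads("{" + cfg_fragment + "}")
    return cfg["interactive"]["input"]

# The fragment as pasted in this thread.
logged = ('"interactive":{ "_name":"None", "buffer_size":2500, '
          '"input":"-", "force_override_max_positions":"None"}')

source = interactive_input(logged)
# "-" means the tool is reading standard input rather than a file.
print("stdin" if source == "-" else source)
```
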
PranjalChitale commented 3 months ago

The issue described above is due to the IndicNLP resources not being installed and the path not being set correctly.

Please refer to this link for guidance.
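
As a minimal sketch of what that setup involves, assuming the indic_nlp_resources repository has been cloned locally: the check below looks for the file named in the traceback, script/all_script_phonetic_data.csv, under a candidate resources directory before pointing the library at it. The helper names and the example directory are hypothetical. common.set_resources_path and loader.load are the calls the Indic NLP Library provides for this.

```python
from pathlib import Path

# File named in the traceback, relative to the resources root.
REQUIRED = ("script/all_script_phonetic_data.csv",)

def missing_resources(root):
    """Return the required files that are absent under root."""
    root = Path(root)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]

def configure_indicnlp(root):
    """Point the Indic NLP Library at root, failing early if files are missing."""
    missing = missing_resources(root)
    if missing:
        raise FileNotFoundError(f"{root} is missing: {missing}")
    # Imported lazily so the check above works without the library installed.
    from indicnlp import common, loader
    common.set_resources_path(str(root))
    loader.load()

# Hypothetical location of a local clone of indic_nlp_resources:
# configure_indicnlp("/workspace/indic_nlp_resources")
```

Running the check before preprocess_translate.py should surface a wrong path immediately, rather than deep inside pandas.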

Additionally, because the preprocessing failed, the outfname.bpe file was not created successfully.

Regarding the README for the distillation branch, it may have been accidentally removed in the previous commit.

An up-to-date README will be added soon.

harshyadav17 commented 3 months ago

@PranjalChitale As mentioned above, I was able to resolve the initial IndicNLP issue. After generating a correct outfname.bpe (I verified the file), I ran into the second issue described above: fairseq-interactive did not pick up the given input.
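
One way to rule out shell line-continuation or quoting problems is to build the argument list explicitly and confirm that --input and its value are present before launching. A minimal sketch, using only the flags from the command quoted earlier. The function name and the example paths are hypothetical.

```python
import os
import shlex

def build_interactive_cmd(ckpt_dir, outfname):
    """Build the fairseq-interactive argv from the command quoted in this thread."""
    input_path = f"{outfname}.bpe"
    return [
        "fairseq-interactive", os.path.join(ckpt_dir, "final_bin"),
        "--distributed-world-size", "1", "--memory-efficient-fp16",
        "--path", os.path.join(ckpt_dir, "models", "checkpoint_best.pt"),
        "--task", "translation",
        "--source-lang", "SRC", "--target-lang", "TGT",
        "--batch-size", "256", "--buffer-size", "2500", "--beam", "5",
        "--num-workers", "24",
        "--skip-invalid-size-inputs-valid-test",
        "--input", input_path,
    ]

cmd = build_interactive_cmd("ckpt", "out/test.SRC")  # hypothetical paths
# Confirm the flag and its value are adjacent in argv before launching.
assert cmd[cmd.index("--input") + 1] == "out/test.SRC.bpe"
print(shlex.join(cmd))
```

Passing the list to subprocess.run with stdout redirected to the log file would then launch the tool without any shell parsing in between.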