Minys233 / Dynaformer

MIT License

Unable to make predictions due to missing script: graph_prediction_with_flag.py #2

Open osession opened 1 year ago

osession commented 1 year ago

Hi, I have been trying to run run_evaluation.sh with the provided checkpoints downloaded and unzipped into the checkpoints directory, but I am running into this error:

evaluate.py: error: argument --task: invalid choice: 'graph_prediction_with_flag' (choose from 'hubert_pretraining', 'denoising', 'multilingual_denoising', 'translation', 'multilingual_translation', 'translation_from_pretrained_bart', 'translation_lev', 'language_modeling', 'speech_to_text', 'legacy_masked_lm', 'text_to_speech', 'speech_to_speech', 'online_backtranslation', 'simul_speech_to_text', 'simul_text_to_text', 'audio_pretraining', 'semisupervised_translation', 'frm_text_to_speech', 'sentence_prediction', 'cross_lingual_lm', 'translation_from_pretrained_xlm', 'multilingual_language_modeling', 'audio_finetuning', 'masked_lm', 'sentence_ranking', 'translation_multi_simple_epoch', 'multilingual_masked_lm', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt')

I can't find the graph_prediction_with_flag.py script anywhere else and was curious if it has just been removed permanently or if there is another way to run predictions?

Thanks!

Minys233 commented 1 year ago

Hi, actually, this 'graph_prediction_with_flag' is a custom registered task in the fairseq framework, located here:

https://github.com/Minys233/dynaformer_model/blob/c9942c389e545a5f43f0834031ce36034cb9b343/dynaformer/tasks/graph_prediction.py#L277-L282

This custom task must be imported at runtime so that it gets registered. The default parameter in the evaluate.sh script defines the location of the 'graph_prediction_with_flag' task:

https://github.com/Minys233/dynaformer_model/blob/c9942c389e545a5f43f0834031ce36034cb9b343/examples/evaluate/evaluate.sh#L27

So the problem is that evaluate.py can't find this custom task. Maybe you are not running the ./run_custom_input.sh command from the Dynaformer (project root) directory, which would make the relative path invalid. Or maybe you are running evaluate.py with fewer parameters.

Please follow the steps in README.md. If the problem still exists, please post the detailed steps here, and I will be happy to take a look :D
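
To illustrate why the error message lists only the built-in tasks, here is a minimal sketch of decorator-based registration. It is a toy model, not fairseq's actual code: `TASK_REGISTRY`, `import_user_module`, `task_is_accepted`, and the `toy-evaluate` program name are hypothetical stand-ins. The point it shows is that a task name becomes a valid `--task` choice only after the module that defines it has been imported.

```python
import argparse

# Toy registry; fairseq keeps its own registry internally, this only mimics the idea.
TASK_REGISTRY = {}


def register_task(name):
    """Decorator that records a task class under `name` when its module is executed."""
    def decorator(cls):
        TASK_REGISTRY[name] = cls
        return cls
    return decorator


@register_task("translation")
class TranslationTask:
    """Stands in for a built-in task that is always registered."""


def import_user_module():
    """Stands in for importing the --user-dir package; running it registers the task."""
    @register_task("graph_prediction_with_flag")
    class GraphPredictionWithFlagTask:
        pass


def task_is_accepted(task_name):
    """Return True if `--task task_name` parses against the currently registered tasks."""
    parser = argparse.ArgumentParser(prog="toy-evaluate")
    parser.add_argument("--task", choices=sorted(TASK_REGISTRY))
    try:
        parser.parse_args(["--task", task_name])
        return True
    except SystemExit:
        # argparse reports "invalid choice" and exits with status 2
        return False
```

In this sketch, `task_is_accepted("graph_prediction_with_flag")` is false until `import_user_module()` has run, which mirrors what happens when the user directory is missing or points to the wrong place.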

osession commented 1 year ago

I am running ./run_evaluate.sh in the home directory. I don't think I am running evaluate.py with fewer parameters, since I have not modified the evaluate.sh file at all. It seems that instead of looking in the file path you showed (https://github.com/Minys233/dynaformer_model/blob/c9942c389e545a5f43f0834031ce36034cb9b343/examples/evaluate/evaluate.sh#L27), it may be looking here: https://github.com/facebookresearch/fairseq/tree/98ebe4f1ada75d006717d84f9d603519d8ff5579/fairseq/tasks

At least those are all the other names of the tasks that are being listed in the error that I'm still getting.

osession commented 1 year ago

I think I figured out the issue. I was getting this error: Dynaformer/examples/evaluate/evaluate.sh: line 25: realpath: command not found. When I removed the realpath command and replaced those lines with the literal file path, it was able to find the graph_prediction.py script. Thank you for your help!!

Minys233 commented 1 year ago

Glad to hear it, and thank you for pointing this out! After some googling and testing, I found that the realpath command is part of coreutils, but it seems to be deprecated in newer versions. The more reliable readlink command should be used instead for the same purpose. I will update README.md and the corresponding scripts soon.

ref: Discussions on Unix & Linux Stack Exchange
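
As a side note on why a missing realpath leads to the "invalid choice" error: when the command inside "$(realpath ...)" is not found, the substitution expands to an empty string, so the script presumably passed an empty user directory and the custom task was never imported. One way to avoid depending on the external command is to resolve paths with the Python standard library. The sketch below assumes nothing about the repository's scripts; `has_realpath_command` and `resolve_path` are hypothetical helper names.

```python
import os
import shutil


def has_realpath_command():
    """Return True if a `realpath` executable is available on PATH."""
    return shutil.which("realpath") is not None


def resolve_path(path):
    """Return the absolute, symlink-resolved form of `path` using only the standard library."""
    return os.path.realpath(path)
```

From a shell script, the same resolution could be done with `python -c 'import os, sys; print(os.path.realpath(sys.argv[1]))' ./dynaformer`, assuming a `python` interpreter is on PATH.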

osession commented 1 year ago

Hello again, I've been trying to run bash Dynaformer/examples/md_pretrain/md_train.sh, but I am running into the same 'invalid choice: graph_prediction_with_flag' error that I had before. This time it still isn't working even after adjusting the realpath command. Sorry to bring this up again!

fairseq-train: error: argument --task: invalid choice: 'graph_prediction_with_flag' (choose from 'translation', 'translation_from_pretrained_xlm', 'denoising', 'multilingual_denoising', 'speech_to_text', 'text_to_speech', 'hubert_pretraining', 'online_backtranslation', 'sentence_prediction', 'speech_to_speech', 'simul_speech_to_text', 'simul_text_to_text', 'audio_pretraining', 'audio_finetuning', 'cross_lingual_lm', 'frm_text_to_speech', 'multilingual_translation', 'translation_from_pretrained_bart', 'semisupervised_translation', 'multilingual_masked_lm', 'translation_multi_simple_epoch', 'language_modeling', 'multilingual_language_modeling', 'translation_lev', 'masked_lm', 'sentence_ranking', 'legacy_masked_lm', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt')
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 96930) of binary: /home/ray/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ray/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
************************************************
  /home/ray/anaconda3/bin/fairseq-train FAILED  
================================================
Root Cause:
[0]:
  time: 2023-07-11_09:24:00
  rank: 0 (local_rank: 0)
  exitcode: 2 (pid: 96930)
  error_file: <N/A>
  msg: "Process failed with exitcode 2"
================================================
Other Failures:
  <NO_OTHER_FAILURES>
************************************************

osession commented 1 year ago

I figured out that the user directory was incorrect, which is why the 'graph_prediction_with_flag' custom task could not be found. I changed line 157 in md_train.sh from --user-dir "$(realpath ./dynaformer)" \ to --user-dir "$(realpath ./Dynaformer/dynaformer)" \.
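
The change above is needed because ./dynaformer is resolved against the current working directory, not against the script's location. A minimal sketch of a location-independent alternative follows; `user_dir_from_script` is a hypothetical helper, and the default `levels_up=2` assumes the layout implied by the paths in this thread (Dynaformer/examples/md_pretrain/md_train.sh next to Dynaformer/dynaformer).

```python
from pathlib import Path


def user_dir_from_script(script_path, levels_up=2, package="dynaformer"):
    """Locate the user directory relative to the script file instead of the working directory.

    parents[0] is the script's folder, so levels_up=2 climbs from
    examples/md_pretrain/ up to the project root.
    """
    root = Path(script_path).resolve().parents[levels_up]
    user_dir = root / package
    if not user_dir.is_dir():
        # Fail early with a clear message rather than a later "invalid choice" error.
        raise FileNotFoundError(f"user directory not found: {user_dir}")
    return user_dir
```

Checking that the directory exists before launching training surfaces a wrong path immediately, instead of as a task-registration error further down the line.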

However, the training is still stopping at this error:

Root at /home/ray/dataset
Loading hybrid data from md-refined2019-5-5-5, general-set-2019-coreset-2016
Downloading https://scientificdata.blob.core.windows.net/dynaformer/dataset/mddata/md-refined2019-5-5-5.zip
Extracting /home/ray/dataset/md-refined2019-5-5-5.zip
Processing...
Loading file: /home/ray/dataset/md-refined2019-5-5-5_train_val.pkl, exists? True
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 54483) of binary: /home/ray/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ray/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
*************************************************
   /home/ray/anaconda3/bin/fairseq-train FAILED  
=================================================
Root Cause:
[0]:
  time: 2023-07-13_09:03:04
  rank: 0 (local_rank: 0)
  exitcode: -9 (pid: 54483)
  error_file: <N/A>
  msg: "Signal 9 (SIGKILL) received by PID 54483"
=================================================
Other Failures:
  <NO_OTHER_FAILURES>
*************************************************

osession commented 1 year ago

I figured out that the solution to the above error was to switch my head node to a type with 122 GB instead of 30 GB of storage, and it seems to be working now.
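
For anyone reading the two failure reports above: a negative exitcode in the launcher summary is the negated number of the signal that ended the process, so -9 means SIGKILL, while the earlier exitcode 2 was an ordinary exit from the argument parser. SIGKILL cannot be caught, which is why the worker left no Python traceback of its own. The log does not say who sent the signal; running out of a machine resource is a common cause, which is consistent with the larger node fixing it. The sketch below decodes such codes and reports free disk space; `describe_exitcode` and `free_gib` are hypothetical helper names, and a POSIX system is assumed.

```python
import shutil
import signal


def describe_exitcode(code):
    """Translate a process exit code into a short description (POSIX convention)."""
    if code < 0:
        # Negative values mean the process was terminated by signal number -code.
        return f"killed by {signal.Signals(-code).name}"
    return f"exited with status {code}"


def free_gib(path="."):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30
```

Here `describe_exitcode(-9)` returns "killed by SIGKILL" and `describe_exitcode(2)` returns "exited with status 2".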