Hi! Thanks for your brilliant work!
After configuring the environment from requirements.txt, I prepared the pretraining and finetuning .tsv files.
When I executed pretrain_polyformer_b.sh, the log in the corresponding directory shows:
train.py: error: unrecognized arguments: --warmup-ratio=0.06
Besides, evaluation failed when I executed app.py after downloading polyformer_b_refcoco.pt to ./weights:
2023-10-18 13:26:44 | INFO | ofa.evaluate | loading model(s) from ../../weights/polyformer_b_refcoco.pt
Traceback (most recent call last):
  File "../../evaluate.py", line 185, in <module>
    cli_main()
  File "../../evaluate.py", line 180, in cli_main
    vis_dir=args.vis_dir, vis=args.vis, result_dir=args.result_dir
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/fairseq/distributed/utils.py", line 354, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "../../evaluate.py", line 81, in main
    num_shards=cfg.checkpoint.checkpoint_shard_count,
  File "/data/zdr/polygon-transformer/utils/checkpoint_utils.py", line 436, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File "/data/zdr/polygon-transformer/utils/checkpoint_utils.py", line 330, in load_checkpoint_to_cpu
    state = torch.load(f, map_location=torch.device("cpu"))
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/serialization.py", line 600, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/serialization.py", line 242, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 50262) of binary: /opt/anaconda3/envs/zdr_polyformer/bin/python3
Traceback (most recent call last):
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
../../evaluate.py FAILED
Failures:
Root Cause (first observed failure):
[0]:
  time      : 2023-10-18_13:26:49
  host      : amax
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 50262)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
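The `failed finding central directory` error usually means that the file starts like a zip archive but is incomplete, for example because the download was interrupted. That is my assumption, not something I have confirmed for this checkpoint. A minimal sketch of a check using only the Python standard library is below; the helper name `describe_checkpoint` is made up for illustration and is not part of the repository.

```python
import os
import zipfile


def describe_checkpoint(path):
    """Return a short status string for a checkpoint file (hypothetical helper)."""
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    # Checkpoints written by recent torch.save are zip archives; a truncated
    # download loses the end-of-archive record that is_zipfile looks for.
    if zipfile.is_zipfile(path):
        return "zip-ok"
    return "not-a-complete-zip"


if __name__ == "__main__":
    # Path taken from the log line above.
    path = "../../weights/polyformer_b_refcoco.pt"
    print(path, describe_checkpoint(path))
```

If this reports `not-a-complete-zip`, comparing the file size on disk with the size shown on the download page and downloading the file again would be the next thing to try.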
Could you share the exact torch version you used? My environment is torch 1.10.0+cu111.
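To make the comparison easier, here is a small sketch that prints the versions installed in an environment. The package list (`torch`, `torchvision`, `fairseq`) is my guess at what matters here, and `report_versions` is a hypothetical helper, not something from the repository.

```python
import importlib


def report_versions(names):
    """Map each package name to its __version__, or None if it cannot be imported."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Not installed, or installed but failing at import time.
            versions[name] = None
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in report_versions(["torch", "torchvision", "fairseq"]).items():
        print(name, version if version is not None else "not importable")
```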