amazon-science / polygon-transformer

Apache License 2.0

Errors in executing Pretraining & Evaluation #17

Closed zebbsb closed 4 months ago

zebbsb commented 12 months ago

Hi! Thanks for your brilliant work!

  1. After configuring the environment from requirements.txt, I prepared the pretraining and finetuning .tsv files. When I executed pretrain_polyformer_b.sh, the log in the corresponding directory shows: train.py: error: unrecognized arguments: --warmup-ratio=0.06.

  2. Besides, evaluation failed when I executed app.py after downloading polyformer_b_refcoco.pt to ./weights.

2023-10-18 13:26:44 | INFO | ofa.evaluate | loading model(s) from ../../weights/polyformer_b_refcoco.pt
Traceback (most recent call last):
  File "../../evaluate.py", line 185, in <module>
    cli_main()
  File "../../evaluate.py", line 180, in cli_main
    vis_dir=args.vis_dir, vis=args.vis, result_dir=args.result_dir
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/fairseq/distributed/utils.py", line 354, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "../../evaluate.py", line 81, in main
    num_shards=cfg.checkpoint.checkpoint_shard_count,
  File "/data/zdr/polygon-transformer/utils/checkpoint_utils.py", line 436, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File "/data/zdr/polygon-transformer/utils/checkpoint_utils.py", line 330, in load_checkpoint_to_cpu
    state = torch.load(f, map_location=torch.device("cpu"))
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/serialization.py", line 600, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/serialization.py", line 242, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 50262) of binary: /opt/anaconda3/envs/zdr_polyformer/bin/python3
Traceback (most recent call last):
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/anaconda3/envs/zdr_polyformer/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

../../evaluate.py FAILED
Failures:

Root Cause (first observed failure):
[0]:
  time : 2023-10-18_13:26:49
  host : amax
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 50262)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Could you tell me the exact torch version you used? My environment is torch 1.10.0+cu111.

joellliu commented 12 months ago

Hi, can you try torch 1.13.1+cu117?
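
Separately from the torch version question, the `failed finding central directory` message in the traceback is what `torch.load` reports when the file it opens lacks a readable zip central directory, which can happen when a download was interrupted. Below is a minimal sketch of a pre-load sanity check using only the Python standard library. The helper name `looks_like_zip_checkpoint` is hypothetical and not part of this repository; the path is the one from the log above.

```python
import os
import zipfile


def looks_like_zip_checkpoint(path):
    """Return True if `path` is a file with a readable zip end-of-archive record.

    Hypothetical helper for illustration. Checkpoints written in the older
    non-zip torch format would also return False here, so a False result is
    a hint to re-check the download, not proof of corruption.
    """
    return os.path.isfile(path) and zipfile.is_zipfile(path)


if __name__ == "__main__":
    # Path taken from the log in this issue; adjust to your own layout.
    ckpt = "../../weights/polyformer_b_refcoco.pt"
    if looks_like_zip_checkpoint(ckpt):
        print("checkpoint looks like a complete zip archive")
    else:
        print("checkpoint is missing or not a complete zip archive; "
              "consider re-downloading it")
```

If the check fails, comparing the local file size against the size listed at the download source is a reasonable next step before changing the environment.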