LT1st opened this issue 8 months ago
We have checked that following the instructions in the README for installing dependencies and preparing the dataset does not result in such an error. You need to ensure that, when modifying the code, essential parameter update steps are not removed. For example, removing the line ddp_logger.update(**computed_result) in train.py will reproduce your error.
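For reference, here is a minimal, self-contained sketch of why dropping that update call ends in the ZeroDivisionError. The SmoothedValue class below only mimics the meter in depth/utils.py whose global_avg shows up in the traceback; its exact structure is an assumption, not the repository's code.

```python
class SmoothedValue:
    """Toy stand-in for the meter in depth/utils.py (assumed structure)."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        self.total += value * n
        self.count += n

    @property
    def global_avg(self):
        # Matches the failing line in the traceback: divides by zero
        # if update() was never called.
        return self.total / self.count


meter = SmoothedValue()

# If ddp_logger.update(**computed_result) is removed from validate(),
# no meter is ever updated, so reading global_avg divides by zero:
try:
    print(meter.global_avg)
except ZeroDivisionError:
    print("ZeroDivisionError: float division by zero")

# With the update call kept, the average is computed normally:
meter.update(0.123)
print(meter.global_avg)  # 0.123
```

So if you hit this during validation, check that your modified train.py still calls ddp_logger.update(**computed_result) inside the validation loop before the meters are read out.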
Thank you for your suggestion, it worked for me!
I have another question regarding the application of this method. Can it be used for image-to-image translation? If so, which parts should I modify? Currently I am using the depth estimation pipeline with the depth upper bound removed, but I am not sure whether there is anything else I should change.
Thank you for your advice. Best wishes!
I faced some errors in the validation part.
Traceback (most recent call last):
  File "train.py", line 432, in <module>
    main()
  File "train.py", line 173, in main
    results_dict, loss_val = validate(val_loader, model, criterion_d,
  File "train.py", line 424, in validate
    result_metrics[key] = ddp_logger.meters[key].global_avg
  File "/home/spai/code/SD/meta-prompts/depth/utils.py", line 68, in global_avg
    return self.total / self.count
ZeroDivisionError: float division by zero
Traceback (most recent call last):
  File "train.py", line 432, in <module>
    main()
  File "train.py", line 173, in main
    results_dict, loss_val = validate(val_loader, model, criterion_d,
  File "train.py", line 424, in validate
    result_metrics[key] = ddp_logger.meters[key].global_avg
  File "/home/spai/code/SD/meta-prompts/depth/utils.py", line 68, in global_avg
    return self.total / self.count
ZeroDivisionError: float division by zero
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1056935) of binary: /home/spai/anaconda3/envs/metap/bin/python3
Traceback (most recent call last):
  File "/home/spai/anaconda3/envs/metap/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/spai/anaconda3/envs/metap/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/spai/anaconda3/envs/metap/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-06_20:47:35
  host      : spai-WS-E900-G4-WS980T
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1056936)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-06_20:47:35
  host      : spai-WS-E900-G4-WS980T
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1056935)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Can you tell me how to solve it?
I found a similar error. Could you tell me how you solved it? Thank you!