epic-kitchens / epic-sounds-annotations

Splits for epic-sounds dataset

Training problem #8

Closed haoshuai714 closed 1 year ago

haoshuai714 commented 1 year ago

When I train with this code, I get the following error:

[03/23 06:27:05][DEBUG] connectionpool.py-1003: Starting new HTTPS connection (1): o151352.ingest.sentry.io:443
[03/23 06:27:06][DEBUG] retry.py-594: Incremented Retry for (url='/api/5288891/envelope/'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
[03/23 06:27:06][WARNING] connectionpool.py-812: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /api/5288891/envelope/
[03/23 06:27:06][DEBUG] connectionpool.py-1003: Starting new HTTPS connection (2): o151352.ingest.sentry.io:443
[03/23 06:27:07][DEBUG] retry.py-594: Incremented Retry for (url='/api/5288891/envelope/'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
[03/23 06:27:07][WARNING] connectionpool.py-812: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /api/5288891/envelope/
[03/23 06:27:07][DEBUG] connectionpool.py-1003: Starting new HTTPS connection (3): o151352.ingest.sentry.io:443
[03/23 06:27:08][DEBUG] retry.py-594: Incremented Retry for (url='/api/5288891/envelope/'): Retry(total=0, connect=None, read=None, redirect=None, status=None)
[03/23 06:27:08][WARNING] connectionpool.py-812: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)'))': /api/5288891/envelope/
[03/23 06:27:08][DEBUG] connectionpool.py-1003: Starting new HTTPS connection (4): o151352.ingest.sentry.io:443
Traceback (most recent call last):
  File "/data/EPIC/audio_model/tools/run_net.py", line 31, in <module>
    main()
  File "/data/EPIC/audio_model/tools/run_net.py", line 23, in main
    launch_job(cfg=cfg, init_method=args.init_method, func=train)
  File "/data/EPIC/audio_model/slowfast/utils/misc.py", line 244, in launch_job
    func(cfg=cfg)
  File "/data/EPIC/audio_model/tools/train_net.py", line 383, in train
    train_epoch(
  File "/data/EPIC/audio_model/tools/train_net.py", line 80, in train_epoch
    preds = model(inputs)
  File "/common-data/yifan.yang/environments/BEVFusion/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/EPIC/audio_model/slowfast/models/auditory_slowfast.py", line 316, in forward
    x = self.s1(x)
  File "/common-data/yifan.yang/environments/BEVFusion/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/EPIC/audio_model/slowfast/models/helpers/stem_helper_2d.py", line 99, in forward
    x[pathway] = m(x[pathway])
  File "/common-data/yifan.yang/environments/BEVFusion/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/EPIC/audio_model/slowfast/models/helpers/stem_helper_2d.py", line 181, in forward
    x = self.bn(x)
  File "/common-data/yifan.yang/environments/BEVFusion/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/common-data/yifan.yang/environments/BEVFusion/anaconda3/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 732, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/common-data/yifan.yang/environments/BEVFusion/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 845, in get_world_size
    return _get_group_size(group)
  File "/common-data/yifan.yang/environments/BEVFusion/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 306, in _get_group_size
    default_pg = _get_default_group()
  File "/common-data/yifan.yang/environments/BEVFusion/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 410, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /data/EPIC/audio_model/wandb/offline-run-20230323_062605-nsoqhuto
wandb: Find logs at: ./wandb/offline-run-20230323_062605-nsoqhuto/logs

JacobChalk commented 1 year ago

This is related to the Weights and Biases library (W&B). For online runs, W&B requires an internet connection on the node/machine you are running the code on. If that is not possible and you still wish to track the run with W&B, prefix the training command with the environment variable WANDB_MODE=offline (i.e. WANDB_MODE=offline python ...) to run in offline mode. The offline run can then be synced to your W&B account with wandb sync --sync-all.
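
For anyone launching training from a Python wrapper rather than a shell, the offline route above could be sketched roughly as follows. This is only an illustration: the helper names offline_env and run_offline are hypothetical, and the command you pass in is whatever you already use to start training.

```python
import os
import subprocess


def offline_env(base=None):
    """Return a copy of the environment with W&B switched to offline mode."""
    env = dict(os.environ if base is None else base)
    env["WANDB_MODE"] = "offline"  # environment variable read by W&B
    return env


def run_offline(cmd):
    """Run a training command (list of strings) with W&B offline.

    Hypothetical helper: `cmd` is your usual training invocation.
    """
    return subprocess.run(cmd, env=offline_env(), check=True)


# Afterwards, from a machine with internet access, the stored runs
# can be uploaded with:  wandb sync --sync-all
```

The environment is copied rather than modified in place, so the offline setting only affects the launched training process.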

Alternatively, if you don't require tracking the runs with W&B, you can disable this feature entirely by updating the config with WANDB.ENABLE False, either as an additional command line argument, or by editing the .yaml config files.
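
To illustrate what a KEY VALUE override such as WANDB.ENABLE False does, here is a rough stand-in using plain dictionaries. It is not the repository's config loader (which is not shown in this thread); apply_overrides and the example config contents are assumptions made for the sketch.

```python
import ast


def apply_overrides(cfg, opts):
    """Apply ["A.B", "value", ...] pairs to a nested dict, in place."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    for key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = key.split(".")
        node = cfg
        for name in parents:
            node = node[name]  # KeyError if the section does not exist
        if leaf not in node:
            raise KeyError(key)
        try:
            node[leaf] = ast.literal_eval(raw)  # "False" -> False, "4" -> 4
        except (ValueError, SyntaxError):
            node[leaf] = raw  # keep plain strings such as paths
    return cfg


# Hypothetical config with W&B tracking switched on.
cfg = {"WANDB": {"ENABLE": True}, "NUM_GPUS": 1}
apply_overrides(cfg, ["WANDB.ENABLE", "False"])
print(cfg["WANDB"]["ENABLE"])  # False
```

Assuming the dotted key maps to a nested section, the equivalent edit in a .yaml config would be setting ENABLE to False under the WANDB section.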

NOTE: We have just updated the configs to disable W&B by default.