Hi, I am facing the problem below when running this command:

python -m torch.distributed.launch --nproc_per_node=2 run.py --data_root data --batch_size 12 --dataset ade --name LWF --task 100-50 --step 0 --lr 0.01 --epochs 60 --method LWF

/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

  FutureWarning,
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

INFO:rank1: Device: cuda:1
Traceback (most recent call last):
  File "run.py", line 390, in <module>
    main(opts)
  File "run.py", line 116, in main
    logger = Logger(logdir_full, rank=rank, debug=opts.debug, summary=opts.visualize, step=opts.step)
  File "/home/cuong69/Desktop/MiB-master/utils/logger.py", line 15, in __init__
    import tensorboardX
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/__init__.py", line 5, in <module>
    from .torchvis import TorchVis
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/torchvis.py", line 11, in <module>
    from .writer import SummaryWriter
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/writer.py", line 15, in <module>
    from .event_file_writer import EventFileWriter
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/event_file_writer.py", line 28, in <module>
    from .proto import event_pb2
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/proto/event_pb2.py", line 15, in <module>
    from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/proto/summary_pb2.py", line 15, in <module>
    from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/proto/tensor_pb2.py", line 15, in <module>
    from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/tensorboardX/proto/resource_handle_pb2.py", line 22, in <module>
    serialized_pb=_b('\n(tensorboardX/proto/resource_handle.proto\x12\x0ctensorboardX\"r\n\x13ResourceHandleProto\x12\x0e\n\x06\x64\x65vice\x18\x01 \x01(\t\x12\x11\n\tcontainer\x18\x02 \x01(\t\x12\x0c\n\x04name\x18\x03 \x01(\t\x12\x11\n\thash_code\x18\x04 \x01(\x04\x12\x17\n\x0fmaybe_type_name\x18\x05 \x01(\tB/\n\x18org.tensorflow.frameworkB\x0eResourceHandleP\x01\xf8\x01\x01\x62\x06proto3')
TypeError: __new__() got an unexpected keyword argument 'serialized_options'
Filtering images... 0/2000 ...
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1651457 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1651456) of binary: /home/cuong69/anaconda3/envs/plop/bin/python
Traceback (most recent call last):
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/cuong69/anaconda3/envs/plop/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2022-06-15_09:57:39
  host : aaa-Z490-AORUS-MASTER
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 1651456)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I think this is related to a version conflict. My GPU is an RTX 3090, so I must use CUDA 11.3. Please help me solve this problem. Thank you!
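
To narrow down the suspected version conflict, a minimal sketch like the one below prints the installed versions of the packages that appear in the traceback. It assumes the mismatch is between `tensorboardX` and the `protobuf` runtime (a `TypeError` about `serialized_options` is commonly reported when generated `*_pb2.py` files expect a newer `protobuf` than the one installed, but that is not confirmed for this environment). `installed_versions` is a hypothetical helper name, not part of any library.

```python
def installed_versions(packages):
    """Return {package name: version string, or None if not installed}.

    Hypothetical helper for illustration only.
    """
    try:
        # Standard library on Python 3.8+
        from importlib.metadata import PackageNotFoundError, version

        def lookup(name):
            try:
                return version(name)
            except PackageNotFoundError:
                return None
    except ImportError:
        # Fallback for older interpreters such as the Python 3.6 env in the log
        import pkg_resources

        def lookup(name):
            try:
                return pkg_resources.get_distribution(name).version
            except pkg_resources.DistributionNotFound:
                return None

    return {name: lookup(name) for name in packages}


if __name__ == "__main__":
    # Packages taken from the traceback; protobuf is the assumed culprit.
    for name, ver in installed_versions(["torch", "tensorboardX", "protobuf"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this inside the same conda environment as `run.py` shows which versions are actually being imported, which can then be compared against the versions the repository expects.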

Can not implement run.py #62