RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8

Some weights of the model checkpoint at /home/sunpeng/AXJ/MRC/bert/bert-base-uncased were not used when initializing BertQueryNER: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']

This IS expected if you are initializing BertQueryNER from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing BertQueryNER from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertQueryNER were not initialized from the model checkpoint at /home/sunpeng/AXJ/MRC/bert/bert-base-uncased and are newly initialized: ['span_embedding.classifier1.weight', 'end_outputs.bias', 'span_embedding.classifier2.weight', 'span_embedding.classifier2.bias', 'end_outputs.weight', 'span_embedding.classifier1.bias', 'start_outputs.bias', 'start_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: Checkpoint directory /home/sunpeng/AXJ/MRC/outputs/ace2005/warmup0lr2e-5_drop0.3_norm1.0_weight0.1_warmup0_maxlen128 exists and is not empty with save_top_k != 0.All files in this directory will be deleted when a checkpoint is saved!
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1]
Using native 16bit precision.
Some weights of the model checkpoint at /home/sunpeng/AXJ/MRC/bert/bert-base-uncased were not used when initializing BertQueryNER: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
This IS expected if you are initializing BertQueryNER from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing BertQueryNER from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertQueryNER were not initialized from the model checkpoint at /home/sunpeng/AXJ/MRC/bert/bert-base-uncased and are newly initialized: ['end_outputs.weight', 'span_embedding.classifier2.weight', 'end_outputs.bias', 'start_outputs.bias', 'span_embedding.classifier1.weight', 'span_embedding.classifier1.bias', 'span_embedding.classifier2.bias', 'start_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
  File "/home/sunpeng/AXJ/MRC//train/mrc_ner_trainer.py", line 429, in <module>
    main()
  File "/home/sunpeng/AXJ/MRC//train/mrc_ner_trainer.py", line 416, in main
    trainer.fit(model)
  File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in fit
    results = self.accelerator_backend.spawn_ddp_children(model)
  File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 123, in spawn_ddp_children
    results = self.ddp_train(local_rank, mp_queue=None, model=model, is_master=True)
  File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 164, in ddp_train
    self.trainer.is_slurm_managing_tasks
  File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 908, in init_ddp_connection
    torch_distrib.init_process_group(torch_backend, rank=global_rank, world_size=world_size)
  File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8
Traceback (most recent call last):
  File "/home/sunpeng/AXJ/MRC/train/mrc_ner_trainer.py", line 429, in <module>
    main()
  File "/home/sunpeng/AXJ/MRC/train/mrc_ner_trainer.py", line 416, in main
    trainer.fit(model)
  File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1046, in fit
    self.accelerator_backend.train(model)
  File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 57, in train
    self.ddp_train(process_idx=self.task_idx, mp_queue=None, model=model)
  File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 164, in ddp_train
    self.trainer.is_slurm_managing_tasks
  File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 908, in init_ddp_connection
    torch_distrib.init_process_group(torch_backend, rank=global_rank, world_size=world_size)
  File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/sunpeng/.conda/envs/pytorch_gpu/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8

ShannonAI / mrc-for-flat-nested-ner, issue #107