Hi all, I would like to confirm this: I am seeing the same issue with the technology stack described above.
Hey @zouharvi @maxiek0071. Can you try the linked PR and let me know if that works (if it does not, post the error trace)?
You can install it like this:
python -m pip install git+https://github.com/Unbabel/COMET.git@refs/pull/160/head
Hi @BramVanroy,
I installed COMET from the branch you specified, and now I'm getting a similar error, this time for EvaluationLoop:
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Added key: store_based_barrier_key:1 to store for rank: 1
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Added key: store_based_barrier_key:1 to store for rank: 2
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Added key: store_based_barrier_key:1 to store for rank: 3
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
| Name | Type | Params
-----------------------------------------------------------
0 | encoder | XLMREncoder | 558 M
1 | layerwise_attention | LayerwiseAttention | 26
2 | train_metrics | RegressionMetrics | 0
3 | val_metrics | ModuleList | 0
4 | estimator | FeedForward | 10.5 M
-----------------------------------------------------------
10.5 M Trainable params
558 M Non-trainable params
569 M Total params
1,138.661 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:208: UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)
rank_zero_warn(
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Traceback (most recent call last):
File "/home/ubuntu/venv-comet-3.10/bin/comet-train", line 8, in <module>
sys.exit(train_command())
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/comet/cli/train.py", line 192, in train_command
trainer.fit(model)
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
mp.start_processes(
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 139, in _wrapping_function
results = function(*args, **kwargs)
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1204, in _run_train
self._run_sanity_check()
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1276, in _run_sanity_check
val_loop.run()
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 194, in run
self.on_run_start(*args, **kwargs)
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 84, in on_run_start
self._data_fetcher = iter(data_fetcher)
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 178, in __iter__
self.dataloader_iter = iter(self.dataloader)
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 435, in __iter__
return self._get_iterator()
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1034, in __init__
w.start()
File "/usr/local/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/local/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/usr/local/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/local/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/local/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/usr/local/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'EvaluationLoop.advance.<locals>.batch_to_device'
I suppose further adjustments are necessary.
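For reference, the root cause seems to be the combination of ddp_spawn and num_workers > 0 flagged in the warning above: the DataLoader workers are started with Python's spawn start method, which pickles whatever it is handed, and a function defined inside another function (here batch_to_device inside EvaluationLoop.advance) cannot be pickled. A minimal sketch, not COMET or Lightning code, that reproduces the same class of error:

```python
# Minimal reproduction sketch (hypothetical, not COMET/Lightning code):
# the "spawn" start method pickles the process target, and the standard
# pickler cannot serialize a function defined inside another function.
import multiprocessing as mp

def outer():
    def batch_to_device(batch):  # local object, analogous to EvaluationLoop.advance.<locals>.batch_to_device
        return batch

    ctx = mp.get_context("spawn")
    p = ctx.Process(target=batch_to_device, args=(1,))
    p.start()  # AttributeError: Can't pickle local object 'outer.<locals>.batch_to_device'
    p.join()

if __name__ == "__main__":
    outer()
```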
@maxiek0071 I've been looking at this over lunch and I have made some progress, but not enough, I believe. I do not have the time/patience currently to dig deeper into the idiosyncrasies of PyTorch Lightning (where the issue lies), but I've written up what the issue is and what someone with more experience can do to fix it in this PR: https://github.com/Unbabel/COMET/pull/160#issuecomment-1669470367
So perhaps if you share that PR with your network, other people may chime in and we can quickly solve it. But I cannot dig deeper into this for now, sorry! Maybe @ricardorei has some ideas.
Thanks @BramVanroy for your help, I appreciate it! I will first evaluate how not using encoder fine-tuning impacts the QE quality (#158). If the speed stays at 13.98 it/s throughout training, 3-4 epochs take about 12-15 hours for me.
Could @ricardorei confirm that they executed comet-train on multiple GPUs? Which Python and CUDA versions were they using?
Hi all! I'll look into this today.
I had this fixed before, but PyTorch Lightning likes to change things. Maybe it's just a quick fix... Like Bram said in his PR, I think the problem is with torchmetrics.
I updated lightning and torchmetrics and tested multi-GPU training; it was working. I used strategy: ddp and devices: 2, and everything went well.
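For anyone who wants to sanity-check these settings outside of the COMET config, they map directly onto PyTorch Lightning Trainer arguments. A minimal sketch, assuming a recent PyTorch Lightning release (in COMET itself these values come from the training YAML config, not Python code):

```python
# Minimal sketch (assumption: recent PyTorch Lightning release).
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,        # number of GPUs
    strategy="ddp",   # plain DDP rather than ddp_spawn, which hit the pickling error above
)
# trainer.fit(model)  # where `model` is your LightningModule
```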
Please give it a try.
Use the latest version, 2.1.0.
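For reference, installing that release from PyPI (assuming the published package name unbabel-comet) would look something like:
python -m pip install --upgrade unbabel-comet==2.1.0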
Hi @ricardorei, I have just checked with this version, and I can execute training on multiple GPUs. Thank you for your help!
Thanks, @ricardorei 🙂
I am attempting to run comet-train with multiple GPUs.
Command (abbreviated):
Config (abbreviated):
Output with error (abbreviated):
I'm using NVIDIA A10G GPUs and the following software versions: