1chizhang opened this issue 7 months ago (status: Open)
From Catalyst's `Runner.train` code and DDP documentation, it looks like one of these should work:

```
compressai-train ++engine.ddp=True
```

OR

```
compressai-train ++engine.engine="ddp"
```

OR

```python
from catalyst import dl
runner.train(engine=dl.DistributedDataParallelEngine())
```
P.S. It should automatically use all available detected GPUs. But if not, you may need to run `export CUDA_VISIBLE_DEVICES="0,1,2,3"` beforehand to enable GPU/CUDA devices 0, 1, 2, and 3.
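As an illustration of the same point, the variable can also be set from inside Python; the key caveat (an assumption worth verifying for your setup) is that it must happen before the CUDA runtime is first initialized, e.g. before `torch` is imported:

```python
import os

# Must be set before torch (or any other CUDA-using library) is imported;
# once the CUDA runtime initializes, the visible-device list is fixed
# for the lifetime of the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

# From this point on, the four selected physical GPUs appear to the
# process as cuda:0 .. cuda:3, regardless of their physical indices.
```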
The first one works, but it raises further problems that I could not solve. Can you provide the versions of all the packages you used for testing?
```
[2024-02-12 12:40:55,693][aim.sdk.reporter][INFO] - creating RunStatusReporter for 8c46505939814d73af135df2
[2024-02-12 12:40:55,693][aim.sdk.reporter][INFO] - starting from: {}
[2024-02-12 12:40:55,694][aim.sdk.reporter][INFO] - starting writer thread for <aim.sdk.reporter.RunStatusReporter object at 0x7f8db06d3070>
Error executing job with overrides: ['++criterion.lmbda=0.035', '++engine.ddp=True']
Traceback (most recent call last):
  File "/home/zhan5096/Anaconda/enter/envs/Trainer/bin/compressai-train", line 8, in
```
I'm guessing the `DB` object from aim needs to move across processes, but that object does not have a `__reduce__` defined for pickling/serialization.
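For reference, a minimal sketch of what that kind of fix looks like in general. The `DB` class below is a hypothetical stand-in (not aim's actual implementation): it holds an unpicklable handle, and `__reduce__` tells pickle to rebuild the object from its constructor arguments instead of serializing the live handle.

```python
import pickle
import threading

class DB:
    """Hypothetical stand-in for aim's DB object."""

    def __init__(self, path):
        self.path = path
        self._lock = threading.Lock()  # locks cannot be pickled directly

    def __reduce__(self):
        # Recreate the object from its constructor arguments on unpickle;
        # the lock is simply re-created fresh in the new process.
        return (DB, (self.path,))

# Round-trip through pickle now succeeds despite the lock attribute.
clone = pickle.loads(pickle.dumps(DB("/tmp/run.db")))
```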
The package requirements are in `pyproject.toml`. The exact versions used are specified in `poetry.lock`. Here is an exported `requirements.txt`:
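To compare environments quickly, a small sketch for checking which versions are actually installed locally, using only the standard library (the package list here is just the ones discussed in this thread; any of them may be absent):

```python
from importlib import metadata

def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions

# Packages relevant to this issue; adjust to your environment.
print(installed_versions(("catalyst", "aim", "accelerate", "torch")))
```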
Still, multi-GPU raises another question:

```
CUDA_VISIBLE_DEVICES=0,1 compressai-train --config-name="example" ++criterion.lmbda=0.035
```

will report:

```
DataParallelEngine.prepare_model() got an unexpected keyword argument 'device_placement'
```
catalyst in this issue suggests using `accelerate==0.5.1`, while the current version is 0.15.0. It seems that `catalyst` and `aim` are not very compatible at this time. :cry:
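If pinning is the workaround, a sketch of the corresponding constraint, assuming the project manages dependencies through Poetry's `pyproject.toml` as mentioned above (the exact section depends on the project's layout, and the pinned version should be verified against the other constraints):

```toml
[tool.poetry.dependencies]
# Hypothetical pin to the accelerate version the linked catalyst issue
# reports as working.
accelerate = "0.5.1"
```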
Hi, thanks for your work. I recently wanted to try multi-GPU training, but I realized that the default is to use DataParallel instead of DDP. Can you tell me where I can switch to DDP mode?