Maggione opened this issue 1 year ago
can you share the command you ran? did you use configs that came with the repo, or did you update them?
```shell
dora run -d solver=musicgen/musicgen_base_32khz model/lm/model_scale=small continue_from=//pretrained/facebook/musicgen-small conditioner=text2music
```
My solver config is as follows:
```yaml
# @package __global__

# This is the training loop solver
# for the base MusicGen model (text-to-music)
# on monophonic audio sampled at 32 kHz
defaults:
  - musicgen/default
  - /model: lm/musicgen_lm
  - override /dset: audio/data
  - _self_

autocast: true
autocast_dtype: float16

# EnCodec large trained on mono-channel music audio sampled at 32khz
# with a total stride of 640 leading to 50 frames/s.
# rvq.n_q=4, rvq.bins=2048, no quantization dropout
# (transformer_lm card and n_q must be compatible)
compression_model_checkpoint: //pretrained/facebook/encodec_32khz

channels: 1
sample_rate: 32000

deadlock:
  use: true  # deadlock detection

dataset:
  batch_size: 8  # 32 GPUs
  sample_on_weight: false  # Uniform sampling all the way
  sample_on_duration: false  # Uniform sampling all the way
  segment_duration: 30.0

generate:
  lm:
    use_sampling: true
    top_k: 250
    top_p: 0.0

optim:
  epochs: 500
  optimizer: dadam
  lr: 1
  ema:
    use: true
    updates: 10
    device: cuda

logging:
  log_tensorboard: true

schedule:
  lr_scheduler: cosine
  cosine:
    warmup: 4000
    lr_min_ratio: 0.0
    cycle_length: 1.0
```
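The timeout discussed later in this thread lives under the same `deadlock` section of the solver config. A sketch of that section with the timeout key added (the 600 s value is only an illustrative assumption, not the repo default):

```yaml
deadlock:
  use: true      # enable/disable the deadlock watchdog
  timeout: 600   # seconds without progress before the detector fires (example value)
```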
Hi, you can control the deadlock detector with the `deadlock.*` config keys, e.g. either disable the deadlock detector using `deadlock.use=false` or extend the timeout threshold with `deadlock.timeout=<>` (in seconds).
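Applied to the command from earlier in the thread, those overrides would look like this (the 600-second timeout is an illustrative assumption, not a recommended value):

```shell
# Option 1: disable the deadlock detector entirely
dora run -d solver=musicgen/musicgen_base_32khz \
  model/lm/model_scale=small \
  continue_from=//pretrained/facebook/musicgen-small \
  conditioner=text2music \
  deadlock.use=false

# Option 2: keep the detector, but raise the timeout (in seconds)
dora run -d solver=musicgen/musicgen_base_32khz \
  model/lm/model_scale=small \
  continue_from=//pretrained/facebook/musicgen-small \
  conditioner=text2music \
  deadlock.timeout=600
```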
@JadeCopet which of these options would you recommend: disabling, or extending the timeout? I'm on (at most) a single-node 8xGPU machine right now.
You can disable it. You can always re-enable it later with an extended timeout value.
I'm having a similar problem (with `dora launch` on Slurm, in my case). Will disabling it actually allow training to progress? If there's an actual deadlock somewhere, isn't it possible that training will just hang indefinitely? Is there any way to figure out why it's deadlocking? Is it a data problem?
One more note on my situation: GPU utilization shoots straight up to 100% on all GPUs (2 nodes, 8xH100 each).
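On what disabling actually does: the deadlock detector is a watchdog, not a fix, so turning it off only removes the kill switch. If something genuinely hangs, the job will stall silently instead of dying with a traceback. A minimal sketch of the watchdog pattern, heavily simplified and not audiocraft's actual implementation (class and method names here are illustrative):

```python
import faulthandler
import os
import queue
import signal
import threading


class DeadlockDetect:
    """Watchdog: the training loop reports each stage via update().
    If no report arrives within `timeout` seconds, dump all stack
    traces and hard-kill the process.  Simplified sketch only."""

    def __init__(self, use: bool = True, timeout: float = 600.0):
        self.use = use
        self.timeout = timeout
        self.last_stage = "init"  # matches the "last stage was init" message
        self._events: queue.Queue = queue.Queue()

    def update(self, stage: str) -> None:
        """Called from the training loop at each stage boundary."""
        if self.use:
            self._events.put(stage)

    def _watch(self) -> None:
        while True:
            try:
                stage = self._events.get(timeout=self.timeout)
            except queue.Empty:
                # No progress within the timeout: report and kill.
                print(f"Deadlock detector timed out, last stage was {self.last_stage}")
                faulthandler.dump_traceback()          # show where each thread is stuck
                os.kill(os.getpid(), signal.SIGKILL)   # hard-kill the hung process
            else:
                if stage is None:  # sentinel from stop(): clean shutdown
                    return
                self.last_stage = stage

    def start(self) -> None:
        if self.use:
            threading.Thread(target=self._watch, daemon=True).start()

    def stop(self) -> None:
        if self.use:
            self._events.put(None)
```

Under this pattern, "last stage was init" means the loop never reported progress past initialization (e.g. distributed rendezvous or data loading never completed), so raising the timeout only helps if init is slow rather than truly stuck.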
"will die on eval after 1 epoch. To get rid of the deadlock, comment out lines 478-487 in `audiocraft/audiocraft/solvers/base.py`"
When I train a MusicGen model using a small training set, the training process proceeds normally. However, when I switch to a larger training set of about 20,000 samples, an error occurs: `Deadlock detector timed out, last stage was init`. How can I solve it? Thank you!