Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Auto restart machine when training #16850

Closed · phamkhactu closed this issue 1 year ago

phamkhactu commented 1 year ago

Bug description

Thanks for the great repo. I want to finetune a Whisper model following the tutorial, but my machine has two 3090 Ti cards. The model starts training, then hangs and the machine restarts. I think the model trains fine with a single card in the machine, but using more than one causes the issue.
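
For context, a minimal sketch of a comparable setup is shown below. The LightningModule, model, and data are placeholders (not the actual Whisper fine-tuning script), but the Trainer flags match the logged configuration: 16-bit AMP, 2 GPUs, DDP over NCCL.

# Minimal sketch only: placeholder model/data standing in for the Whisper fine-tuning code.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):          # stand-in for the Whisper LightningModule
    def __init__(self):
        super().__init__()
        self.model = nn.Linear(80, 512)       # placeholder for the Whisper network
        self.loss_fn = nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.model(x)
        return self.loss_fn(logits, y)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)


if __name__ == "__main__":
    data = TensorDataset(torch.randn(256, 80), torch.randint(0, 512, (256,)))
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,            # both 3090 Ti cards, matching CUDA_VISIBLE_DEVICES: [0,1]
        strategy="ddp",       # distributed_backend=nccl in the log below
        precision=16,         # "Using 16bit ... Automatic Mixed Precision (AMP)"
        max_epochs=1,
    )
    trainer.fit(ToyModule(), DataLoader(data, batch_size=8))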

How to reproduce the bug

No response

Error messages and logs

Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

/home/tupk/anaconda3/envs/dl/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:613: UserWarning: Checkpoint directory /home/tupk/tupk/ASR/whisper/artifacts/checkpoint exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
/home/tupk/anaconda3/envs/dl/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

  | Name    | Type             | Params
---------------------------------------------
0 | model   | Whisper          | 71.8 M
1 | loss_fn | CrossEntropyLoss | 0     
---------------------------------------------
52.0 M    Trainable params
19.8 M    Non-trainable params
71.8 M    Total params
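
Since the hang only appears when both GPUs are in use, a bare torch.distributed all_reduce between the two cards can help separate a Lightning problem from an NCCL or hardware one. The script and run command below are a hypothetical sketch of such a check, not part of the original report:

# nccl_check.py (hypothetical file name), run with:
#   torchrun --nproc_per_node=2 nccl_check.py
import torch
import torch.distributed as dist


def main():
    dist.init_process_group("nccl")              # same backend shown in the log above
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)                           # exercises GPU-to-GPU communication
    print(f"rank {rank}: all_reduce result = {t.item()}")  # expect 2.0 with 2 processes
    dist.destroy_process_group()


if __name__ == "__main__":
    main()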

Environment

More info

No response

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

ayansengupta17 commented 1 year ago

Any update on this?