Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

PyTorch Lightning Trainer freezes and won't time out even if minutes are set. #17033

Closed dsm-72 closed 1 year ago

dsm-72 commented 1 year ago

Bug description

I was trying to train a model with pl.Trainer. It runs for a few epochs, but after only 2 or 3 epochs it kept freezing the kernel (I couldn't even kill it). So I set max_time={'minutes':2} (following the documentation examples), and after 5 minutes it is still "going" strong. This happens on both CPU and GPU (I tried a 1080ti and a 3090).
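One possible explanation, offered as an assumption rather than a confirmed diagnosis: a time limit that is only checked between training steps cannot interrupt a step that never returns. The sketch below is a plain-Python simulation of that idea, not Lightning's actual code; `run_with_cooperative_limit` and the `steps` list are hypothetical names made up for illustration.

```python
import time
from datetime import timedelta


def run_with_cooperative_limit(steps, max_time):
    """Run callables in order, checking the time budget only after each one returns.

    `max_time` is a dict of timedelta keyword arguments, e.g. {"minutes": 2}.
    """
    limit = timedelta(**max_time).total_seconds()
    start = time.monotonic()
    completed = 0
    for step in steps:
        step()  # a step that never returns would block here forever
        completed += 1
        if time.monotonic() - start >= limit:
            break  # the limit is only noticed between steps
    return completed, time.monotonic() - start


# Hypothetical steps: the second one is slow and stands in for a stalled batch.
steps = [
    lambda: time.sleep(0.01),
    lambda: time.sleep(0.3),
    lambda: time.sleep(0.01),
]
completed, elapsed = run_with_cooperative_limit(steps, {"seconds": 0.1})
print(f"completed {completed} steps in {elapsed:.2f}s with a 0.10s limit")
```

The slow step overshoots the limit because nothing checks the clock while it runs. If the real training step blocks indefinitely, a check of this kind would never be reached, which would be consistent with the progress bar staying at 0/2.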

How to reproduce the bug

No response

Error messages and logs


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                 | Params
-----------------------------------------------
0 | model | NeuralODE            | 924   
1 | loss  | OptimalTransportLoss | 0     
2 | z_net | Sequential           | 30.3 K
-----------------------------------------------
31.2 K    Trainable params
0         Non-trainable params
31.2 K    Total params
0.125     Total estimated model params size (MB)
/path/to/conda_envs/my_env/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/path/to/conda_envs/my_env//lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:1609: PossibleUserWarning: The number of training batches (2) is smaller than the logging interval Trainer(log_every_n_steps=5). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Epoch 1: 0% 0/2 [00:00<?, ?it/s, loss=2.23e+04, v_num=6]

Environment

Current environment

```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```

More info

No response

awaelchli commented 1 year ago

Hey @dsm-72, do you have a runnable piece of code you could share? Unfortunately, I can't guess what's wrong here without looking at the code. I suggest disabling as many features and as much code as possible to narrow down where the issue is.
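As one way to narrow down where a freeze happens, here is a minimal sketch using Python's standard `faulthandler` module to dump the stack traces of all threads if a block of code runs longer than expected. The helper name `hang_watchdog` and the log file name are hypothetical; nothing here is part of Lightning.

```python
import faulthandler
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, log_path, exit_on_timeout=False):
    """Write all thread tracebacks to `log_path` if the body exceeds `timeout_s` seconds."""
    with open(log_path, "w") as log_file:
        # The file must stay open until the watchdog is cancelled.
        faulthandler.dump_traceback_later(
            timeout_s, repeat=False, file=log_file, exit=exit_on_timeout
        )
        try:
            yield
        finally:
            faulthandler.cancel_dump_traceback_later()


# Demo with a stand-in for the call that hangs: sleep past a 1-second timeout.
demo_log = os.path.join(tempfile.mkdtemp(), "hang_trace.log")
with hang_watchdog(1, demo_log):
    time.sleep(1.5)

with open(demo_log) as f:
    trace = f.read()
print(trace)
```

In practice the body of the `with` block would be the call that freezes, for example `trainer.fit(...)`, with a timeout of a minute or two. The resulting traceback shows which line each thread was executing when the timeout fired. Passing `exit_on_timeout=True` also terminates the process after the dump, which may help when the kernel cannot be killed by other means. The sketch writes to a file rather than `sys.stderr` because `faulthandler` needs a real file descriptor, which a notebook kernel's stderr may not provide.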

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!