Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

The process freezes while loading model weights from a checkpoint when multiprocessing #15910

Closed. crazyn2 closed this issue 1 year ago

crazyn2 commented 1 year ago

### Bug description

The process freezes while loading model weights from a checkpoint when several training processes run in parallel.

### How to reproduce the bug

```python
import argparse
import os
import sys
import time
from datetime import datetime

import pytorch_lightning as pl
from pytorch_lightning import seed_everything

# Make the sibling packages importable when the script is run directly.
sys.path.append(os.getcwd())
sys.path.append(os.path.dirname(__file__))
from models import CIFAR10LeNetAutoencoder, CIFAR10LeNetSvdd
from datasets import CIFAR10DataModel

def cifar10_lenet(normal_class,
                  pre_epochs,
                  epochs,
                  seed,
                  log_path,
                  enable_progress_bar=False):
    seed_everything(seed, workers=True)
    # Give every run its own second-resolution timestamped log directory.
    log_path = log_path + datetime.now().strftime('%Y-%m-%d-%H%M%S')
    cifar10 = CIFAR10DataModel(batch_size=64, normal_class=normal_class)

    # Stage 1: pretrain the autoencoder.
    auto_enc = CIFAR10LeNetAutoencoder()
    trainer = pl.Trainer(accelerator="gpu",
                         devices=1,
                         default_root_dir=log_path,
                         max_epochs=pre_epochs,
                         enable_progress_bar=enable_progress_bar)
    trainer.fit(model=auto_enc, datamodule=cifar10)
    # No model is passed here, so Lightning restores the best checkpoint of
    # the preceding fit call (this triggers the UserWarning seen in the logs).
    trainer.test(datamodule=cifar10)

    # Stage 2: transfer the pretrained weights into the SVDD model and train it.
    at_enc_svdd = CIFAR10LeNetSvdd()
    at_enc_svdd.load_state_dict(auto_enc.state_dict())
    at_enc_svdd.init_center_c(auto_enc, cifar10.train_dataloader())
    trainer = pl.Trainer(accelerator="gpu",
                         devices=1,
                         default_root_dir=log_path,
                         max_epochs=epochs,
                         enable_progress_bar=enable_progress_bar)
    trainer.fit(model=at_enc_svdd, datamodule=cifar10)
    trainer.test(datamodule=cifar10)

if __name__ == '__main__':

    # perf_counter() measures wall-clock time; process_time() would only
    # count CPU time and under-report how long the run actually took.
    start_time = time.perf_counter()
    parser = argparse.ArgumentParser(description="Deep SVDD")
    parser.add_argument('--seed', type=int, default=0)
    parser.add_argument('--normal_class', type=int, default=0)
    parser.add_argument('--pre_epochs', type=int, default=0)
    parser.add_argument('--epochs', type=int, default=0)
    parser.add_argument("--progress_bar", action="store_true")
    parser.add_argument("--log_path", type=str)
    args = parser.parse_args()
    cifar10_lenet(normal_class=args.normal_class,
                  pre_epochs=args.pre_epochs,
                  epochs=args.epochs,
                  seed=args.seed,
                  enable_progress_bar=args.progress_bar,
                  log_path=args.log_path)

    end_time = time.perf_counter()
    m, s = divmod(end_time - start_time, 60)
    h, m = divmod(m, 60)
    print("process took %02d:%02d:%02d" % (h, m, s))

```sh
SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
seed=${2:-9}
n_epochs=2
pretrain_n_epochs=1
objective=pl_svdd/
log_dir=$SCRIPT_DIR/../bash-log/$objective
export CUDA_VISIBLE_DEVICES=0
for i in {0..4}; do
    {
        python $SCRIPT_DIR/../main/cifar10_lenet.py --seed $seed --pre_epochs $pretrain_n_epochs --epochs $n_epochs --normal_class $i --log_path $log_dir
    } &
    # Stagger the launches so the timestamped log directories differ.
    sleep 4
done

export CUDA_VISIBLE_DEVICES=1
for i in {5..9}; do
    {
        python $SCRIPT_DIR/../main/cifar10_lenet.py --seed $seed --pre_epochs $pretrain_n_epochs --epochs $n_epochs --normal_class $i --log_path $log_dir
    } &
    sleep 4
done

# Block until all ten background runs have finished.
wait
```
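The `sleep 4` between launches only works because `log_path` is timestamped to the second: two processes started within the same second would write into the same `default_root_dir`. A hedged alternative (my own sketch, not part of the original code, with a hypothetical `unique_log_path` helper) is to make the directory name unique regardless of start time:

```python
import os
from datetime import datetime

def unique_log_path(base, normal_class):
    # Timestamp plus normal class plus PID: unique even when two
    # processes start within the same second.
    stamp = datetime.now().strftime('%Y-%m-%d-%H%M%S')
    return f"{base}{stamp}-c{normal_class}-pid{os.getpid()}"
```

With a scheme like that, the staggered `sleep 4` calls would no longer be needed for uniqueness.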

### Error messages and logs

  | Name    | Type           | Params
--------------------------------------------
0 | encoder | CIFAR10Encoder | 520 K
1 | decoder | CIFAR10Decoder | 284 K
2 | mse     | MSELoss        | 0
--------------------------------------------
804 K     Trainable params
0         Non-trainable params
804 K     Total params
3.218     Total estimated model params size (MB)
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /home/zby/Workspaces/anomaly-detection/bash-log/pl_svdd/2022-12-05-224939/lightning_logs
2022-12-05 22:49:41.124533: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
Trainer.fit stopped: max_epochs=1 reached.
/home/zby/anaconda3/envs/machine-learning/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:134: UserWarning: .test(ckpt_path=None) was called without a model. The best model of the previous fit call will be used. You can pass .test(ckpt_path='best') to use the best model or .test(ckpt_path='last') to use the last model. If you pass a value, this warning will be silenced.
  rank_zero_warn(
Restoring states from the checkpoint path at /home/zby/Workspaces/anomaly-detection/bash-log/pl_svdd/2022-12-05-224925/lightning_logs/version_0/checkpoints/epoch=0-step=78.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]
Loaded model weights from checkpoint at /home/zby/Workspaces/anomaly-detection/bash-log/pl_svdd/2022-12-05-224925/lightning_logs/version_0/checkpoints/epoch=0-step=78.ckpt
Global seed set to 9
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /home/zby/Workspaces/anomaly-detection/bash-log/pl_svdd/2022-12-05-224945/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]

  | Name    | Type           | Params
--------------------------------------------
0 | encoder | CIFAR10Encoder | 520 K
1 | decoder | CIFAR10Decoder | 284 K
2 | mse     | MSELoss        | 0
--------------------------------------------
804 K     Trainable params
0         Non-trainable params
804 K     Total params
3.218     Total estimated model params size (MB)
Trainer.fit stopped: max_epochs=1 reached.
/home/zby/anaconda3/envs/machine-learning/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:134: UserWarning: .test(ckpt_path=None) was called without a model. The best model of the previous fit call will be used. You can pass .test(ckpt_path='best') to use the best model or .test(ckpt_path='last') to use the last model. If you pass a value, this warning will be silenced.
  rank_zero_warn(
Restoring states from the checkpoint path at /home/zby/Workspaces/anomaly-detection/bash-log/pl_svdd/2022-12-05-224930/lightning_logs/version_0/checkpoints/epoch=0-step=78.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]
Loaded model weights from checkpoint at /home/zby/Workspaces/anomaly-detection/bash-log/pl_svdd/2022-12-05-224930/lightning_logs/version_0/checkpoints/epoch=0-step=78.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]

  | Name    | Type           | Params
--------------------------------------------
0 | encoder | CIFAR10Encoder | 520 K
1 | decoder | CIFAR10Decoder | 284 K
2 | mse     | MSELoss        | 0
--------------------------------------------
804 K     Trainable params
0         Non-trainable params
804 K     Total params
3.218     Total estimated model params size (MB)
Trainer.fit stopped: max_epochs=1 reached.
/home/zby/anaconda3/envs/machine-learning/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:134: UserWarning: .test(ckpt_path=None) was called without a model. The best model of the previous fit call will be used. You can pass .test(ckpt_path='best') to use the best model or .test(ckpt_path='last') to use the last model. If you pass a value, this warning will be silenced.
  rank_zero_warn(
Restoring states from the checkpoint path at /home/zby/Workspaces/anomaly-detection/bash-log/pl_svdd/2022-12-05-224934/lightning_logs/version_0/checkpoints/epoch=0-step=78.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]
Loaded model weights from checkpoint at /home/zby/Workspaces/anomaly-detection/bash-log/pl_svdd/2022-12-05-224934/lightning_logs/version_0/checkpoints/epoch=0-step=78.ckpt
Trainer.fit stopped: max_epochs=1 reached.
/home/zby/anaconda3/envs/machine-learning/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:134: UserWarning: .test(ckpt_path=None) was called without a model. The best model of the previous fit call will be used. You can pass .test(ckpt_path='best') to use the best model or .test(ckpt_path='last') to use the last model. If you pass a value, this warning will be silenced.
  rank_zero_warn(
Restoring states from the checkpoint path at /home/zby/Workspaces/anomaly-detection/bash-log/pl_svdd/2022-12-05-224939/lightning_logs/version_0/checkpoints/epoch=0-step=78.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]
Loaded model weights from checkpoint at /home/zby/Workspaces/anomaly-detection/bash-log/pl_svdd/2022-12-05-224939/lightning_logs/version_0/checkpoints/epoch=0-step=78.ckpt
Trainer.fit stopped: max_epochs=1 reached.
/home/zby/anaconda3/envs/machine-learning/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:134: UserWarning: .test(ckpt_path=None) was called without a model. The best model of the previous fit call will be used. You can pass .test(ckpt_path='best') to use the best model or .test(ckpt_path='last') to use the last model. If you pass a value, this warning will be silenced.
  rank_zero_warn(
Restoring states from the checkpoint path at /home/zby/Workspaces/anomaly-detection/bash-log/pl_svdd/2022-12-05-224945/lightning_logs/version_0/checkpoints/epoch=0-step=78.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]
Loaded model weights from checkpoint at /home/zby/Workspaces/anomaly-detection/bash-log/pl_svdd/2022-12-05-224945/lightning_logs/version_0/checkpoints/epoch=0-step=78.ckpt


### Environment

### More info

No response

crazyn2 commented 1 year ago

It's my own fault, sorry.