Use a single training script for single gpu, DDP and FSDP

carbonscott / maxie

Masked Autoencoder for X-ray Image Encoding (MAXIE)


[cwang31@batch5 train]$ OMP_NUM_THREADS=1 python train.fsdp.py experiments/yaml/start-fsd.yaml
NO distributed environment is required.  RANK:0,LOCAL_RANK:0,WORLD_SIZE:1
--> total memory per gpu (GB) = 15.0
Updating patch_embeddings num_channels from 3 to 1
[RANK 0] Confguring model checkpoint...
[RANK 0] Confguring model, optim, scheduler, training state checkpoint...
[RANK 0] Epoch:   0%|          | 0/5 [00:00<?, ?it/s]
memory stats reset, ready to track          | 0/4171842 [00:00<?, ?it/s]
[RANK 0] Mini batch:   0%|          | 0/4 [00:00<?, ?it/s]
[RANK 0] Mini batch:  25%|██▌       | 1/4 [04:16<12:49, 256.62s/it
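
The "NO distributed environment is required" line, together with RANK:0, LOCAL_RANK:0 and WORLD_SIZE:1, suggests the script falls back to single-process defaults when no launcher is involved. Below is a minimal sketch of that kind of detection, assuming the launcher (for example torchrun) exports RANK, LOCAL_RANK and WORLD_SIZE. The names `DistEnv` and `detect_dist_env` are hypothetical and are not taken from train.fsdp.py.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    """Process placement as seen by one training process (hypothetical helper)."""
    rank: int
    local_rank: int
    world_size: int

    @property
    def uses_dist(self) -> bool:
        # With more than one process, a process group has to be initialized.
        return self.world_size > 1


def detect_dist_env(environ=None) -> DistEnv:
    """Read launcher-provided variables, defaulting to a single process."""
    env = os.environ if environ is None else environ
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


def describe(dist_env: DistEnv) -> str:
    """Format a status line similar to the one in the log above."""
    prefix = "" if dist_env.uses_dist else "NO distributed environment is required.  "
    return (f"{prefix}RANK:{dist_env.rank},"
            f"LOCAL_RANK:{dist_env.local_rank},"
            f"WORLD_SIZE:{dist_env.world_size}")
```

With this shape, the same script can branch on `uses_dist`: skip process-group setup for a plain `python train.fsdp.py ...` run, and initialize it (then wrap the model in DDP or FSDP) when a launcher sets the variables.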

Works perfectly on one GPU:

[cwang31@batch3 train]$ python train.fsdp.py experiments/yaml/single-gpu.yaml
NO distributed environment is required.  RANK:0,LOCAL_RANK:0,WORLD_SIZE:1
--> total memory per gpu (GB) = 15.0
Updating patch_embeddings num_channels from 3 to 1
[RANK 0] Confguring model checkpoint...
[RANK 0] Confguring model, optim, scheduler, training state checkpoint...
[RANK 0] Epoch:   0%|          | 0/5 [00:00<?, ?it/s]
memory stats reset, ready to track          | 0/16687368 [00:00<?, ?it/s]
[RANK 0] Mini batch: 100%|██████████| 1/1 [00:00<00:00,  1.01it/s]
[RANK 0] Eval(training set): 100%|██████████| 1/1 [00:18<00:00, 18.07s/it]
[RANK 0] Eval(validation set): 100%|██████████| 1/1 [00:16<00:00, 16.24s/it]
RANK 0 - Model loaded.
RANK 0 - Optimizer loaded.
RANK 0 - Scheduler loaded.
RANK 0 - Training state loaded.
RANK 0 - Checkpoint path loaded.

--> cuda max reserved memory = 10.2051
--> max reserved percentage = 68.03 %

--> cuda max memory allocated = 8.9894
--> max allocated percentage = 59.93 %

--> peak active memory = 8.9894
--> peak active memory 59.93 %

cudaMalloc retries = 0
cuda OOM = 0

memory stats reset, ready to track          | 1/16687368 [00:40<188918:58:29, 40.76s/it]
[RANK 0] Mini batch: 100%|██████████| 1/1 [00:24<00:00, 24.44s/it]
[RANK 0] Eval(training set): 100%|██████████| 1/1 [00:14<00:00, 14.15s/it]
[RANK 0] Mini batch: 100%|██████████| 1/1 [00:24<00:00, 24.43s/it]
[RANK 0] Eval(training set): 100%|██████████| 1/1 [00:14<00:00, 14.14s/it]
[RANK 0] Eval(validation set):   0%|          | 0/1 [00:00<?, ?it/s]
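
The percentages in the memory report of the single-GPU log are consistent with dividing each figure by the reported total memory per GPU. Here is a small worked check, assuming all values are in GB and that the denominator is the 15.0 printed at startup. The helper name `percent_of_total` is hypothetical.

```python
def percent_of_total(value_gb: float, total_gb: float) -> float:
    """Return value_gb as a percentage of total_gb, rounded to two decimals."""
    return round(100.0 * value_gb / total_gb, 2)


total_gb = 15.0             # "--> total memory per gpu (GB) = 15.0"
max_reserved_gb = 10.2051   # "--> cuda max reserved memory = 10.2051"
max_allocated_gb = 8.9894   # "--> cuda max memory allocated = 8.9894"

print(percent_of_total(max_reserved_gb, total_gb))   # 68.03
print(percent_of_total(max_allocated_gb, total_gb))  # 59.93
```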

checkpoint:
  chkpt_saving_period: 1
  directory: experiments/chkpts
  prefix: single-gpu
  path_chkpt_prev: null
  pretrain: null
dataset:
  batch_size: 1
  num_workers: 1
  path_train: experiments/datasets/dataset.train.json
  path_eval: experiments/datasets/dataset.eval.json
  seg_size: 1
  entry_per_cycle: 1
  debug: true
  server_address:
  - localhost
  - 5000
  transforms:
    norm:
      Rayonix:
        mean: 116.92
        std: 22.89
      epix10k2M:
        mean: 46.6
        std: 98.3
      jungfrau4M:
        mean: 593.17
        std: 204.13
    H_pad: 2048
    W_pad: 2048
    num_patch: 100
    size_patch: 20
    angle_max: 360
    frac_shift_max: 0.1
    downscale_factors:
    - 2
    - 2
    var_size_patch: 0.2
    patch_size: 224
    stride: 224
dist:
  backend: nccl
  uses_unique_world_seed: true
  dtype: float16
logging:
  directory: experiments/logs
  prefix: single-gpu
  level: debug
loss:
  grad_accum_steps: 2
lr_scheduler:
  min_lr: 1.0e-07
  total_iterations: 1000000
  uses_prev: true
  warmup_iterations: 5
  scheduler_step_period: 50
misc:
  max_epochs: 5
  max_eval_iter: 1
  compiles_model: false
  data_dump_on: false
  cpu_only: false
model:
  name: facebook/vit-mae-base
optim:
  grad_clip: 1.0
  lr: 0.0002
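
To relate the config to the run, here is a rough sketch of how many samples contribute to one optimizer step. It rests on two assumptions that the config alone does not confirm: gradients are accumulated over `grad_accum_steps` mini-batches before each optimizer step, and every rank processes `batch_size` samples per mini-batch. The function name is hypothetical.

```python
def samples_per_optimizer_step(batch_size: int,
                               grad_accum_steps: int,
                               world_size: int = 1) -> int:
    """Samples contributing to one optimizer step (assumed semantics)."""
    if min(batch_size, grad_accum_steps, world_size) < 1:
        raise ValueError("all arguments must be positive integers")
    return batch_size * grad_accum_steps * world_size


# Values from the single-GPU config: batch_size: 1, grad_accum_steps: 2,
# and WORLD_SIZE:1 from the log.
print(samples_per_optimizer_step(batch_size=1, grad_accum_steps=2, world_size=1))  # 2
```

Under the same assumptions, moving this config to several ranks would scale the count by the world size, which matters when comparing single-GPU and DDP/FSDP runs at a fixed learning rate.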

Use a single training script for single gpu, DDP and FSDP #11