declare-lab / flan-alpaca

This repository contains code for extending the Stanford Alpaca synthetic instruction tuning to existing instruction-tuned models such as Flan-T5.
Apache License 2.0

Unable to train on 4-5 GTX 1070s #17

Open sfxworks opened 1 year ago

sfxworks commented 1 year ago

[screenshot] I've got five 1070s here that I'm trying to train on. The memory goes away quickly at first: 40 GB of VRAM but only 64 GB of system memory. I added swap, but I imagine this will take forever to load. Are there other flags I can use to reduce this?

sfxworks commented 1 year ago

[screenshot] I can't quite tell. It seems to just be doing something with two Python threads. The memory usage went way down, so swap really isn't being used anymore. [screenshot] But only two GPUs have activity on them.

sfxworks commented 1 year ago

[screenshot] What is it doing here?

sfxworks commented 1 year ago

I ended up sending it a SIGKILL. Nothing was happening that I could tell: no IOPS or anything in Grafana.

chiayewken commented 1 year ago

Hi, could you give more details, such as the training command used? It is not recommended to fit a model that is too big, as it will cause excessive offloading of the model to CPU memory (assuming you are using FSDP), which is very slow.
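
(For context, here is a minimal sketch of what FSDP parameter offloading looks like in raw PyTorch, launched under torchrun. It is only an illustration of why offloading is slow, not the exact wrapping that training.py performs, and google/flan-t5-base is just a placeholder checkpoint here.)

# Illustration only: FSDP with CPU offload in plain PyTorch (launch with torchrun).
# This is NOT the exact setup used by training.py.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload
from transformers import AutoModelForSeq2SeqLM

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# offload_params=True keeps the sharded parameters in host (CPU) memory and
# copies them to the GPU for every forward/backward pass. When the model is
# large relative to VRAM, this host-to-device traffic dominates the step time.
fsdp_model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))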

sfxworks commented 1 year ago

I was using the example command from the README with the use_fsdp option:

python training.py --output_dir outputs/model/xl \
--use_fsdp \
--train_epochs 3 \
--max_source_length 64 \
--max_target_length 512 \
--data_path data/train.json \
--model_name_or_path "google/flan-t5-xl" \
--train_batch_size 1 \
--gradient_accumulation_steps 64

chiayewken commented 1 year ago

I see, it could be that the model is too large, causing slow CPU offload, or that FSDP is not working properly on your system. Does the same command work if you change to a smaller model, e.g. google/flan-t5-base or google/flan-t5-large?
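
(As a quick way to compare checkpoint sizes before launching a full run, something like the snippet below works; it is just an illustration, not part of this repo, and loading flan-t5-xl this way still needs roughly 11 GB of free system memory.)

# Rough size check for the Flan-T5 checkpoints discussed above (illustration only).
from transformers import AutoModelForSeq2SeqLM

for name in ["google/flan-t5-base", "google/flan-t5-large", "google/flan-t5-xl"]:
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    # 4 bytes per fp32 parameter; gradients, optimizer states and activations
    # need several times more on top of this during training.
    print(f"{name}: {n_params / 1e6:.0f}M params, ~{n_params * 4 / 1e9:.1f} GB in fp32")
    del model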

sfxworks commented 1 year ago

Base just gives a bus error:

(paca) root@anaconda-statefulset-0:~/flan-alpaca# python training.py --output_dir outputs/model/xl \
--use_fsdp \
--train_epochs 3 \
--max_source_length 64 \
--max_target_length 512 \
--data_path data/train.json \
--model_name_or_path "google/flan-t5-base" \
--train_batch_size 1 \
--gradient_accumulation_steps 64
Global seed set to 42
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-base
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
Downloading (…)lve/main/config.json: 100%|█████████████████████████████████████████████████████████████████████████| 1.40k/1.40k [00:00<00:00, 140kB/s]
Downloading model.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████| 990M/990M [00:18<00:00, 54.3MB/s]
/opt/conda/envs/paca/lib/python3.8/site-packages/transformers/modeling_utils.py:429: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(checkpoint_file, framework="pt") as f:
/opt/conda/envs/paca/lib/python3.8/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/opt/conda/envs/paca/lib/python3.8/site-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = cls(wrap_storage=untyped_storage)
/opt/conda/envs/paca/lib/python3.8/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
Downloading (…)neration_config.json: 100%|████████████████████████████████████████████████████████████████████████████| 147/147 [00:00<00:00, 15.4kB/s]
{'orig_state_dict': 284}
Downloading (…)okenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████| 2.54k/2.54k [00:00<00:00, 670kB/s]
Downloading spiece.model: 100%|█████████████████████████████████████████████████████████████████████████████████████| 792k/792k [00:00<00:00, 11.0MB/s]
Downloading (…)/main/tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████| 2.42M/2.42M [00:00<00:00, 19.2MB/s]
Downloading (…)cial_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████| 2.20k/2.20k [00:00<00:00, 570kB/s]
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/5
[rank: 2] Global seed set to 42
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-base
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
[rank: 1] Global seed set to 42
[rank: 4] Global seed set to 42
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-base
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-base
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
[rank: 3] Global seed set to 42
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-base
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
/opt/conda/envs/paca/lib/python3.8/site-packages/transformers/modeling_utils.py:429: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(checkpoint_file, framework="pt") as f:
/opt/conda/envs/paca/lib/python3.8/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/opt/conda/envs/paca/lib/python3.8/site-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = cls(wrap_storage=untyped_storage)
/opt/conda/envs/paca/lib/python3.8/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
/opt/conda/envs/paca/lib/python3.8/site-packages/transformers/modeling_utils.py:429: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(checkpoint_file, framework="pt") as f:
/opt/conda/envs/paca/lib/python3.8/site-packages/transformers/modeling_utils.py:429: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(checkpoint_file, framework="pt") as f:
/opt/conda/envs/paca/lib/python3.8/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/opt/conda/envs/paca/lib/python3.8/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/opt/conda/envs/paca/lib/python3.8/site-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = cls(wrap_storage=untyped_storage)
/opt/conda/envs/paca/lib/python3.8/site-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = cls(wrap_storage=untyped_storage)
/opt/conda/envs/paca/lib/python3.8/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
/opt/conda/envs/paca/lib/python3.8/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
/opt/conda/envs/paca/lib/python3.8/site-packages/transformers/modeling_utils.py:429: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(checkpoint_file, framework="pt") as f:
/opt/conda/envs/paca/lib/python3.8/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/opt/conda/envs/paca/lib/python3.8/site-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = cls(wrap_storage=untyped_storage)
/opt/conda/envs/paca/lib/python3.8/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
{'orig_state_dict': 284}
[rank: 2] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/5
{'orig_state_dict': 284}
{'orig_state_dict': 284}
{'orig_state_dict': 284}
[rank: 3] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/5
[rank: 4] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/5
[rank: 1] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/5
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 5 processes
----------------------------------------------------------------------------------------------------

Bus error (core dumped)

sfxworks commented 1 year ago

This also leaves an idling process behind:

|    4   N/A  N/A    504567      C   /opt/conda/envs/paca/bin/python             440MiB |

sfxworks commented 1 year ago

And here are the logs from the original model (google/flan-t5-xl):

(paca) root@anaconda-statefulset-0:~/flan-alpaca# python training.py --output_dir outputs/model/xl --use_fsdp --train_epochs 3 --max_source_length 64 --max_target_length 512 --data_path data/train.json --model_name_or_path "google/flan-t5-xl" --train_batch_size 1 --gradient_accumulation_steps 64
Global seed set to 42
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-xl
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
Downloading (…)l-00001-of-00002.bin: 100%|█████████████████████████████████████████████████████████████████████████| 9.45G/9.45G [01:25<00:00, 110MB/s]
Downloading (…)l-00002-of-00002.bin: 100%|████████████████████████████████████████████████████████████████████████| 1.95G/1.95G [00:35<00:00, 54.5MB/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [02:02<00:00, 61.04s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00,  2.54s/it]
Downloading (…)neration_config.json: 100%|████████████████████████████████████████████████████████████████████████████| 147/147 [00:00<00:00, 14.2kB/s]
{'orig_state_dict': 560}
Downloading (…)okenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████| 2.54k/2.54k [00:00<00:00, 601kB/s]
Downloading spiece.model: 100%|█████████████████████████████████████████████████████████████████████████████████████| 792k/792k [00:00<00:00, 11.0MB/s]
Downloading (…)/main/tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████| 2.42M/2.42M [00:00<00:00, 19.4MB/s]
Downloading (…)cial_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████| 2.20k/2.20k [00:00<00:00, 615kB/s]
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/5
[rank: 2] Global seed set to 42
[rank: 1] Global seed set to 42
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-xl
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
[rank: 4] Global seed set to 42
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-xl
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-xl
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
[rank: 3] Global seed set to 42
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-xl
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
Loading checkpoint shards:   0%|                                                                                                 | 0/2 [00:00<?, ?it/s]command terminated with exit code 137

sfxworks commented 1 year ago

And then, after giving it the RAM it needs via swap, I get the same bus error:

(paca) root@anaconda-statefulset-0:~/flan-alpaca#  python training.py --output_dir outputs/model/xl --use_fsdp --train_epochs 3 --max_source_length 64 --max_target_length 512 --data_path data/train.json --model_name_or_path "google/flan-t5-xl" --train_batch_size 1 --gradient_accumulation_steps 64
Global seed set to 42
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-xl
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.32s/it]
{'orig_state_dict': 560}
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/5
[rank: 1] Global seed set to 42
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-xl
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
[rank: 2] Global seed set to 42
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-xl
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
[rank: 3] Global seed set to 42
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-xl
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
[rank: 4] Global seed set to 42
"data_path":                   data/train.json
"debug":                       False
"gradient_accumulation_steps": 64
"learning_rate":               0.0005
"max_source_length":           64
"max_target_length":           512
"model_name_or_path":          google/flan-t5-xl
"output_dir":                  outputs/model/xl
"seed":                        42
"train_batch_size":            1
"train_epochs":                3
"use_compile":                 False
"use_fsdp":                    True
"use_gradient_checkpointing":  False
"use_lora":                    False
"weight_decay":                0.0
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [05:27<00:00, 163.98s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [05:31<00:00, 165.91s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [05:31<00:00, 165.66s/it]
{'orig_state_dict': 560}
{'orig_state_dict': 560}
{'orig_state_dict': 560}
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [05:31<00:00, 165.98s/it]
{'orig_state_dict': 560}
[rank: 2] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/5
[rank: 1] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/5
[rank: 3] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/5
[rank: 4] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/5
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 5 processes
----------------------------------------------------------------------------------------------------

Bus error (core dumped)
(paca)