OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0
8.22k stars 821 forks

Hardware spec for finetuning >7B Llama #160

Closed. ChaoChungWu-Johnson closed this issue 1 year ago.

ChaoChungWu-Johnson commented 1 year ago

Hi, thanks for this nice repo!

I keep getting a kill signal with error code -11 after running examples/finetune.py with my own 10k dataset (text_only) on a single A100 40GB server with 85GB of CPU RAM, so I was wondering whether this happens because my CPU RAM is not enough. Since I also plan to finetune larger Llama models with LMFlow, would you mind sharing the hardware specs with which you have successfully trained Llama 7B, 13B, and 33B? GPU type, required RAM, and the corresponding parameter settings would be very helpful :). Thanks again!

research4pan commented 1 year ago

Thanks for your interest in LMFlow! Yes, it is highly possible that this was caused by insufficient RAM. We've successfully run Llama 7B finetuning on an RTX 3090 GPU, on a server equipped with around 200GB of RAM. However, that is simply the hardware setting of our server; machines with less memory can also handle this type of experiment. For Llama 13B, you may need more GPU memory, such as a V100 (32G). For Llama 33B, an A6000 (48G) or A100 (40G, 80G) may be required.

Hope that answers your question 😄
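
A quick way to confirm whether CPU RAM really is the limit is to watch system memory while the job runs. The snippet below is a minimal sketch, not part of LMFlow; it assumes the third-party psutil package is installed (pip install psutil) and is meant to run in a second terminal alongside examples/finetune.py.

import time
import psutil

# Poll overall system memory every 5 seconds; if "available" collapses toward
# zero right before the job dies, insufficient CPU RAM is the likely culprit.
while True:
    vm = psutil.virtual_memory()
    print(f"used {vm.used / 1e9:7.1f} GB | available {vm.available / 1e9:7.1f} GB | {vm.percent:5.1f}% in use")
    time.sleep(5)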

ChaoChungWu-Johnson commented 1 year ago

Hi @research4pan, thanks for the reply. Does error code -11 imply insufficient CPU RAM? I changed Llama 7B to the gpt2 base model on the same hardware (single A100 40GB, with 85GB CPU RAM), and the error was still the same:

[2023-04-10 12:32:21,704] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-10 12:32:21,712] [INFO] [runner.py:550:main] cmd = /opt/conda/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --deepspeed configs/ds_config_zero3.json --bf16 --run_name finetune_with_martechQA_lora --model_name_or_path gpt2 --num_train_epochs 1 --learning_rate 2e-5 --dataset_path /workspace/sharing/johnsonwu/LMFlow/data/martechQA/train --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --validation_split_percentage 0 --use_lora 1 --lora_r 8 --logging_steps 20 --block_size 512 --do_train --output_dir /workspace/sharing/johnsonwu/LMFlow/output_models/finetune --overwrite_output_dir --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-04-10 12:32:23,504] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-10 12:32:23,504] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-10 12:32:23,504] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-10 12:32:23,504] [INFO] [launch.py:162:main] dist_world_size=1
[2023-04-10 12:32:23,504] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-10 12:32:26,512] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/10/2023 12:32:26 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
04/10/2023 12:32:27 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-17a0376a73bc462c/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
[2023-04-10 12:32:29,958] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 0.16B parameters
/opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
trainable params: 294912 || all params: 124734720 || trainable%: 0.23643136409814364
04/10/2023 12:32:30 - WARNING - datasets.fingerprint - Parameter 'function'=<function HFDecoderModel.tokenize.<locals>.tokenize_function at 0x7ff1f48fba60> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
04/10/2023 12:32:30 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-17a0376a73bc462c/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-1c80317fa3b1799d.arrow
04/10/2023 12:32:30 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-17a0376a73bc462c/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-bbe2d282518ba636.arrow
Installed CUDA version 11.6 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.6247336864471436 seconds
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.28094935417175293 seconds
[2023-04-10 12:33:17,568] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 6701
[2023-04-10 12:33:17,568] [ERROR] [launch.py:324:sigkill_handler] ['/opt/conda/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'finetune_with_martechQA_lora', '--model_name_or_path', 'gpt2', '--num_train_epochs', '1', '--learning_rate', '2e-5', '--dataset_path', '/workspace/sharing/johnsonwu/LMFlow/data/martechQA/train', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--validation_split_percentage', '0', '--use_lora', '1', '--lora_r', '8', '--logging_steps', '20', '--block_size', '512', '--do_train', '--output_dir', '/workspace/sharing/johnsonwu/LMFlow/output_models/finetune', '--overwrite_output_dir', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -11
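
For reference, the launcher reports the subprocess's raw return code, and a negative value here normally corresponds to the POSIX signal that terminated the worker. Decoding it with Python's standard signal module can help narrow things down (a minimal sketch, not specific to LMFlow; signal numbers are as on Linux):

import signal

# A negative launcher return code is -1 * (terminating signal number).
print(signal.Signals(11).name)  # SIGSEGV -> the "return code = -11" above (segmentation fault)
print(signal.Signals(9).name)   # SIGKILL -> what the Linux OOM killer sends
print(signal.Signals(7).name)   # SIGBUS  -> often related to exhausted shared memory
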
ChaoChungWu-Johnson commented 1 year ago

and same error code when simply running scripts/run_finetune.sh 😢

research4pan commented 1 year ago

Hi! I am wondering whether the script runs successfully with the officially provided datasets under data? Also, it would be nice if you could provide a piece of your own dataset, so we can check whether the format is correct. In addition, you may also try removing the cached dataset under /root/.cache/huggingface/datasets, as corrupted dataset caches sometimes result in strange errors. Thanks 🙏
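
For the cache-removal step, here is a small sketch. It assumes the default cache location; adjust the path if HF_DATASETS_CACHE points elsewhere in your environment.

import os
import shutil

# Remove the Hugging Face datasets cache so the JSON files are re-processed
# from scratch on the next run. HF_DATASETS_CACHE overrides the default path.
cache_dir = os.environ.get("HF_DATASETS_CACHE",
                           os.path.expanduser("~/.cache/huggingface/datasets"))
if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)
    print(f"removed {cache_dir}")
else:
    print(f"no cache found at {cache_dir}")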

ChaoChungWu-Johnson commented 1 year ago

Hi @research4pan, thanks for the reply.

whether the script runs successfully with the officially provided datasets under data?

I think not :(. I simply ran sh scripts/run_finetune.sh. I suppose it's running gpt2 on the alpaca dataset, right? The error code is the same:

(lmflow) root@myconsole:~/# sh scripts/run_finetune.sh 
[2023-04-11 15:05:26,630] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-11 15:05:26,639] [INFO] [runner.py:550:main] cmd = /opt/conda/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path gpt2 --dataset_path /workspace/sharing/johnsonwu/LMFlow/data/alpaca/train --output_dir /workspace/sharing/johnsonwu/LMFlow/output_models/finetune --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 2e-5 --block_size 512 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --bf16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-04-11 15:05:28,449] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-11 15:05:28,450] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-11 15:05:28,450] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-11 15:05:28,450] [INFO] [launch.py:162:main] dist_world_size=1
[2023-04-11 15:05:28,450] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-11 15:05:31,511] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/11/2023 15:05:31 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-dc30bcf62aafb961/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...
Downloading data files: 100%|██████████████████████████| 1/1 [00:00<00:00, 7121.06it/s]
Extracting data files: 100%|████████████████████████████| 1/1 [00:00<00:00, 229.79it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-dc30bcf62aafb961/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.
[2023-04-11 15:05:35,156] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 0.16B parameters
/opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
04/11/2023 15:05:35 - WARNING - datasets.fingerprint - Parameter 'function'=<function HFDecoderModel.tokenize.<locals>.tokenize_function at 0x7f5ac297ab80> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Running tokenizer on dataset:  48%|██▍  | 25000/52002 [00:06<00:06, 4003.97 examples/s][WARNING|tokenization_utils_base.py:3570] 2023-04-11 15:05:42,300 >> Token indices sequence length is longer than the specified maximum sequence length for this model (1490 > 1024). Running this sequence through the model will result in indexing errors
[WARNING|hf_decoder_model.py:282] 2023-04-11 15:05:42,301 >> ^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits before being passed to the model.
Installed CUDA version 11.6 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.6613805294036865 seconds
Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.2971947193145752 seconds
Parameter Offload: Total persistent parameters: 121344 in 98 params
[2023-04-11 15:06:44,537] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 6651
[2023-04-11 15:06:44,537] [ERROR] [launch.py:324:sigkill_handler] ['/opt/conda/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'gpt2', '--dataset_path', '/workspace/sharing/johnsonwu/LMFlow/data/alpaca/train', '--output_dir', '/workspace/sharing/johnsonwu/LMFlow/output_models/finetune', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -11

And I tried removing the cached datasets; the results are the same. Could it possibly be a problem with the data? I can't directly run download.sh in data/ (connection refused), but I saw there are already train_52002.json and test_252.json in data/alpaca....

ChaoChungWu-Johnson commented 1 year ago

And here is a sample of our data for your reference. It's basically very similar to alpaca's:

{
    "type": "text_only",
    "instances": [
        {"text": "Instruction: Generate a pun related to technology. Output: I'd tell you a joke about UDP, but I'm not sure if you'd get it."},
        {"text": "Instruction: Rewrite the following sentence to make it simpler for non-native English speakers: 'In view of the fact that he has not yet decided, we are unable to move forward.'. Input: In view of the fact that he has not yet decided, we are unable to move forward. Output: Since he hasn't made a decision, we can't move ahead."},
        {"text": "Instruction: Create a one-sentence product description for a new line of organic hand soap. Output: Our organic hand soap is specially formulated to gently cleanse and nourish your skin, using the power of nature."}
    ]
}
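
For reference, a quick way to catch formatting problems in such a file (invalid JSON, a wrong type field, or instances without a "text" string) is to load and check it directly. This is a minimal sketch; the file path below is only a placeholder.

import json

path = "data/martechQA/train/train.json"  # placeholder, point this at your own file

with open(path) as f:
    data = json.load(f)  # raises JSONDecodeError if the JSON itself is malformed

assert data.get("type") == "text_only", f"unexpected type: {data.get('type')}"
for i, instance in enumerate(data["instances"]):
    assert isinstance(instance.get("text"), str), f"instance {i} has no 'text' string"
print(f"OK: {len(data['instances'])} instances")
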
ChaoChungWu-Johnson commented 1 year ago

I found that it may indeed be a problem of insufficient CPU RAM. I changed the parameter offload device to none in the ZeRO-3 configuration, and the run seems to proceed with gpt2, but it still fails for 7B models on 8*V100 16GB with 624GB of CPU RAM. I have prepared around 624GB of CPU RAM for the officially provided script running gpt2, so why does full offload still fail?

I'm hoping to get your insight on how much CPU RAM is required for different model sizes (7B, 13B, 20B, 33B); it seems to be a difficult bottleneck.

ChaoChungWu-Johnson commented 1 year ago

Also, would you mind sharing the configuration of this one?

We've successfully run Llama 7B finetuning on an RTX 3090 GPU, on a server equipped with around 200GB of RAM

I tried to run Llama 7B finetuning on 8*V100 16GB with 624GB of CPU RAM, but it failed with the ZeRO-3 setting when setting offload to either none or cpu. Thank you very much!

snwen123 commented 1 year ago

You can use DeepSpeed to estimate the model's memory requirements. For the llama-7b model, ZeRO-2 requires more than 147G of CPU RAM and ZeRO-3 requires more than 166G. This may be the cause of the CPU RAM issues. The code is as follows:

from transformers import AutoModel
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModel.from_pretrained("model-name")

estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
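
DeepSpeed also ships a matching ZeRO-2 estimator, and both estimators accept the actual GPU topology, which changes the per-CPU and per-GPU numbers. A minimal sketch for the 8-GPU case discussed here; the module path is taken from recent DeepSpeed releases and may differ in older versions, and "model-name" is again a placeholder:

from transformers import AutoModel
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_live

# Estimate ZeRO-2 CPU/GPU memory needs for a single node with 8 GPUs.
model = AutoModel.from_pretrained("model-name")  # placeholder model id
estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)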

ChaoChungWu-Johnson commented 1 year ago

@snwen123 Yes, I know about the DeepSpeed estimation, and that is why I'm so confused: there is still a memory problem with my setting. In fact I have 624GB of CPU RAM with 8x 16GB V100 now, but I can't run llama-7B with run_finetune.sh (zero3). According to the DeepSpeed estimation for param offload = 'none' and optimizer offload = 'cpu' (and I set pin_memory = false, block_size = 8, batch size = 1 only), it should only take about 295 GB of CPU RAM and 14GB of GPU RAM per GPU. This still fails with error code = -7. I'm wondering if there is more configuration that needs to change in zero3? Here are my running script and config:

deepspeed "--master_port=11000"\
    examples/finetune.py \
    --deepspeed configs/ds_config_zero3_full.json \
    --use_lora 1 \
    --lora_r 8 \
    --fp16 \
    --run_name finetune_with_martechQA_lora \
    --model_name_or_path hf_models/llama-7b-hf \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --dataset_path /data/train \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --block_size 8 \
    --do_train \
    --output_dir output_models/finetune \
    --overwrite_output_dir \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 

zero3 config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": false
        },
        "offload_param": {
            "device": "none",
            "pin_memory": false
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Would you mind having a look at it? Thanks!

ChaoChungWu-Johnson commented 1 year ago

What I have also tried: the same error code (-7) occurs when changing from llama 7b to facebook/galactica-1.3b (which requires far fewer resources: CPU RAM: 58.79GB | GPU RAM: 3.14GB | offload_param=none, offload_optimizer=none, zero_init=0).

research4pan commented 1 year ago

Thanks for providing the detailed information!

If CPU RAM is the bottleneck, you may try ds_config_zero2.json to see if it solves the problem. Also, you may try a shorter --block_size, e.g. 512; this reduces the sequence length seen by the transformer and significantly reduces GPU memory consumption.

We've successfully finetuned llama-7b on an RTX 3090 (24G) with both ds_config_zero2.json and ds_config_zero3.json, so I conjecture the problem here is caused by offloading: since the V100 (16G) has less GPU memory, the offloading mechanism may behave unexpectedly in that scenario.

It is a bit strange that gpt2 finetuning failed with that setting. One possible cause is the dataset. Could you please try the latest main branch? It includes the updated data downloading service. After removing the Hugging Face dataset cache and reinstalling our package with pip install -e ., it should work.

If you encounter any further issues, please feel free to let us know. Thanks very much 🙏

ChaoChungWu-Johnson commented 1 year ago

@research4pan, ./scripts/run_finetune.sh still fails with error code = -7 (and sigkill) after I pulled the latest main branch, reinstalled the package, and removed the dataset cache. The only thing I modified in scripts/run_finetune.sh is removing --bf16, since I'm not using an Ampere GPU. And I think the alpaca data looks just fine:

[screenshot of the alpaca data file]

Any idea how to run gpt2 version successfully?

research4pan commented 1 year ago

Thanks for providing more details. You may try --fp16 to see if it works. If any further problem occurs, please feel free to let us know. Thanks 🙏

ChaoChungWu-Johnson commented 1 year ago

Yes I did; the above results were all produced with --fp16.

research4pan commented 1 year ago

That's a bit strange. I am wondering whether our official ./run_finetune.sh works on your server? If the example dataset works fine, then the issue may be related to the customized dataset format. Otherwise, could you provide the detailed settings of your server, such as OS version, nvcc version, CUDA driver version, CUDA version, torch version, etc., so we can check that for you? Thanks 🙏
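
If it helps, most of these details can be collected in one short Python session (a minimal sketch using only the standard library and torch):

import platform
import torch

print("OS          :", platform.platform())
print("Python      :", platform.python_version())
print("torch       :", torch.__version__)
print("torch CUDA  :", torch.version.cuda)  # CUDA version torch was built with
print("GPUs        :", torch.cuda.device_count(),
      torch.cuda.get_device_name(0) if torch.cuda.is_available() else "(none visible)")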

ChaoChungWu-Johnson commented 1 year ago

Hi @research4pan, these are the settings of my server. To reproduce the error, I ran git reset --hard and then sh ./scripts/run_finetune.sh:

1. OS version
    # cat /etc/os-release
    NAME="Ubuntu"
    VERSION="18.04.6 LTS (Bionic Beaver)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 18.04.6 LTS"
    VERSION_ID="18.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=bionic
    UBUNTU_CODENAME=bionic

2. nvcc / CUDA version

# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
3. torch version (from pip list)

    torch                    2.0.0
    transformers             4.28.0.dev0
4. RAM settings: CPU RAM 624 GB (623365828 kB), GPU: V100 16GB * 8

Hope the information is helpful for you... Thank you very much!

ChaoChungWu-Johnson commented 1 year ago

@research4pan, after several trials and observations it looks like a no-space-on-device error: self._semlock = _multiprocessing.SemLock( OSError: [Errno 28] No space left on device. But I can't directly change kern.posix.sem.max. Is there a known similar problem or a solution for it? Thanks

shizhediao commented 1 year ago

Hi, it seems that there is insufficient disk space. Could you try cleaning up the disk to free some space?

ChaoChungWu-Johnson commented 1 year ago

Hi @shizhediao, @research4pan, I've changed $tmpdir so that multiprocessing stores its SemLock elsewhere, which solved the no-space-on-device problem, but it fails as before with return code = -11:

[2023-04-19 14:50:13,001] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-19 14:50:13,010] [INFO] [runner.py:550:main] cmd = /opt/conda/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path gpt2 --dataset_path /workspace/sharing/LMFlow/data/alpaca/train --output_dir /workspace/sharing/LMFlow/output_models/finetune --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 2e-5 --block_size 512 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --fp16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-04-19 14:50:14,825] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.12.10-1
[2023-04-19 14:50:14,825] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.12.10-1
[2023-04-19 14:50:14,825] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-04-19 14:50:14,825] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.12.10-1+cuda11.6
[2023-04-19 14:50:14,825] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-04-19 14:50:14,825] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.12.10-1+cuda11.6
[2023-04-19 14:50:14,825] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.12.10-1
[2023-04-19 14:50:14,825] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-19 14:50:14,825] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-19 14:50:14,825] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-19 14:50:14,825] [INFO] [launch.py:162:main] dist_world_size=1
[2023-04-19 14:50:14,825] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-19 14:50:18,295] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/19/2023 14:50:18 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
Downloading and preparing dataset json/default to /workspace/sharing/tmp/json/default-dc30bcf62aafb961/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...
Dataset json downloaded and prepared to /workspace/sharing/tmp/json/default-dc30bcf62aafb961/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.
[2023-04-19 14:50:26,673] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 0.16B parameters
04/19/2023 14:50:27 - WARNING - datasets.fingerprint - Parameter 'function'=<function HFDecoderModel.tokenize.<locals>.tokenize_function at 0x7f9087ebf430> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Installed CUDA version 11.6 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
[1/3] /opt/conda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/include -isystem /opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/envs/lmflow/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/include -isystem /opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/envs/lmflow/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/opt/conda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -c /opt/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/opt/conda/lib64 -lcudart -o cpu_adam.so
Time to load cpu_adam op: 35.629372358322144 seconds
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/include/THC -isystem /opt/conda/envs/lmflow/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /opt/conda/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/opt/conda/envs/lmflow/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Time to load utils op: 17.983821630477905 seconds
Parameter Offload: Total persistent parameters: 121344 in 98 params
[2023-04-19 14:52:31,977] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 7276
[2023-04-19 14:52:31,977] [ERROR] [launch.py:324:sigkill_handler] ['/opt/conda/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'gpt2', '--dataset_path', '/workspace/sharing/LMFlow/data/alpaca/train', '--output_dir', '/workspace/sharing/LMFlow/output_models/finetune', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -11

Is it related to an OOM issue? But still no message is shown, and this is only the gpt2 training script....

shizhediao commented 1 year ago

Are there any messages in log/train.log and log/train.err?

ChaoChungWu-Johnson commented 1 year ago

Hi @shizhediao, the messages quoted above are from log/train.log, and no, there is no message in log/train.err.

devinzhang91 commented 1 year ago

@research4pan, after several trials and observations it looks like a no-space-on-device error: self._semlock = _multiprocessing.SemLock( OSError: [Errno 28] No space left on device. But I can't directly change kern.posix.sem.max. Is there a known similar problem or a solution for it? Thanks

Are you running inside Docker and using multiple cards? If so, you can try adding --shm-size=128g when starting the container. Reference: https://stackoverflow.com/questions/44664900/oserror-errno-28-no-space-left-on-device-docker-but-i-have-space
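
In Docker the /dev/shm mount defaults to 64 MB, which is a common cause of Errno 28 from multiprocessing semaphores. A quick check from inside the container (a minimal sketch, Linux only):

import shutil

# POSIX semaphores and shared memory are backed by /dev/shm on Linux;
# if this shows about 64 MiB you are on Docker's default and --shm-size should help.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total {total / 2**20:.0f} MiB, free {free / 2**20:.0f} MiB")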

shizhediao commented 1 year ago

This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed please feel free to reopen this issue. Thanks