2U1 / Llama3.2-Vision-Finetune

An open-source implementation for fine-tuning Meta's Llama3.2-Vision series.
Apache License 2.0

cannot create conda environment #9

Closed · rothfels closed this 3 weeks ago

rothfels commented 3 weeks ago

I tried running the full fine-tuning script on an 8xH100 from Lambda Labs, but it dies with a segfault (return code -11):

(venv) ubuntu@192-222-54-194:~/research$ bash scripts/finetune.sh
[2024-10-21 18:03:40,963] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:03:42,330] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-10-21 18:03:42,330] [INFO] [runner.py:607:main] cmd = /home/ubuntu/research/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None src/training/train.py --deepspeed scripts/zero3_offload.json --model_id meta-llama/Llama-3.2-11B-Vision-Instruct --data_path ./datasets/data.json --image_folder ./images --disable_flash_attn2 True --lora_enable False --tune_img_projector False --freeze_vision_tower True --freeze_llm True --bf16 True --output_dir output/test --num_train_epochs 1 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --learning_rate 1e-5 --projector_lr 1e-5 --vision_lr 2e-6 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --gradient_checkpointing True --report_to tensorboard --lazy_preprocess True --save_strategy steps --save_steps 200 --save_total_limit 10 --dataloader_num_workers 1
[2024-10-21 18:03:43,980] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:03:45,372] [INFO] [launch.py:139:main] 0 NCCL_IB_DISABLE=1
[2024-10-21 18:03:45,372] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-10-21 18:03:45,372] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-10-21 18:03:45,372] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-10-21 18:03:45,372] [INFO] [launch.py:164:main] dist_world_size=8
[2024-10-21 18:03:45,372] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-10-21 18:03:45,373] [INFO] [launch.py:256:main] process 2884481 spawned with command: ['/home/ubuntu/research/venv/bin/python', '-u', 'src/training/train.py', '--local_rank=0', '--deepspeed', 'scripts/zero3_offload.json', '--model_id', 'meta-llama/Llama-3.2-11B-Vision-Instruct', '--data_path', './datasets/data.json', '--image_folder', './images', '--disable_flash_attn2', 'True', '--lora_enable', 'False', '--tune_img_projector', 'False', '--freeze_vision_tower', 'True', '--freeze_llm', 'True', '--bf16', 'True', '--output_dir', 'output/test', '--num_train_epochs', '1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '1', '--learning_rate', '1e-5', '--projector_lr', '1e-5', '--vision_lr', '2e-6', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--lazy_preprocess', 'True', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '10', '--dataloader_num_workers', '1']
[... identical launch commands for processes 2884482 through 2884488 (--local_rank=1 through --local_rank=7) ...]
[2024-10-21 18:03:52,351] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:03:52,995] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-21 18:03:53,192] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:03:53,274] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:03:53,286] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:03:53,363] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:03:53,371] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:03:53,383] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:03:53,388] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-21 18:03:53,930] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-21 18:03:53,931] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-10-21 18:03:53,965] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-21 18:03:54,028] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-21 18:03:54,069] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-21 18:03:54,086] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-21 18:03:54,135] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-10-21 18:03:54,168] [INFO] [comm.py:652:init_distributed] cdb=None
Loading checkpoint shards: 100%|██████████| 5/5 [00:00<00:00,  8.66it/s]
[... the same 'Loading checkpoint shards' progress bar from the remaining seven ranks ...]
Fatal Python error: Segmentation fault

Thread 0x00007fac165fc640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fafb98a81c0 (most recent call first):
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2421 in broadcast
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83 in wrapper
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 200 in broadcast
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632 in _fn
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224 in broadcast
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117 in log_wrapper
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1120 in _broadcast_model
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1201 in _configure_distributed_model
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 269 in __init__
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/deepspeed/__init__.py", line 193 in initialize
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1851 in _prepare_deepspeed
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1344 in prepare
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2207 in _inner_training_loop
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2052 in train
  File "/home/ubuntu/research/src/training/train.py", line 188 in train
  File "/home/ubuntu/research/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355 in wrapper
  File "/home/ubuntu/research/src/training/train.py", line 213 in <module>

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, markupsafe._speedups, PIL._imaging, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.enum, av.error, av.utils, av.option, av.descriptor, av.container.pyio, av.dictionary, av.format, av.stream, av.container.streams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.pad, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, propcache._helpers_c, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, ujson, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message (total: 137)
Fatal Python error: Segmentation fault

[...]

[2024-10-21 18:04:15,410] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2884481
[2024-10-21 18:04:16,660] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2884482
[2024-10-21 18:04:19,682] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2884483
[2024-10-21 18:04:19,699] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2884484
[2024-10-21 18:04:21,704] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2884485
[2024-10-21 18:04:21,722] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2884486
[2024-10-21 18:04:21,738] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2884487
[2024-10-21 18:04:21,753] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 2884488
[2024-10-21 18:04:21,753] [ERROR] [launch.py:325:sigkill_handler] ['/home/ubuntu/research/venv/bin/python', '-u', 'src/training/train.py', '--local_rank=7', '--deepspeed', 'scripts/zero3_offload.json', '--model_id', 'meta-llama/Llama-3.2-11B-Vision-Instruct', '--data_path', './datasets/data.json', '--image_folder', './images', '--disable_flash_attn2', 'True', '--lora_enable', 'False', '--tune_img_projector', 'False', '--freeze_vision_tower', 'True', '--freeze_llm', 'True', '--bf16', 'True', '--output_dir', 'output/test', '--num_train_epochs', '1', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '1', '--learning_rate', '1e-5', '--projector_lr', '1e-5', '--vision_lr', '2e-6', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--lazy_preprocess', 'True', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '10', '--dataloader_num_workers', '1'] exits with return code = -11

I cannot reproduce the failure running the same script on a 1xH100.

I'm able to reproduce the same segfault running the Phi3-V fine-tuning from https://github.com/2U1/Phi3-Vision-Finetune, but again only on an 8xH100 machine (no error on a 1x).
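
In case it helps narrow things down, here's the diagnostic sketch I'd run next (plain NCCL/CUDA env vars, nothing specific to this repo):

# Rerun with NCCL debug logging to see where the broadcast dies
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL bash scripts/finetune.sh 2>&1 | tee nccl_debug.log

# Force synchronous CUDA launches so the faulting call is the one actually reported
CUDA_LAUNCH_BLOCKING=1 bash scripts/finetune.sh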

rothfels commented 3 weeks ago

I was able to resolve this issue for the Phi3-Vision-Finetune repo by setting up a conda environment from environment.yaml instead of a venv from requirements.txt. That tells me the segfault is coming from some bad combination of torch/CUDA/DeepSpeed versions.
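
As a rough way to see which package combination differs between the two setups, this comparison sketch (my own, not from either repo) is what I'd suggest:

# Dump and diff the package sets of the broken venv vs. the working conda env
pip freeze | sort > venv_versions.txt     # run inside the broken venv
pip freeze | sort > conda_versions.txt    # run inside the working conda env
diff venv_versions.txt conda_versions.txt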

For this repo, you've included an environment.yaml, but when I run conda env create -f environment.yaml I get the following error:

Pip subprocess error:
ERROR: Ignored the following versions that require a different python version: 0.36.0 Requires-Python >=3.6,<3.10; 0.37.0 Requires-Python >=3.7,<3.10; 0.52.0 Requires-Python >=3.6,<3.9; 0.52.0rc3 Requires-Python >=3.6,<3.9; 0.53.0 Requires-Python >=3.6,<3.10; 0.53.0rc1.post1 Requires-Python >=3.6,<3.10; 0.53.0rc2 Requires-Python >=3.6,<3.10; 0.53.0rc3 Requires-Python >=3.6,<3.10; 0.53.1 Requires-Python >=3.6,<3.10; 0.54.0 Requires-Python >=3.7,<3.10; 0.54.0rc2 Requires-Python >=3.7,<3.10; 0.54.0rc3 Requires-Python >=3.7,<3.10; 0.54.1 Requires-Python >=3.7,<3.10
ERROR: Could not find a version that satisfies the requirement onnxruntime-genai-cuda==0.3.0 (from versions: 0.4.0)
ERROR: No matching distribution found for onnxruntime-genai-cuda==0.3.0

failed

CondaEnvException: Pip failed
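
One possible workaround, based on the versions pip reports above (I haven't verified that the 0.4.0 API is compatible):

# Check what pip can actually find for the failing package (needs a recent pip)
pip index versions onnxruntime-genai-cuda

# Then relax the pin in environment.yaml accordingly, e.g.
#   onnxruntime-genai-cuda==0.3.0  ->  onnxruntime-genai-cuda==0.4.0
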
2U1 commented 3 weeks ago

@rothfels Thanks for the update. I think my env was mixed up when I was testing, and that's what got exported to the yaml. I really appreciate your help with setting up the env. I'll merge the PR.

rothfels commented 3 weeks ago

@2U1 no problem.

In addition to those changes, the conda environment can't initialize on Ubuntu without a few more things:

- rustc (via rustup)
- pkg-config
- libssl-dev

I'm not sure about the second two, but the first is coming from the mistralrs-cuda dep. (Tbh I'm not even sure what that's for. Can it be removed?)

Either way, here's the rest of what I needed to do to set up Ubuntu, in case you want to mention it in the README:

# Install rustc
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

# Verify rust installation
rustc --version
cargo --version

# Install pkg-config and openssl
sudo apt update
sudo apt install -y libssl-dev pkg-config

# Verify openssl installation
pkg-config --modversion openssl

2U1 commented 3 weeks ago

Sorry for the issue; it can be removed. It was for serving, not for training. I'll clean up the env file a bit more.
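
If anyone hits this before I push the fix, deleting the line from environment.yaml should be enough; a one-liner sketch (assuming the dep appears as a single pip entry in the yaml):

# Drop the serving-only dependency from the env file
sed -i '/mistralrs-cuda/d' environment.yaml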

2U1 commented 3 weeks ago

@rothfels I think it should work now. The env file was a messed-up version. I didn't realize it because the repos I made from it were fine. Thanks for letting me know.
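
If you want to double-check on a fresh machine, something like this should do it (the env name comes from whatever environment.yaml defines; the one below is an assumption):

# Recreate the environment from the fixed yaml
conda env remove -n llama3.2-vision     # clear any half-built env first; name is an assumption
conda env create -f environment.yaml
conda activate llama3.2-vision
bash scripts/finetune.sh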

rothfels commented 3 weeks ago

Thanks for fixing!

2U1 commented 3 weeks ago

I'll close this issue now that the wrong environment yaml has been resolved.