OpenGVLab / LLaMA-Adapter

[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters

Errors thrown when finetuning #22

Open Petrichoeur opened 1 year ago

Petrichoeur commented 1 year ago

When I run on a single GPU I get the following errors:

/usr/local/lib/python3.8/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.8/dist-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
| distributed init (rank 0): env://, gpu 0
[15:12:33.524915] job dir: /root/IAOps/LLaMA-Adapter/alpaca_finetuning_v1
[15:12:33.525095] Namespace(accum_iter=1, adapter_layer=30, adapter_len=10, batch_size=2, blr=0.009, data_path='/root/IAOps/vigogne/data/vigogne_data_cleaned.json', device='cuda', dist_backend='nccl', dist_on_itp=False, dist_url='env://', distributed=True, epochs=5, gpu=0, llama_model_path='/root/IAOps/llama/llama/', local_rank=-1, log_dir='./output_dir', lr=None, max_seq_len=512, min_lr=0.0, model='Llama7B_adapter', num_workers=10, output_dir='./checkpoint/', pin_mem=True, rank=0, resume='', seed=0, start_epoch=0, warmup_epochs=2, weight_decay=0.02, world_size=1)
[15:12:34.675550] <__main__.InstructionDataset object at 0x7f3ca4fdfc10>
[15:12:34.675645] <__main__.InstructionDataset object at 0x7f3d4859ed00>
[15:12:34.675740] Sampler_train = <torch.utils.data.distributed.DistributedSampler object at 0x7f3ca4fd4e50>
2023-04-28 15:12:34.678237: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 3577904) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 11, in <module>
    load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetuning.py FAILED

Failures:

---------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-04-28_15:12:39
  host       : dtk-design-102.mlops.iaas.cagip.group.gca
  rank       : 0 (local_rank: 0)
  exitcode   : -11 (pid: 3577904)
  error_file :
  traceback  : Signal 11 (SIGSEGV) received by PID 3577904

I really don't understand how it can fail. I used different --nproc_per_node values, from 1 to 8, and even removed the argument, but I still get the same error. I'm more used to working with TensorFlow, so I don't really know where it failed, but I guess the code is meant to run on multiple GPUs and single-GPU training requires some modification inside the code? I work on Ubuntu 20.04 with an A100 GPU. Thanks!
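
For reference, the single-GPU run looked roughly like this, reconstructed from the Namespace dump above (the exact command may have differed slightly):

# single-GPU launch; all argument values taken from the logged Namespace
torchrun --nproc_per_node 1 finetuning.py \
    --model Llama7B_adapter \
    --llama_model_path /root/IAOps/llama/llama/ \
    --data_path /root/IAOps/vigogne/data/vigogne_data_cleaned.json \
    --adapter_layer 30 --adapter_len 10 --max_seq_len 512 \
    --batch_size 2 --epochs 5 --warmup_epochs 2 \
    --blr 0.009 --weight_decay 0.02 \
    --output_dir ./checkpoint/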
aojunzz commented 1 year ago

hi, @Petrichoeur can you check the torch version (gpu or cpu)?
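
For example, a quick check from the shell (a minimal sketch; torch.version.cuda is None on a CPU-only build):

# prints the torch version, the CUDA version it was built against, and whether a GPU is visible
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"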

aristotaloss commented 1 year ago

Do you honestly think someone enjoys reading your error formatted as a title?

Petrichoeur commented 1 year ago

hi, @Petrichoeur can you check the torch version (gpu or cpu)?

Version 2.0, and torch.cuda.is_available() returns True, so GPU it is.

Petrichoeur commented 1 year ago

Do you honestly think someone enjoys reading your error formatted as a title?

I just copy/pasted the error; the formatting is surely due to a stray # in the error output. But hey, what a good developer you are, being petty instead of helping. That's exactly the right mindset for open source, right?