OpenGVLab / LLaMA-Adapter

[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters
GNU General Public License v3.0

Errors thrown when finetuning #22

Open Petrichoeur opened 1 year ago

Petrichoeur commented 1 year ago

When I run on a single GPU, I get the following errors:

```
/usr/local/lib/python3.8/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
'/usr/local/lib/python3.8/dist-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE'
If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be
something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
| distributed init (rank 0): env://, gpu 0
[15:12:33.524915] job dir: /root/IAOps/LLaMA-Adapter/alpaca_finetuning_v1
[15:12:33.525095] Namespace(accum_iter=1, adapter_layer=30, adapter_len=10, batch_size=2, blr=0.009, data_path='/root/IAOps/vigogne/data/vigogne_data_cleaned.json', device='cuda', dist_backend='nccl', dist_on_itp=False, dist_url='env://', distributed=True, epochs=5, gpu=0, llama_model_path='/root/IAOps/llama/llama/', local_rank=-1, log_dir='./output_dir', lr=None, max_seq_len=512, min_lr=0.0, model='Llama7B_adapter', num_workers=10, output_dir='./checkpoint/', pin_mem=True, rank=0, resume='', seed=0, start_epoch=0, warmup_epochs=2, weight_decay=0.02, world_size=1)
[15:12:34.675550] <main.InstructionDataset object at 0x7f3ca4fdfc10>
[15:12:34.675645] <main.InstructionDataset object at 0x7f3d4859ed00>
[15:12:34.675740] Sampler_train = <torch.utils.data.distributed.DistributedSampler object at 0x7f3ca4fd4e50>
2023-04-28 15:12:34.678237: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 3577904) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 11, in <module>
    load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

```
finetuning.py FAILED

Failures:

---------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-28_15:12:39
  host      : dtk-design-102.mlops.iaas.cagip.group.gca
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 3577904)
  error_file:
  traceback : Signal 11 (SIGSEGV) received by PID 3577904
```

I really don't understand how it can fail. I tried different `--nproc_per_node` values, from 1 to 8, and even removed the argument, but I still get the same error. I'm more used to working with TensorFlow, so I don't really know where it fails, but I guess the code is supposed to work on multiple GPUs; does single-GPU training require some modification inside the code? I work on Ubuntu 20.04 with an A100 GPU. Thanks!
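For what it's worth, the `torchvision.io` undefined-symbol warning usually means torchvision was built against a different torch than the one installed (the traceback above was launched by `torch==1.13.1`). A minimal pure-Python sketch of such a check, assuming the commonly published release pairings (the table below is an illustration, not an exhaustive list):

```python
# Sketch: check whether installed torch/torchvision versions form a
# published compatible pair. The pairings here are assumptions drawn from
# the commonly cited torch <-> torchvision release table.
COMPATIBLE = {
    "1.12.1": "0.13.1",
    "1.13.1": "0.14.1",
    "2.0.0": "0.15.1",
}

def versions_match(torch_ver: str, vision_ver: str) -> bool:
    """Return True if the torchvision version pairs with the torch version."""
    base = torch_ver.split("+")[0]  # strip local tags like "+cu117"
    return COMPATIBLE.get(base) == vision_ver.split("+")[0]

print(versions_match("1.13.1+cu117", "0.14.1"))  # True  -> consistent pair
print(versions_match("1.13.1", "0.15.1"))        # False -> likely mismatch
```

In practice you would feed in `torch.__version__` and `torchvision.__version__`; a mismatch there would explain the warning, though not necessarily the SIGSEGV.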
aojunzz commented 1 year ago

hi, @Petrichoeur can you check the torch version (gpu or cpu)?

aristotaloss commented 1 year ago

Do you honestly think someone enjoys reading your error formatted as a title?

Petrichoeur commented 1 year ago

> hi, @Petrichoeur can you check the torch version (gpu or cpu)?

Version 2.0, and `torch.cuda.is_available()` returns True, so GPU it is.
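One quick sanity check here is the local build tag on the installed version string: the torchrun traceback above was launched by `torch==1.13.1`, while this reply says 2.0, so it's worth confirming which wheel the launching interpreter actually uses. A hypothetical helper, assuming the usual `+cuXYZ` / `+cpu` wheel-naming convention (a bare version string needs `torch.version.cuda` at runtime to be sure):

```python
def torch_build_flavor(version: str) -> str:
    """Classify a torch version string by its local build tag.

    Hypothetical helper: the '+cuXYZ' / '+cpu' suffixes follow the usual
    PyTorch wheel naming; a bare version needs a runtime check instead.
    """
    if "+cu" in version:
        return "cuda"
    if "+cpu" in version:
        return "cpu"
    return "unknown (check torch.version.cuda at runtime)"

print(torch_build_flavor("2.0.0+cu117"))  # cuda
print(torch_build_flavor("1.13.1+cpu"))   # cpu
```

You would pass in `torch.__version__` from the same interpreter that `torchrun` resolves to.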

Petrichoeur commented 1 year ago

> Do you honestly think someone enjoys reading your error formatted as a title?

I just copy/pasted the error; the formatting is surely due to a stray # in the error text. But hey, what a good developer you are, being petty instead of helping. That's exactly the right mindset for open source, right?