[Open] Petrichoeur opened this issue 1 year ago
hi, @Petrichoeur can you check the torch version (gpu or cpu)?
Do you honestly think someone enjoys reading your error formatted as a title?
> hi, @Petrichoeur can you check the torch version (gpu or cpu)?
Version 2.0, and `torch.cuda.is_available()` returns `True`, so GPU it is.
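Not a fix, but worth checking: the log below shows `torchrun` resolving `torch==1.13.1` via its entry point, not 2.0, which suggests two torch installs in the same environment. A minimal sketch for comparing which `torch` the interpreter imports against which `torchrun` is on the PATH (the helper name `locate` is just for illustration):

```python
import importlib.util
import shutil
import sys

# Resolve where a package would be imported from, without importing it.
def locate(name):
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None

print("python   :", sys.executable)
print("torch    :", locate("torch"))        # None means torch is not importable here
print("torchrun :", shutil.which("torchrun"))
```

If the `torch` path and the `torchrun` script live under different site-packages trees (or `torchrun --version`-style inspection shows 1.13.1 while `torch.__version__` says 2.0), the segfault is plausibly an ABI mismatch between the two installs.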
> Do you honestly think someone enjoys reading your error formatted as a title?
I just copy/pasted the error; the formatting is surely due to a stray `#` in the error output. But hey, what a good developer you are, being petty instead of helping. Exactly the right mindset for open source, right?
When I run on a single GPU I get the following errors:
```
/usr/local/lib/python3.8/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.8/dist-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
| distributed init (rank 0): env://, gpu 0
[15:12:33.524915] job dir: /root/IAOps/LLaMA-Adapter/alpaca_finetuning_v1
[15:12:33.525095] Namespace(accum_iter=1, adapter_layer=30, adapter_len=10, batch_size=2, blr=0.009, data_path='/root/IAOps/vigogne/data/vigogne_data_cleaned.json', device='cuda', dist_backend='nccl', dist_on_itp=False, dist_url='env://', distributed=True, epochs=5, gpu=0, llama_model_path='/root/IAOps/llama/llama/', local_rank=-1, log_dir='./output_dir', lr=None, max_seq_len=512, min_lr=0.0, model='Llama7B_adapter', num_workers=10, output_dir='./checkpoint/', pin_mem=True, rank=0, resume='', seed=0, start_epoch=0, warmup_epochs=2, weight_decay=0.02, world_size=1)
[15:12:34.675550] <main.InstructionDataset object at 0x7f3ca4fdfc10>
[15:12:34.675645] <main.InstructionDataset object at 0x7f3d4859ed00>
[15:12:34.675740] Sampler_train = <torch.utils.data.distributed.DistributedSampler object at 0x7f3ca4fd4e50>
2023-04-28 15:12:34.678237: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 3577904) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 11, in <module>
    load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetuning.py FAILED
Failures:
```
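For what it's worth, the `exitcode: -11` in the elastic log is the standard Python convention for a child process killed by signal 11 (SIGSEGV, a segfault in native code), not a Python-level exception. A minimal sketch of that convention, assuming a Linux host:

```python
import signal
import subprocess
import sys

# A negative returncode from subprocess (and torch.distributed's worker
# monitoring) means the child was killed by that signal number.
# Here we deliberately segfault a child to show the convention.
proc = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
)
print(proc.returncode)                         # -11 on Linux
print(signal.Signals(-proc.returncode).name)   # SIGSEGV
```

A segfault at rank startup like this usually points at mismatched native extensions (note the torchvision `undefined symbol` warning and the `torch==1.13.1` entry point above) rather than at the training script itself.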