chris-hatton opened this issue 1 year ago
When I try to run the `convert.py` script I get:
File "vanilla-llama/llama/model.py", line 99, in __init__
).cuda()
File "vanilla-llama/lib/python3.10/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
I tried getting past the OP issue (`RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'`) by passing `device_map=None`.
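For anyone else hitting that first error: it generally just means float16 weights hit a CPU matmul, which PyTorch (at least on the versions current at the time) doesn't implement in half precision. A minimal sketch that reproduces and sidesteps it, using a toy layer rather than the actual vanilla-llama code:

```python
import torch
import torch.nn as nn

# float16 weights, as in the converted LLaMA checkpoint
layer = nn.Linear(4, 4).half()
x = torch.randn(1, 4, dtype=torch.float16)

# layer(x)  # on CPU (older PyTorch) this raises:
#           #   RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

y = layer.float()(x.float())  # casting model and inputs up to float32 works on CPU
```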
I found and fixed this issue along the way: https://github.com/huggingface/accelerate/pull/1297
The `accelerate` docs say the MPS backend will be used by default when available (apart from the caveat that their device-map code doesn't support it yet), so I was hoping that would happen when passing `device_map=None`.
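For reference, the load path in question looks roughly like this; a sketch of the usual accelerate pattern, not the exact vanilla-llama code (`build_llama_model` and `checkpoint_path` are placeholders), with the actual entry point apparently being `load_checkpoint_and_dispatch` as mentioned further down:

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

# Parameters created in this context live on the "meta" device: shapes only, no storage.
with init_empty_weights():
    model = build_llama_model()  # placeholder for however vanilla-llama builds the module

# device_map="auto" lets accelerate decide placement itself;
# device_map=None skips accelerate's dispatch step entirely.
model = load_checkpoint_and_dispatch(model, checkpoint_path, device_map=None)
```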
But then I get this puzzling error:
```
  File "vanilla-llama/llama/generation.py", line 121, in sample_top_p
    next_token = torch.multinomial(probs_sort, num_samples=1)
NotImplementedError: Could not run 'aten::multinomial' with arguments from the 'Meta' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::multinomial' is only available for these backends: [CPU, MPS, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
```
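For context, `sample_top_p` is the standard nucleus-sampling helper; roughly this (reconstructed from the traceback and the usual top-p pattern, so the repo's code may differ slightly):

```python
import torch

def sample_top_p(probs: torch.Tensor, p: float) -> torch.Tensor:
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose cumulative
    probability exceeds p, renormalise, then draw one token per row."""
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    probs_sort[probs_sum - probs_sort > p] = 0.0
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))
    next_token = torch.multinomial(probs_sort, num_samples=1)  # the call that hits the 'Meta' backend
    return torch.gather(probs_idx, -1, next_token)
```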
I couldn't find much by Googling for this, not helped by the fact that PyTorch itself seems to be part of Meta (Facebook) now. There is this: https://pytorch.org/docs/stable/generated/torch.Tensor.is_meta.html and https://pytorch.org/torchdistx/latest/fake_tensor.html
> Fake tensors, similar to meta tensors, carry no data; however, unlike meta tensors which report `meta` as their device, fake tensors act as if they were allocated on a real device.
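A meta tensor is easy to reproduce in isolation, which at least confirms what the error is complaining about (minimal sketch):

```python
import torch

x = torch.empty(4, 4, device="meta")   # carries shape/dtype only, no storage
print(x.is_meta, x.device)             # -> True meta
# torch.multinomial(x, num_samples=1)  # would fail: no 'Meta' kernel for aten::multinomial
```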
Not sure why we are on the `meta` device in this line of code?
It sounds like maybe `device_map=None` has left us without a device.
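One way to see how that can happen, in isolation (a sketch using a toy module rather than the LLaMA model itself):

```python
import torch.nn as nn
from accelerate import init_empty_weights

# Modules built under init_empty_weights() keep their parameters on the "meta"
# device until a checkpoint load / dispatch moves them somewhere real.
with init_empty_weights():
    toy = nn.Linear(8, 8)

print({p.device for p in toy.parameters()})  # -> {device(type='meta')}
```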
Adding an explicit device map in `LLaMAInference`, like:
device = torch.device("cpu")
if torch.has_cuda:
device = torch.device("cuda")
elif torch.has_mps:
device = torch.device('mps')
if device_map is None:
modules = (
"transformer",
"tok_embeddings",
"layers",
"norm",
"output",
)
device_map = {module: device for module in modules}
...this gets further!
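(Side note for anyone copying that snippet on a newer PyTorch: `torch.has_cuda` / `torch.has_mps` are the older attribute-style checks; the function-style equivalents look like this:)

```python
import torch

# Pick whichever accelerator is actually available, otherwise fall back to CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
```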
Now I get this error:
vanilla-llama/llama/model.py", line 62, in apply_rotary_emb
xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
NotImplementedError: The operator 'aten::view_as_complex' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
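For anyone wanting to push past this one: since MPS has no complex dtype, one possible workaround (a sketch, not the repo's code, and assuming the `freqs_cis` table has been pre-split into matching `cos`/`sin` tensors) is to do the rotation in real arithmetic so `view_as_complex` never runs:

```python
import torch

def apply_rotary_emb_real(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by (cos, sin), which is what
    view_as_complex + complex multiply + view_as_real does, without ever creating a
    complex tensor. cos/sin are assumed to broadcast against x[..., 0::2]."""
    x_re = x.float()[..., 0::2]   # "real" half of each pair
    x_im = x.float()[..., 1::2]   # "imaginary" half of each pair
    out = torch.stack((x_re * cos - x_im * sin,
                       x_re * sin + x_im * cos), dim=-1)
    return out.flatten(-2).type_as(x)   # interleave the pairs back into the original layout
```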
Finally, I tried with the suggested `PYTORCH_ENABLE_MPS_FALLBACK=1` flag and got:
```
vanilla-llama/llama/model.py:63: UserWarning: The operator aten::view_as_complex appears to be a view operator, but it has no implementation for the backend "mps:0". View operators don't support falling back to run on the CPU, since the tensor's storage cannot be shared across devices. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/CPUFallback.cpp:181.)
  xk = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
libc++abi: terminating due to uncaught exception of type c10::TypeError: Trying to convert ComplexFloat to the MPS backend but it does not have support for that dtype.
Exception raised from getMPSScalarType at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:91 (most recent call first):
frame #0: at::native::mps::getMPSScalarType(c10::ScalarType) + 180 (0x134721278 in libtorch_cpu.dylib)
```
...which starts to look like running on the MPS device is a dead end.
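(One note in case anyone retries the fallback route: the flag is read on the C++ side, so the safest approach is to export it in the shell before launching, or set it in Python before `torch` is imported; a minimal sketch:)

```python
import os

# Set before importing torch to be safe; exporting it in the shell works too.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # noqa: E402  (imported deliberately after setting the variable)
```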
Some great sleuthing there @anentropic 👌 That last error is mentioned in this thread.
Others have got this working, it seems:
The LLaMA.cpp project enables LLaMA inference on Apple Silicon devices by using the CPU, but faster inference should be possible by supporting the M1/Pro/Max GPU on vanilla-llama, given that PyTorch is now M1 compatible using the 'mps' device.

I'm new to Python, but my observations:
- In both `generation.py` and `model.py` there are uses of the function `.cuda()`, which can be replaced with a device-agnostic `.to(device)` call (see the sketch after this list).
- When attempting to run `example.py` after this, it's the Accelerate framework which throws an error: `RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'` - something is trying to use `cpu` instead of `mps`.
- I wonder if this is because the call into accelerate is `load_checkpoint_and_dispatch` with `auto` provided as the device map - is PyTorch preferring `cpu` over `mps` here for some reason? Edit: This
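A sketch of the `.cuda()` replacement mentioned in the first point, assuming a `device` chosen the same way as in the snippet earlier in the thread (the tensor here is a placeholder, not the actual vanilla-llama code):

```python
import torch

# Pick whichever accelerator is actually present instead of hard-coding CUDA.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

tokens = torch.zeros(1, 8, dtype=torch.long)  # placeholder standing in for the real inputs

# Instead of:  tokens = tokens.cuda()
tokens = tokens.to(device)
```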