Vahe1994 / AQLM

Official PyTorch repository for "Extreme Compression of Large Language Models via Additive Quantization" (https://arxiv.org/pdf/2401.06118.pdf) and "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression" (https://arxiv.org/abs/2405.14852)
Apache License 2.0

aqlm/inference_kernels/cuda_kernel.cu compilation errors #61

Closed · amrothemich closed this issue 5 months ago

amrothemich commented 5 months ago

Hi! I'm hitting the following error on the forward pass while prompt tuning an AQLM model (it only happens when the base model is an AQLM one). I'm using https://huggingface.co/BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch/tree/main.
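For context, my setup is roughly the following (a simplified sketch rather than my exact script; the prompt-tuning settings are just illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, TaskType, get_peft_model

model_id = "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model.gradient_checkpointing_enable()

# Wrap the quantized base model with a prompt-tuning adapter (illustrative config)
peft_config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=16)
model = get_peft_model(model, peft_config)

# A single training-style forward pass; this is where the error below is raised
batch = tokenizer("Hello, world", return_tensors="pt").to("cuda")
with torch.cuda.amp.autocast():
    outputs = model(**batch)
```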


CalledProcessError Traceback (most recent call last) File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2096, in _run_ninja_build(build_directory, verbose, error_prefix) 2095 stdout_fileno = 1 -> 2096 subprocess.run( 2097 command, 2098 stdout=stdout_fileno if verbose else subprocess.PIPE, 2099 stderr=subprocess.STDOUT, 2100 cwd=build_directory, 2101 check=True, 2102 env=env) 2103 except subprocess.CalledProcessError as e: 2104 # Python 2 and 3 compatible way of getting the error object.

File /usr/lib/python3.10/subprocess.py:526, in run(input, capture_output, timeout, check, *popenargs, **kwargs) 525 if check and retcode: --> 526 raise CalledProcessError(retcode, process.args, 527 output=stdout, stderr=stderr) 528 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last) File , line 41 39 batch = {k: v.to(device) for k, v in batch.items()} 40 with torch.cuda.amp.autocast(): ---> 41 outputs = model(**batch) 42 loss = outputs.loss 44 loss = loss / gradient_accumulation_steps

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs) 1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] 1510 else: -> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs) 1515 # If we don't have any hooks, we want to skip the rest of the logic in 1516 # this function, and just call forward. 1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1518 or _global_backward_pre_hooks or _global_backward_hooks 1519 or _global_forward_hooks or _global_forward_pre_hooks): -> 1520 return forward_call(*args, **kwargs) 1522 try: 1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/peft/peft_model.py:1295, in PeftModelForCausalLM.forward(self, input_ids, attention_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, task_ids, **kwargs) 1293 prompts = prompts.to(inputs_embeds.dtype) 1294 inputs_embeds = torch.cat((prompts, inputs_embeds), dim=1) -> 1295 return self.base_model(inputs_embeds=inputs_embeds, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs) 1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] 1510 else: -> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs) 1515 # If we don't have any hooks, we want to skip the rest of the logic in 1516 # this function, and just call forward. 1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1518 or _global_backward_pre_hooks or _global_backward_hooks 1519 or _global_forward_hooks or _global_forward_pre_hooks): -> 1520 return forward_call(*args, **kwargs) 1522 try: 1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:1360, in MixtralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict) 1357 return_dict = return_dict if return_dict is not None else self.config.use_return_dict 1359 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) -> 1360 outputs = self.model( 1361 input_ids=input_ids, 1362 attention_mask=attention_mask, 1363 position_ids=position_ids, 1364 past_key_values=past_key_values, 1365 inputs_embeds=inputs_embeds, 1366 use_cache=use_cache, 1367 output_attentions=output_attentions, 1368 output_hidden_states=output_hidden_states, 1369 output_router_logits=output_router_logits, 1370 return_dict=return_dict, 1371 ) 1373 hidden_states = outputs[0] 1374 logits = self.lm_head(hidden_states)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs) 1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] 1510 else: -> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs) 1515 # If we don't have any hooks, we want to skip the rest of the logic in 1516 # this function, and just call forward. 1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1518 or _global_backward_pre_hooks or _global_backward_hooks 1519 or _global_forward_hooks or _global_forward_pre_hooks): -> 1520 return forward_call(*args, **kwargs) 1522 try: 1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:1217, in MixtralModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict) 1214 all_hidden_states += (hidden_states,) 1216 if self.gradient_checkpointing and self.training: -> 1217 layer_outputs = self._gradient_checkpointing_func( 1218 decoder_layer.__call__, 1219 hidden_states, 1220 attention_mask, 1221 position_ids, 1222 past_key_values, 1223 output_attentions, 1224 output_router_logits, 1225 use_cache, 1226 ) 1227 else: 1228 layer_outputs = decoder_layer( 1229 hidden_states, 1230 attention_mask=attention_mask, (...) 1235 use_cache=use_cache, 1236 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/_compile.py:24, in _disable_dynamo.<locals>.inner(*args, **kwargs) 20 @functools.wraps(fn) 21 def inner(*args, **kwargs): 22 import torch._dynamo ---> 24 return torch._dynamo.disable(fn, recursive)(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:489, in _TorchDynamoContext.__call__.<locals>._fn(*args, **kwargs) 487 dynamo_config_ctx.__enter__() 488 try: --> 489 return fn(*args, **kwargs) 490 finally: 491 set_eval_frame(prior)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/_dynamo/external_utils.py:17, in wrap_inline.<locals>.inner(*args, **kwargs) 15 @functools.wraps(fn) 16 def inner(*args, **kwargs): ---> 17 return fn(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/checkpoint.py:482, in checkpoint(function, use_reentrant, context_fn, determinism_check, debug, *args, **kwargs) 477 if context_fn is not noop_context_fn or debug is not False: 478 raise ValueError( 479 "Passing context_fn or debug is only supported when " 480 "use_reentrant=False." 481 ) --> 482 return CheckpointFunction.apply(function, preserve, *args) 483 else: 484 gen = _checkpoint_without_reentrant_generator( 485 function, preserve, context_fn, determinism_check, debug, *args, **kwargs 486 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/autograd/function.py:553, in Function.apply(cls, *args, **kwargs) 550 if not torch._C._are_functorch_transforms_active(): 551 # See NOTE: [functorch vjp and autograd interaction] 552 args = _functorch.utils.unwrap_dead_wrappers(args) --> 553 return super().apply(*args, **kwargs) # type: ignore[misc] 555 if not is_setup_ctx_defined: 556 raise RuntimeError( 557 "In order to use an autograd.Function with functorch transforms " 558 "(vmap, grad, jvp, jacrev, ...), it must override the setup_context " 559 "staticmethod. For more details, please see " 560 "https://pytorch.org/docs/master/notes/extending.func.html" 561 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/checkpoint.py:261, in CheckpointFunction.forward(ctx, run_function, preserve_rng_state, *args) 258 ctx.save_for_backward(*tensor_inputs) 260 with torch.no_grad(): --> 261 outputs = run_function(*args) 262 return outputs

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs) 1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] 1510 else: -> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs) 1515 # If we don't have any hooks, we want to skip the rest of the logic in 1516 # this function, and just call forward. 1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1518 or _global_backward_pre_hooks or _global_backward_hooks 1519 or _global_forward_hooks or _global_forward_pre_hooks): -> 1520 return forward_call(*args, **kwargs) 1522 try: 1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:934, in MixtralDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, output_router_logits, use_cache, **kwargs) 931 hidden_states = self.input_layernorm(hidden_states) 933 # Self Attention --> 934 hidden_states, self_attn_weights, present_key_value = self.self_attn( 935 hidden_states=hidden_states, 936 attention_mask=attention_mask, 937 position_ids=position_ids, 938 past_key_value=past_key_value, 939 output_attentions=output_attentions, 940 use_cache=use_cache, 941 ) 942 hidden_states = residual + hidden_states 944 # Fully Connected

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs) 1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] 1510 else: -> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs) 1515 # If we don't have any hooks, we want to skip the rest of the logic in 1516 # this function, and just call forward. 1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1518 or _global_backward_pre_hooks or _global_backward_hooks 1519 or _global_forward_hooks or _global_forward_pre_hooks): -> 1520 return forward_call(*args, **kwargs) 1522 try: 1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:730, in MixtralSdpaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache) 719 return super().forward( 720 hidden_states=hidden_states, 721 attention_mask=attention_mask, (...) 725 use_cache=use_cache, 726 ) 728 bsz, q_len, _ = hidden_states.size() --> 730 query_states = self.q_proj(hidden_states) 731 key_states = self.k_proj(hidden_states) 732 value_states = self.v_proj(hidden_states)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs) 1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] 1510 else: -> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs) 1515 # If we don't have any hooks, we want to skip the rest of the logic in 1516 # this function, and just call forward. 1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks 1518 or _global_backward_pre_hooks or _global_backward_hooks 1519 or _global_forward_hooks or _global_forward_pre_hooks): -> 1520 return forward_call(*args, **kwargs) 1522 try: 1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference.py:70, in QuantizedLinear.forward(self, input) 68 def forward(self, input: torch.Tensor) -> torch.Tensor: 69 if self.gemv_op is None: ---> 70 self.prepare_matmul_op(input) 72 if self.use_gemv_rule(input): 73 return self.gemv_op.apply(input, self.codes, self.codebooks, self.scales, self.bias)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference.py:86, in QuantizedLinear.prepare_matmul_op(self, input) 78 if ( 79 not input.is_cuda 80 and self.codebook_size == 256 81 and self.codes.shape[0] == self.out_features // self.out_group_size 82 ): 83 self.codes.data = torch.permute(self.codes.data, (1, 0, 2)).contiguous() # TODO: fix this thing 85 self.gemv_op = _get_autograd_matmul_op( ---> 86 get_forward_pass_kernel(self.codebooks, False), 87 get_backward_pass_kernel(self.codebooks, False), 88 ) 90 self.gemm_op = _get_autograd_matmul_op( 91 get_forward_pass_kernel(self.codebooks, True), 92 get_backward_pass_kernel(self.codebooks, True), 93 ) 95 self.use_gemv_rule = lambda input: math.prod(input.shape[:-1]) <= 6

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/kernel_selector.py:35, in get_forward_pass_kernel(codebooks, optimize_for_training) 25 num_codebooks, codebook_size, out_group_size, in_group_size = codebooks.shape 27 if (optimize_for_training, codebooks.device.type, num_codebooks, codebook_size, out_group_size, in_group_size) == ( 28 False, 29 "cuda", (...) 33 8, 34 ): ---> 35 from .cuda_kernel import CUDA_FOLDER 37 return torch.ops.aqlm.code1x16_matmat 38 elif ( 39 optimize_for_training, 40 codebooks.device.type, (...) 51 8, 52 ):

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.py:8 5 from torch.utils.cpp_extension import load 7 CUDA_FOLDER = os.path.dirname(os.path.abspath(__file__)) ----> 8 CUDA_KERNEL = load( 9 name="codebook_cuda", 10 sources=[os.path.join(CUDA_FOLDER, "cuda_kernel.cpp"), os.path.join(CUDA_FOLDER, "cuda_kernel.cu")], 11 ) 13 torch.library.define( 14 "aqlm::code1x16_matmat", "(Tensor input, Tensor codes, Tensor codebooks, Tensor scales, Tensor bias) -> Tensor" 15 ) 17 torch.library.impl("aqlm::code1x16_matmat", "default", CUDA_KERNEL.code1x16_matmat)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1306, in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates) 1214 def load(name, 1215 sources: Union[str, List[str]], 1216 extra_cflags=None, (...) 1224 is_standalone=False, 1225 keep_intermediates=True): 1226 """ 1227 Load a PyTorch C++ extension just-in-time (JIT). 1228 (...) 1304 ... verbose=True) 1305 """ -> 1306 return _jit_compile( 1307 name, 1308 [sources] if isinstance(sources, str) else sources, 1309 extra_cflags, 1310 extra_cuda_cflags, 1311 extra_ldflags, 1312 extra_include_paths, 1313 build_directory or _get_build_directory(name, verbose), 1314 verbose, 1315 with_cuda, 1316 is_python_module, 1317 is_standalone, 1318 keep_intermediates=keep_intermediates)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1710, in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates) 1706 hipified_sources.add(hipify_result[s_abs].hipified_path if s_abs in hipify_result else s_abs) 1708 sources = list(hipified_sources) -> 1710 _write_ninja_file_and_build_library( 1711 name=name, 1712 sources=sources, 1713 extra_cflags=extra_cflags or [], 1714 extra_cuda_cflags=extra_cuda_cflags or [], 1715 extra_ldflags=extra_ldflags or [], 1716 extra_include_paths=extra_include_paths or [], 1717 build_directory=build_directory, 1718 verbose=verbose, 1719 with_cuda=with_cuda, 1720 is_standalone=is_standalone) 1721 finally: 1722 baton.release()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1823, in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_standalone) 1821 if verbose: 1822 print(f'Building extension module {name}...', file=sys.stderr) -> 1823 _run_ninja_build( 1824 build_directory, 1825 verbose, 1826 error_prefix=f"Error building extension '{name}'")

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2112, in _run_ninja_build(build_directory, verbose, error_prefix) 2110 if hasattr(error, 'output') and error.output: # type: ignore[union-attr] 2111 message += f": {error.output.decode(*SUBPROCESS_DECODE_ARGS)}" # type: ignore[union-attr] -> 2112 raise RuntimeError(message) from e

RuntimeError: Error building extension 'codebook_cuda': [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -DCUDA_NO_HALF_OPERATORS -DCUDA_NO_HALF_CONVERSIONS -DCUDA_NO_BFLOAT16_CONVERSIONS -DCUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -std=c++17 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o FAILED: cuda_kernel.cuda.o /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -DCUDA_NO_HALF_OPERATORS -DCUDA_NO_HALF_CONVERSIONS -DCUDA_NO_BFLOAT16_CONVERSIONS -DCUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -std=c++17 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu(270): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists

/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu(270): error: no suitable user-defined conversion from "__nv_bfloat162" to "__half2" exists

/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu(256): warning #177-D: variable "res" was declared but never referenced

2 errors detected in the compilation of "/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu". [2/3] c++ -MMD -MF cuda_kernel.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cpp -o cuda_kernel.o ninja: build stopped: subcommand failed.

BlackSamorez commented 5 months ago

Please provide the following information: your aqlm version, torch version, CUDA version, and GPU.

amrothemich commented 5 months ago

Thanks!

aqlm: 1.1.3
torch: 2.2.2+cu121
cuda: 12.1
GPU: Tesla V100 (Databricks NCv3 VM, https://learn.microsoft.com/en-us/azure/virtual-machines/ncv3-series)
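In case it's useful, this is roughly how I pulled those versions (a quick sketch; the importlib.metadata lookup for aqlm is just one way to get it):

```python
import importlib.metadata
import torch

print("aqlm:", importlib.metadata.version("aqlm"))  # 1.1.3
print("torch:", torch.__version__)                  # 2.2.2+cu121
print("cuda:", torch.version.cuda)                  # 12.1
print("gpu:", torch.cuda.get_device_name(0))        # Tesla V100
```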

(And FYI: I'm still hitting the same issue with ISTA-DASLab/Mixtral-8x7B-Instruct-v0_1-AQLM-2Bit-1x16-hf.)

BlackSamorez commented 5 months ago

@amrothemich The V100 doesn't really support efficient bf16 operations. If you update aqlm to the latest version, it will check the compute capability and display a more readable error.
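If you want to confirm what your card reports, a quick check like this (just a sketch) shows the compute capability and whether torch considers bf16 usable; a V100 reports (7, 0), while bf16-capable Ampere cards report (8, 0) or higher:

```python
import torch

# (major, minor) compute capability of the current GPU: V100 -> (7, 0), A100 -> (8, 0)
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
print("bf16 supported:", torch.cuda.is_bf16_supported())
```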

amrothemich commented 5 months ago

Okay got it, thanks. So is bf16 a requirement for the whole library or just this model?

BlackSamorez commented 5 months ago

bfloat16 is not a requirement at all. You can pass torch_dtype=torch.float16 to from_pretrained to use standard half precision. It should work for any model no problem.
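For example (a sketch; swap in whichever AQLM checkpoint you're loading):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Mixtral-8x7B-Instruct-v0_1-AQLM-2Bit-1x16-hf",
    torch_dtype=torch.float16,  # plain fp16 instead of bf16
    device_map="auto",
)
```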

BlackSamorez commented 5 months ago

Actually, float16 is the default for the models we put out. No need to specify it explicitly. But first please update aqlm. It simply wouldn't compile otherwise.

amrothemich commented 5 months ago

Yeah, I didn't specify in the original.

Isn't 1.1.3 the latest version? I think it's the newest on pip; should I install from GitHub?

BlackSamorez commented 5 months ago

I see, I'm sorry. It's not that you didn't update; it's actually the opposite: I broke it in the latest release. The latest dequantization kernels won't compile on GPUs with a compute capability below 8.0 (the V100 is 7.0). For now, you can downgrade to 1.1.2 and everything should work. I'll try to fix the error so that you can use the latest dequantization kernels as well.

amrothemich commented 5 months ago

Ah okay got it, thanks so much for your help!

BlackSamorez commented 5 months ago

Please update to aqlm>=1.1.4 and it should resolve the issue. Feel free to reopen it if it doesn't.