Marker-Inc-Korea / AutoRAG

AutoML tool for RAG
https://auto-rag.com/
Apache License 2.0

[BUG] ValueError: XFormers does not support attention logits soft capping. #696

Closed · daegonYu closed this issue 1 month ago

daegonYu commented 1 month ago

Describe the bug

ValueError: XFormers does not support attention logits soft capping.

Full Error log

ValueError                                Traceback (most recent call last)
Cell In[7], line 3
      1 import nest_asyncio
      2 nest_asyncio.apply()
----> 3 evaluator.start_trial(yaml_path)

File /home3/dgon/NLP/gits/AutoRAG/autorag/evaluator.py:126, in Evaluator.start_trial(self, yaml_path)
--> 126 previous_result = run_node_line(node_line, node_line_dir, previous_result)

File /home3/dgon/NLP/gits/AutoRAG/autorag/node_line.py:47, in run_node_line(nodes, node_line_dir, previous_result)
--> 47 previous_result = node.run(previous_result, node_line_dir)

File /home3/dgon/NLP/gits/AutoRAG/autorag/schema/node.py:57, in Node.run(self, previous_result, node_line_dir)
--> 57 return self.run_node(modules=input_modules, module_params=input_params, previous_result=previous_result, node_line_dir=node_line_dir, strategies=self.strategy)

File /home3/dgon/NLP/gits/AutoRAG/autorag/nodes/generator/run.py:46, in run_generator_node(modules, module_params, previous_result, node_line_dir, strategies)
--> 46 results, execution_times = zip(*map(lambda x: measure_speed(x[0], project_dir=project_dir, previous_result=previous_result, **x[1]), zip(modules, module_params)))

File /home3/dgon/NLP/gits/AutoRAG/autorag/nodes/generator/run.py:48, in run_generator_node.<locals>.<lambda>(x)
--> 48 lambda x: measure_speed(x[0], project_dir=project_dir, previous_result=previous_result, **x[1])

File /home3/dgon/NLP/gits/AutoRAG/autorag/strategy.py:14, in measure_speed(func, *args, **kwargs)
--> 14 result = func(*args, **kwargs)

File /home3/dgon/NLP/gits/AutoRAG/autorag/utils/util.py:67, in result_to_dataframe.<locals>.decorator_result_to_dataframe.<locals>.wrapper(*args, **kwargs)
--> 67 results = func(*args, **kwargs)

File /home3/dgon/NLP/gits/AutoRAG/autorag/nodes/generator/base.py:49, in generator_node.<locals>.wrapper(project_dir, previous_result, llm, **kwargs)
--> 49 return func(prompts=prompts, llm=llm, **kwargs)

File /home3/dgon/NLP/gits/AutoRAG/autorag/nodes/generator/vllm.py:38, in vllm(prompts, llm, **kwargs)
--> 38 vllm_model = make_vllm_instance(llm, input_kwargs)

File /home3/dgon/NLP/gits/AutoRAG/autorag/nodes/generator/vllm.py:74, in make_vllm_instance(llm, input_args)
--> 74 return LLM(model, **input_kwargs)

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/entrypoints/llm.py:177, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, **kwargs)
--> 177 self.llm_engine = LLMEngine.from_engine_args(engine_args, usage_context=UsageContext.LLM_CLASS)

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/engine/llm_engine.py:538, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
--> 538 engine = cls(**engine_config.to_dict(), executor_class=executor_class, log_stats=not engine_args.disable_log_stats, usage_context=usage_context, stat_loggers=stat_loggers)

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/engine/llm_engine.py:305, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, decoding_config, observability_config, prompt_adapter_config, executor_class, log_stats, usage_context, stat_loggers, input_registry, step_return_finished_only)
--> 305 self.model_executor = executor_class(model_config=model_config, cache_config=cache_config, parallel_config=parallel_config, scheduler_config=scheduler_config, device_config=device_config, lora_config=lora_config, speculative_config=speculative_config, load_config=load_config, prompt_adapter_config=prompt_adapter_config, observability_config=self.observability_config)

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/executor/executor_base.py:47, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, prompt_adapter_config, observability_config)
--> 47 self._init_executor()

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/executor/gpu_executor.py:40, in GPUExecutor._init_executor(self)
--> 40 self.driver_worker.load_model()

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/worker/worker.py:182, in Worker.load_model(self)
--> 182 self.model_runner.load_model()

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/worker/model_runner.py:917, in GPUModelRunnerBase.load_model(self)
--> 917 self.model = get_model(model_config=self.model_config, device_config=self.device_config, load_config=self.load_config, lora_config=self.lora_config, parallel_config=self.parallel_config, scheduler_config=self.scheduler_config, cache_config=self.cache_config)

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py:19, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, cache_config)
--> 19 return loader.load_model(model_config=model_config, device_config=device_config, lora_config=lora_config, parallel_config=parallel_config, scheduler_config=scheduler_config, cache_config=cache_config)

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:341, in DefaultModelLoader.load_model(self, model_config, device_config, lora_config, parallel_config, scheduler_config, cache_config)
--> 341 model = _initialize_model(model_config, self.load_config, lora_config, cache_config, scheduler_config)

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:170, in _initialize_model(model_config, load_config, lora_config, cache_config, scheduler_config)
--> 170 return build_model(model_class, model_config.hf_config, cache_config=cache_config, quant_config=_get_quantization_config(model_config, load_config), lora_config=lora_config, multimodal_config=model_config.multimodal_config, scheduler_config=scheduler_config)

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py:155, in build_model(model_class, hf_config, cache_config, quant_config, lora_config, multimodal_config, scheduler_config)
--> 155 return model_class(config=hf_config, cache_config=cache_config, quant_config=quant_config, **extra_kwargs)

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/models/gemma2.py:329, in Gemma2ForCausalLM.__init__(failed resolving arguments)
--> 329 self.model = Gemma2Model(config, cache_config, quant_config)

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/models/gemma2.py:255, in Gemma2Model.__init__(self, config, cache_config, quant_config)
--> 255 self.layers = nn.ModuleList([Gemma2DecoderLayer(layer_idx, config, cache_config, quant_config) for layer_idx in range(config.num_hidden_layers)])

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/models/gemma2.py:256, in <listcomp>(.0)
--> 256 Gemma2DecoderLayer(layer_idx, config, cache_config, quant_config)

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/models/gemma2.py:181, in Gemma2DecoderLayer.__init__(self, layer_idx, config, cache_config, quant_config)
--> 181 self.self_attn = Gemma2Attention(layer_idx=layer_idx, config=config, hidden_size=self.hidden_size, num_heads=config.num_attention_heads, num_kv_heads=config.num_key_value_heads, head_dim=config.head_dim, max_position_embeddings=config.max_position_embeddings, rope_theta=config.rope_theta, cache_config=cache_config, quant_config=quant_config, attn_logits_soft_cap=config.attn_logit_softcapping)

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/model_executor/models/gemma2.py:147, in Gemma2Attention.__init__(self, layer_idx, config, hidden_size, num_heads, num_kv_heads, head_dim, max_position_embeddings, rope_theta, cache_config, quant_config, attn_logits_soft_cap)
--> 147 self.attn = Attention(self.num_heads, self.head_dim, self.scaling, num_kv_heads=self.num_kv_heads, cache_config=cache_config, quant_config=quant_config, logits_soft_cap=attn_logits_soft_cap)

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/attention/layer.py:84, in Attention.__init__(self, num_heads, head_size, scale, num_kv_heads, alibi_slopes, cache_config, quant_config, blocksparse_params, logits_soft_cap, prefix)
--> 84 self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads, alibi_slopes, sliding_window, kv_cache_dtype, blocksparse_params, logits_soft_cap)

File ~/anaconda3/envs/autorag/lib/python3.10/site-packages/vllm/attention/backends/xformers.py:422, in XFormersImpl.__init__(self, num_heads, head_size, scale, num_kv_heads, alibi_slopes, sliding_window, kv_cache_dtype, blocksparse_params, logits_soft_cap)
    421 if logits_soft_cap is not None:
--> 422     raise ValueError("XFormers does not support attention logits soft capping.")

ValueError: XFormers does not support attention logits soft capping.

Code where the bug happened

from autorag.evaluator import Evaluator

evaluator = Evaluator(qa_data_path='qd.parquet', corpus_data_path='corpus.parquet',
                      project_dir='local_result')

import nest_asyncio
nest_asyncio.apply()
evaluator.start_trial(yaml_path)
The config YAML passed to start_trial:

# This config YAML file does not contain any optimization.
node_lines:
- node_line_name: retrieve_node_line  # Arbitrary node line name
  nodes:
    - node_type: retrieval
      strategy:
        metrics: [retrieval_f1, retrieval_recall, retrieval_precision]
      top_k: 3
      modules:
        - module_type: vectordb
          embedding_model: my_bge_model
- node_line_name: post_retrieve_node_line  # Arbitrary node line name
  nodes:
    - node_type: prompt_maker
      strategy:
        metrics: [ meteor, rouge, bert_score ]
      modules:
        - module_type: fstring
          prompt: "주어진 문서만을 이용하여 question에 따라 답하시오. 표와 텍스트 전부 확인하여 문장형태로 답변해줘. \n\n {retrieved_contents} \n\n Question: {query} \n\n Answer: "
    - node_type: generator
      strategy:
        metrics: [ meteor, rouge, bert_score ]
      modules:
        - module_type: vllm
          llm: google/gemma-2-9b-it
          dtype : bfloat16
          temperature: [ 0.1 ]
          max_tokens: 400

Desktop (please complete the following information):

AutoRAG: 0.2.15
torch: 2.4.0+cu118

vkehfdl1 commented 1 month ago

@daegonYu It looks like a vllm error. Is something wrong with the xformers version, or is a feature missing?

daegonYu commented 1 month ago

My xformers version is 0.0.27.post2+cu118. I installed it according to the installation guide on the xformers GitHub.


This error isn't reported on the xformers GitHub; should I ask about it there?

vkehfdl1 commented 1 month ago

@daegonYu It would be great if you asked about this error there, because I have never seen it and the cause lies somewhere in vllm and xformers. If I run into this issue myself, I will investigate it.

effortprogrammer commented 1 month ago

Yes, this issue is related to vllm. When you serve a gemma-2 model with vllm without the FlashInfer backend, it automatically falls back to the xformers backend, and unfortunately the xformers backend does not support attention logits soft capping.

One way to serve it with the xformers backend is to remove everything related to attention logits soft capping. The quality drop should be minimal for gemma-2 9b, but it will hurt gemma-2 27b significantly, so be aware. A rough sketch of that idea is below.
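For illustration only, an untested sketch of that workaround (the local directory name is made up, and whether vllm then runs cleanly on the xformers backend is an assumption):

import json
from pathlib import Path

# Assumes the model has already been downloaded locally, e.g. with
#   huggingface-cli download google/gemma-2-9b-it --local-dir ./gemma-2-9b-it
model_dir = Path("./gemma-2-9b-it")  # hypothetical local copy of the checkpoint
config_path = model_dir / "config.json"

config = json.loads(config_path.read_text())
# Null out the soft-capping fields so the xformers backend no longer rejects the model.
config["attn_logit_softcapping"] = None
config["final_logit_softcapping"] = None
config_path.write_text(json.dumps(config, indent=2))

Then point the vllm module's llm parameter in the AutoRAG YAML at ./gemma-2-9b-it instead of google/gemma-2-9b-it.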

CC @vkehfdl1

daegonYu commented 1 month ago

@effortprogrammer Thank you for your reply. If I want to use the Flashinfer backend, can I do it with pip install?

effortprogrammer commented 1 month ago

> @effortprogrammer Thank you for your reply. If I want to use the Flashinfer backend, can I do it with pip install?

Well, if you look at the vllm options, there is an environment variable that selects the attention backend: xformers, FlashInfer, etc.

Please note that to use FlashInfer, your GPU has to support FlashAttention 2, and you need to make sure FlashAttention 2 is installed in your environment. A sketch of setting the backend follows below.
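If I remember right, the variable is VLLM_ATTENTION_BACKEND (please double-check the accepted values against your vllm version). Setting it before AutoRAG builds the vllm engine would look roughly like this:

import os

# Must be set before vllm constructs its engine, i.e. before evaluator.start_trial() runs.
# "FLASHINFER" assumes the flashinfer package is installed and your GPU supports it;
# the equivalent shell form would be: export VLLM_ATTENTION_BACKEND=FLASHINFER
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"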

vkehfdl1 commented 1 month ago

@effortprogrammer Thanks for valuable information!

@daegonYu You can set vllm parameters in the YAML file easily: just add the parameter name and value under the vllm module, and AutoRAG will pass them through to vllm.LLM, EngineArgs, or SamplingParams. A sketch is below. That said, I highly recommend using AutoRAG v0.2.16, which was released yesterday.
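For example, based on your original config (the extra keys at the bottom are illustrative; use whichever vllm.LLM / EngineArgs / SamplingParams names you actually need):

    - node_type: generator
      strategy:
        metrics: [ meteor, rouge, bert_score ]
      modules:
        - module_type: vllm
          llm: google/gemma-2-9b-it
          dtype: bfloat16
          temperature: [ 0.1 ]
          max_tokens: 400
          # Extra key-value pairs here are forwarded to vllm, e.g. (illustrative):
          gpu_memory_utilization: 0.9
          max_model_len: 4096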

vkehfdl1 commented 1 month ago

It seems this was resolved by using a different vllm backend, so I am closing this issue.