aws-neuron / transformers-neuronx

Generate Llama 2 from Embeddings #72

Open · liechtym opened this issue 9 months ago

liechtym commented 9 months ago

Compiling and loading Llama 2 on Neuron is working great for me on an inf2.8xlarge with the new 2.16 release.

However, I have a unique use case where I need to input embeddings directly into Llama 2 instead of token ids: I need to generate the embeddings, modify them, and then run generation from the modified embeddings. I was already able to generate the embeddings separately via llama_model.chkpt_model.model.embed_tokens(token_ids), but I don't see a way to plug them back into the model once I've modified them.

It seems to me that LlamaForSampling.sample() (from transformers_neuronx.llama.model) can't do this (correct me if I'm wrong); when I tried, I got TypeError: sample() got an unexpected keyword argument 'inputs_embeds'.
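For concreteness, here is a minimal sketch of what I'm attempting (the loading calls follow the transformers-neuronx README; the checkpoint path, tp_degree, and amp values are just from my setup, and my_modify() is a placeholder for my embedding edits):

import torch
from transformers_neuronx.llama.model import LlamaForSampling

# Compile and load Llama 2 on Neuron.
llama_model = LlamaForSampling.from_pretrained('Llama-2-7b-split', batch_size=1, tp_degree=2, amp='f16')
llama_model.to_neuron()

# Generating embeddings from the CPU-side checkpoint copy works fine.
token_ids = torch.tensor([[1, 15043, 3186]])  # example input ids
embeds = llama_model.chkpt_model.model.embed_tokens(token_ids)
embeds = my_modify(embeds)  # placeholder for my modifications

# But there is no way to feed them back in; this raises
# TypeError: sample() got an unexpected keyword argument 'inputs_embeds'
generated = llama_model.sample(inputs_embeds=embeds, sequence_length=256)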

So I tried using the HuggingFaceGenerationModelAdapter from transformers_neuronx.generation_utils to enable the generation API, as was done in this GPT2 example. However, an error prevented that, and I filed an issue for it in the transformers repo.
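For reference, the wrapping pattern I followed (mirroring the GPT2 example; the variable names are placeholders for my HF CPU model and compiled Neuron model):

from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter

# Wrap the compiled Neuron model so it exposes the HF generate() API.
llama_model = HuggingFaceGenerationModelAdapter(llama_model_cpu.config, llama_model_neuron)
outputs = llama_model.generate(input_ids, max_new_tokens=64)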

What is the best way to go about doing this? I really appreciate your help.

liechtym commented 8 months ago

In the transformers repo they said the HuggingFaceGenerationModelAdapter incompatibility error probably stems from the transformers-neuronx wrapper. Any help with this?

Here is the error:

Traceback (most recent call last):
  File "modular.py", line 107, in <module>
    chatbot = MiniGPT4LLama2Chatbot(cfg_path, gpu_id)
  File "modular.py", line 62, in __init__
    self.model = model_cls.from_config(model_config)
  File "/home/ubuntu/MiniGPT-4/minigpt4/models/minigpt4.py", line 173, in from_config
    model = cls(
  File "/home/ubuntu/MiniGPT-4/minigpt4/models/minigpt4.py", line 45, in __init__
    super().__init__(
  File "/home/ubuntu/MiniGPT-4/minigpt4/models/minigpt_base.py", line 43, in __init__
    self.llama_model, self.llama_tokenizer = self.init_llm(
  File "/home/ubuntu/MiniGPT-4/minigpt4/models/base_model.py", line 202, in init_llm
    llama_model = HuggingFaceGenerationModelAdapter(llama_model_cpu.config, llama_model_neuron)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/generation_utils.py", line 18, in __init__
    super().__init__(config)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1190, in __init__
    config = self._autoset_attn_implementation(
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1311, in _autoset_attn_implementation
    config = cls._check_and_enable_sdpa(
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1464, in _check_and_enable_sdpa
    raise ValueError(
ValueError: HuggingFaceGenerationModelAdapter does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. Please open an issue on GitHub to request support for this architecture: https://github.com/huggingface/transformers/issues/new

See more details on the issue page: https://github.com/huggingface/transformers/issues/28396.

Of course, my general goal is simply to get this working with input embeddings, so if this is not the right route, let me know.
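In the meantime, one idea I may try is forcing eager attention on the config before constructing the adapter, which should skip the SDPA check that raises above (untested, and _attn_implementation is a private transformers attribute, so this may break across versions):

config = llama_model_cpu.config
config._attn_implementation = 'eager'  # untested: bypass _check_and_enable_sdpa
llama_model = HuggingFaceGenerationModelAdapter(config, llama_model_neuron)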

shebbur-aws commented 8 months ago

Hi @liechtym, we do not have support for external embeddings. One way you could potentially get around this is by replacing the model embedding weights directly. Please let us know if that helps.

liechtym commented 8 months ago

@shebbur-aws Thanks for your reply. A workaround is totally fine for me. Could you give a quick explanation or example of how to replace the embedding weights and then run the forward pass on the rest of the model?
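For example, is it something like the following (an untested guess on my part; my_modify() is again a placeholder for my embedding edits)?

import torch

# Guess: overwrite the embedding table on the CPU checkpoint before compiling,
# so that to_neuron() bakes the modified weights into the Neuron model.
with torch.no_grad():
    embed = llama_model.chkpt_model.model.embed_tokens
    embed.weight.copy_(my_modify(embed.weight))  # my_modify() is hypothetical
llama_model.to_neuron()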

liechtym commented 8 months ago

Could I get some help on this, @shebbur-aws?

davidshtian commented 5 months ago

@liechtym @shebbur-aws Hi, I've run into the same situation. Do you have any resolution or workaround for passing input embeddings as the model input instead of input ids? Thanks!