michaelfeil opened this issue 8 months ago (status: Open)
Thanks for reporting @michaelfeil - we'll get back to you soon.
@micwade-aws Thanks, very much looking forward to your answer. FYI @jimburtoft, this relates to our discussion today.
+1 on this thread. Furthermore, any way to get the hidden states of the last layer?
@michaelfeil Here is one thing you could try:
To return model forward scores during inference, you can use the HuggingFaceGenerationModelAdapter. This wrapper supports the Hugging Face generate() API, including the ability to return model forward scores. The only behavioral difference you may notice is that we only produce a score for the final token in the prompt (rather than a score for each prompt token).
Here is an example of how to use this wrapper to access the model forward scores:
# Model config object
config = ...
# Create your Neuron model
neuron_model = ...
# Compile your Neuron model
neuron_model.to_neuron()
# Create the Hugging Face wrapper model
neuron = HuggingFaceGenerationModelAdapter(config, neuron_model)
# Run inference using the Hugging Face generate API
# Pass in `output_scores=True, return_dict_in_generate=True` to return the scores
result = neuron.generate(inputs, ..., output_scores=True, return_dict_in_generate=True)
# Retrieve the tokens
tokens = result.sequences
# Retrieve the scores
scores = result.scores
For additional information about the HuggingFaceGenerationModelAdapter wrapper, you can visit this documentation.
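Per the Hugging Face generate() contract, result.scores is a tuple with one (batch, vocab) score tensor per generated step, and each step's scores can be normalized into log-probabilities with a log-softmax. A minimal sketch of that reduction, using plain Python lists in place of tensors so it runs without any framework (the helper names are illustrative, not part of either library):

```python
import math

def log_softmax(scores):
    """Convert one step's raw scores over the vocab into log-probabilities."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def chosen_logprobs(step_scores, token_ids):
    """For each generation step, look up the log-prob of the token actually emitted.

    step_scores: one vocab-sized score list per step (stand-in for result.scores)
    token_ids:   the generated token ids (stand-in for the tail of result.sequences)
    """
    return [log_softmax(scores)[tok] for scores, tok in zip(step_scores, token_ids)]

# Toy vocabulary of size 4, two generation steps
step_scores = [[2.0, 0.5, 0.1, -1.0], [0.0, 3.0, 0.0, 0.0]]
token_ids = [0, 1]
lp = chosen_logprobs(step_scores, token_ids)
```

With real tensors the same gather is a one-liner via torch.log_softmax followed by indexing with the generated token ids.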
Let me know if this solves the original issue.
@jluntamazon Thanks for your response! My issue was more directed at getting the logits for the whole sequence, specifically to estimate the metrics for lm-eval-harness.
Hi @michaelfeil:
We added the ability to return all input prompt context encoding logits in the 2.19 release. This is enabled by setting output_all_logits=True in the NeuronConfig during Neuron model initialization.
Please note that the model.sample() and HuggingFaceGenerationModelAdapter.generate() APIs do not yet support returning all context encoding logits. For now, you must call the Neuron model directly to return the context encoding logits.
Here is an example of how to use output_all_logits=True to access the logits for all input tokens:
import torch
from transformers_neuronx import NeuronAutoModelForCausalLM, NeuronConfig
# Original model checkpoint location
checkpoint = ...
# Create your Neuron model with output_all_logits=True to return all logits during inference
neuron_model = NeuronAutoModelForCausalLM.from_pretrained(
checkpoint,
...,
    neuron_config=NeuronConfig(..., output_all_logits=True)
)
# Compile your Neuron model
neuron_model.to_neuron()
# Prepare your inputs
input_ids = ...
_, context_length = input_ids.shape
cache_ids = torch.arange(0, context_length, dtype=torch.int32)
start_ids = torch.zeros(1, dtype=torch.int32)
# Perform context encoding and return all logits for each input token
logits = neuron_model(input_ids, cache_ids, start_ids)
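With output_all_logits=True the call above returns one logits row per input position, which is exactly what a sequence log-likelihood (the quantity lm-eval-harness computes) needs: the usual shift-and-gather, where the logits at position i score the token at position i+1. A minimal sketch of that reduction, with plain Python lists standing in for the logits tensor and log-softmax written out by hand (the function names are illustrative, not library API):

```python
import math

def log_softmax(row):
    """Normalize one vocab-sized logits row into log-probabilities."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def sequence_loglikelihood(logits, input_ids):
    """Sum log P(token[i+1] | tokens[:i+1]) over the sequence.

    logits:    one vocab-sized row per input position
    input_ids: the token ids of the same sequence
    """
    total = 0.0
    for i in range(len(input_ids) - 1):
        # The row at position i predicts the next token, input_ids[i + 1]
        total += log_softmax(logits[i])[input_ids[i + 1]]
    return total

# Toy example: vocab of size 3, sequence of 3 tokens
logits = [[1.0, 0.0, -1.0], [0.0, 2.0, 0.0], [0.5, 0.5, 0.5]]
input_ids = [0, 1, 2]
ll = sequence_loglikelihood(logits, input_ids)
```

On real tensors the same computation is torch.log_softmax(logits, dim=-1) gathered at the shifted input ids, summed over the sequence dimension.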
Please let us know if this provides the behavior you are looking for.
I am trying to retrieve the logits from the model, to use https://github.com/EleutherAI/lm-evaluation-harness/blob/692e0f83b5341b543fa288f84289617f793e4e93/lm_eval/models/huggingface.py#L972
In Hugging Face transformers I can get the logits from the forward pass; in simple PyTorch terms, I am looking for the equivalent in transformers-neuronx.
Update: 1/11
I got this to work. However, I am using the undocumented cache_ids feature. The output seems correct, but the code is terribly slow. My local laptop GPU (RTX 3060M) runs TinyLlama-1.1B around 25x faster.
Setup: LlamaForSampling, inf2.8xlarge instance, tp_degree=2, Neuron 2.15.9.