aws-neuron / transformers-neuronx


Inferring logits from `model.forward` for the entire batch instead of the last forward's output. #73

Open michaelfeil opened 8 months ago

michaelfeil commented 8 months ago

I am trying to retrieve the logits from the model, to use https://github.com/EleutherAI/lm-evaluation-harness/blob/692e0f83b5341b543fa288f84289617f793e4e93/lm_eval/models/huggingface.py#L972

Huggingface transformers

In transformers I can get the logits from the forward pass:

# for models from transformers.AutoModelForCausalLM
# inps.shape = torch.Size([2, 205])  # we are running batch size 2
# two sequences with context
mylogits = self.hf_model(
    input_ids=inps,  # attention_mask=attn_mask, labels=labels
).logits
# mylogits.shape = torch.Size([2, 205, 32000])  # llama2 has a 32000-token vocab
# i.e. [batch, padding_length, vocab]

# after log_softmax, these logits give the log-likelihood of each token, from which we can infer the model's certainty
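
Concretely, a small sketch continuing the snippet above (an addition for illustration; `mylogits` and `inps` are the tensors defined there):

import torch.nn.functional as F

# log-probabilities over the vocabulary at every position
log_probs = F.log_softmax(mylogits.float(), dim=-1)  # [2, 205, 32000]
# log-probability assigned to each actual next token (position i predicts token i+1)
token_log_probs = log_probs[:, :-1, :].gather(2, inps[:, 1:].unsqueeze(-1)).squeeze(-1)  # [2, 204]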

transformers-neuronx

out = self.neuron_model(
    inps,  # attention_mask=attn_mask, labels=labels
)
# returns torch.Size([2, 32000]): only the logits for the last token of each sequence, and already after softmax.

In plain PyTorch terms

# What I get is equivalent to this:
neuron_model(batched_inps) =~= F.log_softmax(self.hf_model(batched_inps).logits[:,-1,:], dim=-1)
# But I want to compute  `the_magic_function`
self.hf_cuda_model(batched_inps).logits[:,:,:] =?= neuron_model.the_magic_function

Update: 1/11

I got this to work. However, I am using the undocumented cache_ids feature.

The output seems correct, but the code is terribly slow. My local laptop GPU (RTX 3060M) runs TinyLlama-1.1B around 25x faster.

def logits_hf(input_ids):
    with torch.inference_mode():
        return self.hf_cuda_model(input_ids).logits

def logits(input_ids):
    """
    Get logits for the entire sequence.

    :param input_ids: torch.Tensor
        A torch tensor of shape [batch, sequence_length];
        the sequence length may vary from call to call
    :return:
        A torch tensor of shape [batch, sequence, vocab] with the
        logits returned from the model's decoder
    """
    _, sequence_length = input_ids.shape

    with torch.inference_mode():
        # One Neuron forward per position, using the undocumented cache_ids
        # argument to place each token at its position in the KV cache.
        cache_ids = torch.arange(0, sequence_length, dtype=torch.int32).split(1)
        input_ids_split = input_ids.split(1, dim=1)

        return torch.stack(
            [
                self.neuron_model(input_ids=input_id, cache_ids=cache_id)
                for input_id, cache_id in zip(input_ids_split, cache_ids)
            ],
            dim=1,
        )

Using LlamaForSampling on an inf2.8xlarge instance, tp_degree=2, Neuron 2.15.9:

self.neuron_model = LlamaForSampling.from_pretrained(...)
self.neuron_model.to_neuron()
micwade-aws commented 8 months ago

Thanks for reporting @michaelfeil - we'll get back to you soon.

michaelfeil commented 8 months ago

@micwade-aws Thanks, very much looking forward to your answer. FYI @jimburtoft, re: our discussion today.

zhouku92 commented 8 months ago

+1 on this thread. Furthermore, any way to get the hidden states of the last layer?

jluntamazon commented 5 months ago

@michaelfeil Here is one thing you could try:

To return model forward scores during inference, you can use the HuggingFaceGenerationModelAdapter. This wrapper supports the Hugging Face generate() API functionality, including the ability to return model forward scores. The only behavioral difference that you may notice is that we only produce scores for the final token in the prompt (rather than a score for each prompt token).

Here is an example of how to use this wrapper to access the model forward scores:

# Model config object
config = ...

# Create your Neuron model
neuron_model = ... 
# Compile your Neuron model
neuron_model.to_neuron()

# Create the Hugging Face wrapper model
neuron = HuggingFaceGenerationModelAdapter(config, neuron_model)

# Run inference using the Hugging Face generate API
# Pass in `output_scores=True, return_dict_in_generate=True` to return the scores
result = neuron.generate(inputs, ..., output_scores=True, return_dict_in_generate=True)

# Retrieve the tokens
tokens = result.sequences

# Retrieve the scores
scores = result.scores
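
With the standard Hugging Face generate() API, result.scores is a tuple holding one [batch, vocab] tensor per generated token, so it can be stacked into a single tensor if that view is more convenient (a small sketch continuing the example above):

import torch

# one entry per generated token, each of shape [batch_size, vocab_size]
stacked_scores = torch.stack(result.scores, dim=1)  # [batch_size, generated_tokens, vocab_size]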

For additional information about the HuggingFaceGenerationModelAdapter wrapper, see the transformers-neuronx documentation.

Let me know if this solves the original issue.

michaelfeil commented 5 months ago

@jluntamazon Thanks for your response! My issue was more about getting the logits for the whole sequence, specifically to estimate the metrics for lm-eval-harness.
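
For context, roughly what that scoring needs once full-sequence logits are available (a paraphrased sketch of the harness's loglikelihood computation, not its actual code; full_logits stands for a [batch, sequence, vocab] tensor such as the one produced by the cache_ids workaround above):

import torch
import torch.nn.functional as F

def continuation_loglikelihood(full_logits, input_ids, continuation_len):
    """Sum of log-probabilities assigned to the final `continuation_len` tokens,
    plus whether greedy decoding would reproduce them exactly."""
    log_probs = F.log_softmax(full_logits.float(), dim=-1)   # [batch, seq, vocab]
    shifted = log_probs[:, :-1, :]                           # position i predicts token i+1
    targets = input_ids[:, 1:]
    token_ll = shifted.gather(2, targets.unsqueeze(-1)).squeeze(-1)  # [batch, seq-1]
    cont_ll = token_ll[:, -continuation_len:].sum(dim=-1)
    greedy_tokens = shifted.argmax(dim=-1)[:, -continuation_len:]
    is_greedy = greedy_tokens.eq(targets[:, -continuation_len:]).all(dim=-1)
    return cont_ll, is_greedy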

hannanjgaws commented 2 months ago

Hi @michaelfeil:

We added the ability to return all input prompt context encoding logits in the 2.19 Release. This is enabled by setting output_all_logits=True in the NeuronConfig during Neuron model initialization.

Please note that the model.sample() and HuggingFaceGenerationModelAdapter.generate() APIs do not yet support returning all context encoding logits. For now, you must call the Neuron model directly to return the context encoding logits.

Here is an example of how to use output_all_logits=True to access the logits for all input tokens:

import torch
from transformers_neuronx import NeuronAutoModelForCausalLM, NeuronConfig

# Original model checkpoint location
checkpoint = ...

# Create your Neuron model with output_all_logits=True to return all logits during inference
neuron_model = NeuronAutoModelForCausalLM.from_pretrained(
    checkpoint,
    ...,
    neuron_config=NeuronConfig(..., output_all_logits=True)
)

# Compile your Neuron model
neuron_model.to_neuron()

# Prepare your inputs
input_ids = ...
_, context_length = input_ids.shape
cache_ids = torch.arange(0, context_length, dtype=torch.int32)
start_ids = torch.zeros(1, dtype=torch.int32)

# Perform context encoding and return all logits for each input token
logits = neuron_model(input_ids, cache_ids, start_ids)
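
To sanity-check this against the Hugging Face reference from the top of the thread, a sketch like the following could be used. It assumes the returned logits have shape [batch, context_length, vocab] (worth verifying on your build) and normalizes both sides with log_softmax, which is idempotent, so it does not matter whether either output is already log-softmaxed; the tolerance is a placeholder:

import torch.nn.functional as F
from transformers import AutoModelForCausalLM

# Reference forward pass with the original checkpoint
hf_model = AutoModelForCausalLM.from_pretrained(checkpoint)
with torch.inference_mode():
    hf_logits = hf_model(input_ids).logits  # [batch, context_length, vocab]

neuron_log_probs = F.log_softmax(logits.float(), dim=-1)
hf_log_probs = F.log_softmax(hf_logits.float(), dim=-1)
print(torch.allclose(neuron_log_probs, hf_log_probs, atol=1e-2))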

Please let us know if this provides the behavior you are looking for.