aws / fmeval

Foundation Model Evaluations Library
http://aws.github.io/fmeval
Apache License 2.0

feat: update GetModelResponse transform to support multiple model invocations on the same input #220

Closed · danielezhu closed this 6 months ago

danielezhu commented 6 months ago

Description of changes: This PR updates GetModelResponse so that it supports the semantic robustness use case where a model is invoked on the same input multiple times.

Currently, the instance attribute input_to_output_keys maps a model input key to a list of model response keys, where each response is of the form (model_output, log_probability) and both model_output and log_probability are optional.

This PR changes input_to_output_keys so that a model input key is mapped to a list of tuples, one per model invocation, where each tuple is of the form (model_output, log_probability).
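For illustration, here is a minimal sketch of the shape of input_to_output_keys before and after this change (the concrete key names are made up for the example, not fmeval's defaults):

```python
# Hypothetical key names for illustration only.

# Before: an input key maps to a flat list of response keys.
input_to_output_keys = {
    "model_input": ["model_output", "log_probability"],
}

# After: an input key maps to a list of (model_output, log_probability)
# tuples, one tuple per invocation of the model on that input.
# Either element may be None, since both are optional.
input_to_output_keys = {
    "model_input": [
        ("model_output_1", "log_probability_1"),  # first invocation
        ("model_output_2", "log_probability_2"),  # second invocation
        ("model_output_3", None),                 # output only
    ],
}
```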

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

lucfra commented 6 months ago

I see that we would actually need to change another part of the code as well. Still, I think it's worthwhile to have at least two different transforms if we don't want to change the predict method now.

danielezhu commented 6 months ago

I see your point about having a separate transform for getting log probabilities, but note that GetModelResponse doesn't do any extra computation to obtain them. The model outputs and log probabilities are both returned when we invoke the model runner's predict method; GetModelResponse simply does a small amount of processing to parse the response payload. The time spent parsing the response is trivial compared with the latency of the predict method itself.
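To illustrate the point (a minimal sketch of the parsing step, not fmeval's actual implementation):

```python
from typing import Optional, Tuple

def get_model_response(model_runner, model_input: str) -> Tuple[Optional[str], Optional[float]]:
    # The dominant cost: one round trip to the model.
    model_output, log_probability = model_runner.predict(model_input)
    # The "tiny bit of processing": unpacking the already-returned payload.
    return model_output, log_probability
```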

lucfra commented 6 months ago

The implementations of input log probability and generation are model- and framework-dependent. For example, see https://github.com/aws/fmeval/blob/main/examples/custom_model_runner_hf.ipynb for the HF model runner: there, the model is called twice. So are you referring to JumpStart models? Also, many models, like those in Bedrock, don't return log probabilities, so I think it would be much cleaner to raise a NotImplementedError (or some similar error) rather than return a tuple containing None. Note too that for stereotyping, computing the two together may generate output that is never used, and generation is computationally expensive.
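In code, that suggestion would look roughly like this (a hypothetical runner, not fmeval's API):

```python
class NoLogProbRunner:
    """Hypothetical runner for a model (as with many Bedrock models)
    that generates text but does not expose log probabilities."""

    def generate(self, model_input: str) -> str:
        # Generation is supported (stubbed here for illustration).
        return f"<generated text for {model_input!r}>"

    def log_probability(self, model_input: str) -> float:
        # Fail loudly rather than silently returning (model_output, None).
        raise NotImplementedError(
            "This model runner does not support log probabilities."
        )
```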

Finally, it's pretty standard for other eval frameworks to separate the two; see the base class LM here: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/api/model.py#L275. Our log probability corresponds to loglikelihood_rolling there: https://github.com/EleutherAI/lm-evaluation-harness/blob/c7b03ad404784c8fb112eabf737b27829cbf0db8/lm_eval/api/model.py#L57. We don't have an equivalent of their loglikelihood function yet, and I think it would be very nice to add in the future to unlock some evals. Overall, carrying tuples around rather than having separate transforms is awkward and unclear, and not well justified from a usage perspective. A sketch of the separation I have in mind follows.
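As a hedged sketch (the transform classes and the per-purpose generate/log_probability runner methods below are hypothetical, mirroring the generation vs. loglikelihood_rolling split in lm-evaluation-harness):

```python
from typing import Any, Dict

class GetModelOutputs:
    """Hypothetical transform that writes only generated text into the record."""

    def __init__(self, model_runner, input_key: str, output_key: str):
        self.model_runner = model_runner
        self.input_key = input_key
        self.output_key = output_key

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Only generation is requested; no unused log-prob computation.
        record[self.output_key] = self.model_runner.generate(record[self.input_key])
        return record


class GetLogProbabilities:
    """Hypothetical transform that writes only the input log probability."""

    def __init__(self, model_runner, input_key: str, log_prob_key: str):
        self.model_runner = model_runner
        self.input_key = input_key
        self.log_prob_key = log_prob_key

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # A runner without log-prob support would raise NotImplementedError here.
        record[self.log_prob_key] = self.model_runner.log_probability(record[self.input_key])
        return record
```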

I'm fine with changing this later on, but let's make sure we do it at some point.