NVIDIA / RULER

This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
Apache License 2.0

Prediction format during evals #30

Closed: karansaxena closed this issue 5 months ago

karansaxena commented 5 months ago

Hi Authors,

I am trying to understand the post-processing format used before the evaluation step. In the case I am looking at, the prediction (LLM output) is a string, e.g.: '1. arthur 2. kilt 3. fire 4. meter 5. appliance 6. behalf 7. forest 8. activity 9. authenticity 10. ferret'. The corresponding ground truth is ['appliance', 'meter', 'forest', 'ferret', 'kilt', 'behalf', 'fire', 'activity', 'arthur', 'authenticity'].

When I call the string_match_all function on these, the output is 0.31. This does not look right.

Specifically, the line at https://github.com/hsiehjackson/RULER/blob/main/scripts/eval/synthetic/constants.py#L29 zips a string with a list, which would be a character-wise zip.
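
To make the concern concrete, here is a minimal sketch of what Python's built-in zip does when it is handed a str and a list. It uses only the example values from this issue and does not call any RULER code.

```python
# Example values copied from this issue (one datapoint, not wrapped in a batch).
pred = '1. arthur 2. kilt 3. fire 4. meter 5. appliance 6. behalf 7. forest 8. activity 9. authenticity 10. ferret'
ref = ['appliance', 'meter', 'forest', 'ferret', 'kilt', 'behalf', 'fire', 'activity', 'arthur', 'authenticity']

# zip() iterates a str one character at a time and stops at the shorter input,
# so only the first len(ref) characters of the prediction are ever paired.
pairs = list(zip(pred, ref))

for ch, word in pairs[:4]:
    print(repr(ch), '<->', word)
```

Each pair is a single character of the prediction next to a whole reference word, which is not a meaningful comparison.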

Where am I going wrong?

hsiehjackson commented 5 months ago

Hi, string_match_all() takes the predictions for all samples, [str, str, ...], and the reference answers for all samples, [[str, str, ...], [str, str, ...], ...]. Here is the line where we use this metric function.

karansaxena commented 5 months ago

Sorry, I did not follow. The prediction is the text output from the LLM, right? (I don't think it is being post-processed into a list.)

The ground truth is the list of correct terms, as shown in the example above, right?

Basically, what I am missing is the step where the LLM text output is converted to [str, str, ...], and how that is done.

Thanks for your help.

karansaxena commented 5 months ago

Edit - I am talking about ONE datapoint. I am trying to run this for a single test case and a single task.

hsiehjackson commented 5 months ago

Yep, I understand you are talking about one datapoint. However, string_match_all() is designed to evaluate multiple datapoints at once, with preds=[str, str, ...] and refs=[[str, str, ...], [str, str, ...]]. So in your case, each input should be wrapped in a list, like the following:

preds = ['1. arthur 2. kilt 3. fire 4. meter 5. appliance 6. behalf 7. forest 8. activity 9. authenticity 10. ferret']
refs = [['appliance', 'meter', 'forest', 'ferret', 'kilt', 'behalf', 'fire', 'activity', 'arthur', 'authenticity']]
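
The difference between the two call shapes can be reproduced with a small sketch. The function below is a reimplementation written to match the behavior described in this thread (the fraction of reference strings found as case-insensitive substrings of each prediction, averaged over the batch and scaled to a percentage). Treat it as an assumption, not as a copy of scripts/eval/synthetic/constants.py.

```python
def string_match_all(preds, refs):
    """Sketch (assumed behavior): mean fraction of references found per prediction, in percent."""
    per_sample = [
        sum(1.0 if r.lower() in pred.lower() else 0.0 for r in ref) / len(ref)
        for pred, ref in zip(preds, refs)
    ]
    return round(sum(per_sample) / len(preds) * 100, 2)


pred = '1. arthur 2. kilt 3. fire 4. meter 5. appliance 6. behalf 7. forest 8. activity 9. authenticity 10. ferret'
ref = ['appliance', 'meter', 'forest', 'ferret', 'kilt', 'behalf', 'fire', 'activity', 'arthur', 'authenticity']

# Unbatched call: the str is zipped with the list character by character,
# and len(preds) is the number of characters in the prediction.
unbatched = string_match_all(pred, ref)

# Batched call: one prediction and one list of references, each wrapped in a list.
batched = string_match_all([pred], [ref])

print(unbatched, batched)  # 0.31 100.0
```

Under this sketch the unbatched call yields the 0.31 reported above, because only the character 'r' paired with 'arthur' produces any match and the sum is divided by the 106 characters of the prediction. The batched call finds all ten reference words and scores 100.0.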

karansaxena commented 5 months ago

Ah, got it. I adapted the code for my use but missed the point where the inputs are single-vs-batched. Resolved.

Thanks for the help!
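
For anyone adapting a batch-oriented metric to score a single datapoint, as discussed in this thread, one option is a thin wrapper that rejects mis-shaped inputs before building a batch of one. The names score_single and fraction_found are hypothetical and do not exist in RULER. fraction_found is only a stand-in metric with the same (preds, refs) calling convention.

```python
def score_single(metric_fn, pred, ref):
    """Score one datapoint with a batch-oriented metric by wrapping it in a batch of one."""
    if not isinstance(pred, str):
        raise TypeError("pred must be a single string (one model output)")
    if isinstance(ref, str) or not all(isinstance(r, str) for r in ref):
        raise TypeError("ref must be a list of strings (the answers for one datapoint)")
    return metric_fn([pred], [list(ref)])


def fraction_found(preds, refs):
    """Stand-in metric (hypothetical): percent of references found as substrings."""
    per_sample = [
        sum(1.0 for r in ref if r.lower() in p.lower()) / len(ref)
        for p, ref in zip(preds, refs)
    ]
    return round(sum(per_sample) / len(preds) * 100, 2)


print(score_single(fraction_found, '1. arthur 2. kilt', ['arthur', 'kilt', 'fire']))
```

Failing loudly on a list-shaped pred or a bare-string ref avoids the silent character-wise zip that produced the misleading score.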