IINemo / lm-polygraph


Possible mismatching max_length and max_new_tokens in example eval script #118

Open kirill-fedyanin opened 11 months ago

kirill-fedyanin commented 11 months ago

I was running polygraph_eval with this example config: https://github.com/IINemo/lm-polygraph/blob/main/examples/configs/polygraph_eval_wmt14_ende.yaml

I got a warning about the input string being too long.

It didn't fail, but I'm almost sure it would mess up the results.

[Screenshot from 2023-10-09 11-11-07]

My wild guess is that the stat calculators' max_length is not connected to max_generated_tokens in the UE manager itself. But I didn't really look into it for now.
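To illustrate what I mean, here is a minimal sketch of the suspected disconnect. The model and variable names are illustrative only, not the actual lm-polygraph internals:

```python
# Illustrative sketch (not lm-polygraph code): the input cap and the
# generation length are configured independently of each other.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM; just an example here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

long_source = " ".join(["word"] * 1000)  # deliberately longer than 256 tokens

# "Stat calculator" side: the input is tokenized with its own hard cap ...
inputs = tokenizer(long_source, truncation=True, max_length=256,
                   return_tensors="pt")

# ... while the "UE manager" side sets the generation length separately,
# with no reference to that 256-token input cap.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```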

kirill-fedyanin commented 11 months ago

Also, I wonder why we're switching from `max_new_tokens` to just `max_tokens` here:

https://github.com/IINemo/lm-polygraph/blob/00851972311db837de6358c90dac65616495d4de/src/lm_polygraph/utils/model.py#L69

rvashurin commented 11 months ago

> Also, I wonder why we're switching from `max_new_tokens` to just `max_tokens` here:
>
> https://github.com/IINemo/lm-polygraph/blob/00851972311db837de6358c90dac65616495d4de/src/lm_polygraph/utils/model.py#L69

Probably to comply with the OpenAI API format? @cant-access-rediska0123, can you comment on that?
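If that's the reason, the rename would be needed simply because HuggingFace `generate()` takes `max_new_tokens` while the OpenAI completion API takes `max_tokens`. A hypothetical translation helper (not the actual lm-polygraph code, just to show the difference between the two APIs):

```python
# Hypothetical helper: rename HF-style generation kwargs to their
# OpenAI API equivalents before calling a blackbox model.
def to_openai_kwargs(generation_params: dict) -> dict:
    params = dict(generation_params)
    if "max_new_tokens" in params:
        params["max_tokens"] = params.pop("max_new_tokens")
    return params


print(to_openai_kwargs({"max_new_tokens": 100, "temperature": 0.7}))
# -> {'temperature': 0.7, 'max_tokens': 100}
```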

rvashurin commented 11 months ago

> I was running polygraph_eval with this example config: https://github.com/IINemo/lm-polygraph/blob/main/examples/configs/polygraph_eval_wmt14_ende.yaml
>
> I got a warning about the input string being too long.
>
> It didn't fail, but I'm almost sure it would mess up the results.
>
> My wild guess is that the stat calculators' max_length is not connected to max_generated_tokens in the UE manager itself. But I didn't really look into it for now.

I'm pretty sure this is due to this line here: https://github.com/IINemo/lm-polygraph/blob/main/src/lm_polygraph/utils/model.py#L166

For some reason we set a very low limit on the input sequence length, which triggers the warning during generation. It doesn't fail, though, because in reality the model (Llama-2, for example) is capable of handling much longer sequences than a measly 256 tokens.

Not sure why we even set this value; we should probably remove it, along with the rest of the code for loading and configuring whitebox models.
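One possible fix, assuming the 256 is a hardcoded `max_length` passed to the tokenizer, would be to derive the cap from the model itself instead of a constant. A rough sketch (the model name is just an example):

```python
# Sketch of deriving the input cap from the model config rather than
# hardcoding 256 (illustrative, not the actual lm-polygraph code).
from transformers import AutoConfig, AutoTokenizer

model_name = "gpt2"  # e.g. Llama-2 would report 4096 here
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Fall back to the tokenizer's own limit if the config doesn't expose one.
max_input_length = getattr(config, "max_position_embeddings",
                           tokenizer.model_max_length)

inputs = tokenizer("some long source text", truncation=True,
                   max_length=max_input_length, return_tensors="pt")
print(max_input_length, inputs["input_ids"].shape)
```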

IINemo commented 1 month ago

@rvashurin check please