Azure / azure-openai-benchmark

Azure OpenAI benchmarking tool
MIT License

Add tpr stats output, and periodic warning when gen_tpr < 90% of max_tokens #14

Closed · michaeltremeer closed this 8 months ago

michaeltremeer commented 11 months ago

Periodically (every 10s) alerts users when the average number of generated tokens is less than 90% of max_tokens. This is important if users are using the tool to estimate total requests per second based on their real-world context/response sizes, but the models are returning considerably shorter responses in testing (due to different/automated prompts).
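For reference, a minimal sketch of the kind of check described above, assuming a list of recent per-request generated token counts is available (the function and variable names here are illustrative, not the tool's actual internals):

```python
import logging
import statistics

def warn_if_truncated(recent_generated_tokens: list[int], max_tokens: int, threshold: float = 0.9) -> None:
    """Warn when the average generated token count drops below a fraction of max_tokens (sketch only)."""
    if not recent_generated_tokens:
        return
    avg_generated = statistics.mean(recent_generated_tokens)
    if avg_generated < threshold * max_tokens:
        logging.warning(
            "average generated tokens (%.0f) is below %d%% of max_tokens (%d); "
            "throughput estimates may not reflect real-world response sizes",
            avg_generated, int(threshold * 100), max_tokens,
        )
```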

technicianted commented 10 months ago

I think a more suitable approach would be to add another metric that measures actual generated tokens so that, with each report, the user can see "attempted" and "actual". They can then decide on the relevance of the overall test results based on their use case.

michaeltremeer commented 10 months ago

Agreed, this needed a change of approach to a metrics-based output.

Those metrics should now be part of the stats output; let me know what you think of them (I added 10th/90th/avg), happy to remove some or rename them. I do think a warning is still useful though (the issue is easy to miss if you're just running a single test and haven't dug into the tool). Maybe a single one after the test is finished/terminated?

technicianted commented 10 months ago


I think p10, p90 and avg should be fine. I'm wondering, though, if it would be better to make it an aggregate over the sliding window rather than per request? That way it would match, and be comparable to, the existing gen_tpm.

michaeltremeer commented 10 months ago


I'm not sure I understand - can you explain what you mean?

technicianted commented 10 months ago


I'm suggesting we add 3 new aggregate metrics in the sliding window that represent p10, p90 and avg actual tokens generated, instead of your per-request proposal. This way users will have metrics comparable to the theoretical gen_tpm based on max_tokens.
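As a rough illustration of that suggestion, a sliding window of per-request generated token counts could expose p10/p90/avg aggregates alongside the existing rate metrics. This is only a sketch under assumed names (the benchmark's real aggregation classes may differ):

```python
from collections import deque

import numpy as np

class GenTokensWindow:
    """Sliding window of per-request generated token counts (illustrative sketch)."""

    def __init__(self, maxlen: int = 1000):
        # Keep only the most recent `maxlen` samples, mimicking a sliding window.
        self.samples: deque[int] = deque(maxlen=maxlen)

    def record(self, generated_tokens: int) -> None:
        self.samples.append(generated_tokens)

    def stats(self) -> dict:
        # Aggregate p10/p90/avg over the window so they are comparable to gen_tpm.
        if not self.samples:
            return {"gen_tokens_p10": 0.0, "gen_tokens_p90": 0.0, "gen_tokens_avg": 0.0}
        arr = np.array(self.samples)
        return {
            "gen_tokens_p10": float(np.percentile(arr, 10)),
            "gen_tokens_p90": float(np.percentile(arr, 90)),
            "gen_tokens_avg": float(arr.mean()),
        }
```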

yshahin commented 9 months ago


Practically speaking, when I run this with the existing contexts, I did not see the number change; it is almost always a constant value. Does it make sense to show the same number 3 times? I would think avg would be sufficient, maybe a 95th, but I don't see the benefit.