michaeltremeer closed this 8 months ago
I think a more suitable approach would be to add another metric that measures actual generated tokens, so that with each report the user can see "attempted" and "actual". They can then decide how relevant the overall test results are for their use case.
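To illustrate, a minimal sketch of what an "attempted" vs "actual" report line could look like (all names here are hypothetical, not the tool's actual fields):

```python
# Hypothetical sketch only - the field and function names are illustrative,
# not the benchmark tool's real API.
from dataclasses import dataclass


@dataclass
class RequestTokenStats:
    attempted: int  # max_tokens requested for the call
    actual: int     # tokens the model actually generated


def report_line(stats: RequestTokenStats) -> str:
    # One line per report showing both values side by side.
    return f"gen_tokens attempted={stats.attempted} actual={stats.actual}"


print(report_line(RequestTokenStats(attempted=500, actual=212)))
# gen_tokens attempted=500 actual=212
```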
Need a change of approach to a metrics output one.
Those metrics should be part of the stats output; let me know what you think of them (I added 10th/90th/avg), and I'm happy to remove or rename some. I do think a warning is useful though (it's easy to ignore if you're just running a single test and haven't dug into the tool). Maybe a single one after the test is finished/terminated?
I think p10, p90 and avg should be fine. I'm wondering though whether it would be better to make the aggregate over the sliding window rather than per request? That way it would match, and be comparable to, the existing gen_tpm.
I'm not sure I understand - can you explain what you mean?
I'm suggesting we add 3 new aggregate metrics in the sliding window that represent the p10, p90 and avg of actual tokens generated, instead of your proposal. This way users will have metrics comparable to the theoretical gen_tpm based on max_tokens.
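As a rough sketch of what those sliding-window aggregates could look like (hypothetical code, not the tool's implementation; the class and metric names are assumptions):

```python
# Hypothetical sketch of p10/p90/avg aggregates of actually generated tokens,
# computed over the same sliding window as the existing gen_tpm metric.
from collections import deque
import statistics


class GeneratedTokenWindow:
    def __init__(self, window_size: int = 60):
        # Keep the most recent N per-request generated-token counts.
        self.samples: deque = deque(maxlen=window_size)

    def add(self, actual_tokens: int) -> None:
        self.samples.append(actual_tokens)

    def aggregates(self) -> dict:
        # With fewer than 2 samples, percentiles are not defined; fall back
        # to the single observed value (or 0 if empty).
        if len(self.samples) < 2:
            only = float(self.samples[0]) if self.samples else 0.0
            return {"gen_tokens_p10": only, "gen_tokens_p90": only, "gen_tokens_avg": only}
        deciles = statistics.quantiles(self.samples, n=10, method="inclusive")
        return {
            "gen_tokens_p10": deciles[0],    # 10th percentile
            "gen_tokens_p90": deciles[-1],   # 90th percentile
            "gen_tokens_avg": statistics.fmean(self.samples),
        }
```

Aggregating over the window rather than per request keeps the new numbers directly comparable with the windowed gen_tpm figure.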
Practically speaking, when I run this with the existing contexts, I did not see the number change; it is almost always a constant value. Does it make sense to show the same number three times? I would think avg would be sufficient, maybe with a 95th percentile, but I don't see the benefit.
Periodically (every 10s) alerts users when the average number of generated tokens is less than 90% of max_tokens. This is important if users are using the tool to estimate total requests per second based on their real-world context/response sizes, but the models are returning considerably shorter responses in testing (due to different/automated prompts).
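A minimal sketch of how such a periodic check might be wired up (illustrative only; the function names and the hook into the stats aggregator are assumptions, not the tool's real code):

```python
# Illustrative sketch: every 10 seconds, warn if the running average of
# generated tokens drops below 90% of max_tokens. get_avg_generated is a
# hypothetical callable that returns the current average from the stats.
import logging
import threading


def start_generation_warning_loop(get_avg_generated, max_tokens: int,
                                  interval_s: float = 10.0,
                                  threshold: float = 0.9) -> None:
    def check() -> None:
        avg = get_avg_generated()
        if avg < threshold * max_tokens:
            logging.warning(
                "avg generated tokens (%.0f) is below %d%% of max_tokens (%d); "
                "requests-per-second estimates may not reflect real-world response sizes",
                avg, int(threshold * 100), max_tokens,
            )
        # Re-arm the timer so the check repeats every interval_s seconds.
        timer = threading.Timer(interval_s, check)
        timer.daemon = True
        timer.start()

    check()
```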