Closed · vince62s closed this issue 3 months ago

When scoring a large file (say, >100K records), why does it start with a high throughput (e.g., 50 it/s) and then, after a few tens of thousands of records, drop significantly (to less than half)?

Thanks
Could it be the same as https://github.com/Unbabel/COMET/issues/158?
Training throughput is influenced by many factors, but during inference batches are sorted by length to minimize padding. As a result, the longest sequences end up in the last batches, which contain many more tokens per batch than the first ones, so iterations per second drop toward the end of the run.
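As a rough illustration, here is a minimal sketch of length-sorted batching (not COMET's actual implementation; `samples` and `tokenize` are hypothetical stand-ins):

```python
def make_length_sorted_batches(samples, batch_size, tokenize):
    """Group samples of similar length together to minimize padding."""
    # Sort indices by token count so each batch holds similar-length inputs.
    order = sorted(range(len(samples)), key=lambda i: len(tokenize(samples[i])))
    # Slice the sorted indices into fixed-size batches.
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # `order` lets the caller restore scores to the original input order.
    return batches, order
```

Early batches hold the shortest sequences (few tokens, so iterations are fast), while the final batches hold the longest ones (many tokens, so iterations are slow). The total work is unchanged; only its distribution across the progress bar shifts, which is why it/s looks high at first and then falls.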
You can check the difference by setting `length_batching` to `False`.
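For reference, a small usage sketch assuming the standard COMET Python API (the model name and data below are illustrative):

```python
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {"src": "Bonjour le monde", "mt": "Hello world", "ref": "Hello world"},
    # ... more records ...
]

# length_batching=False processes samples in file order: it/s stays roughly
# flat, but total runtime usually increases because of extra padding.
output = model.predict(data, batch_size=8, gpus=1, length_batching=False)
print(output.system_score)
```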