awslabs / sockeye

Sequence-to-sequence framework with a focus on Neural Machine Translation based on PyTorch
https://awslabs.github.io/sockeye/
Apache License 2.0
1.21k stars 323 forks

tok/sec throughput #1073

Closed vince62s closed 1 year ago

vince62s commented 2 years ago

Hi,

Should the number of tokens in a batch be either source or target tokens, without the padding? https://github.com/awslabs/sockeye/blob/main/sockeye/data_io.py#L1948 If this is the number used to calculate throughput, it might be really off.

mjdenkowski commented 2 years ago

Whether or not to include padding tokens when computing throughput is an interesting question. We choose to include padding so that tokens per second (tok/sec) reflects the total number of indices processed by the GPU/CPU. For throughput metrics that aren't affected by padding, we can look at sentences per second (s/sec) and updates per second (u/sec).
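The distinction here can be made concrete with a small sketch (hypothetical, not Sockeye's actual code): a batch padded to its longest sentence processes more indices on the GPU than it contains real tokens, so the two token counts give different tok/sec figures for the same wall-clock time.

```python
# Hypothetical illustration of padded vs. real token counts in one batch.
# Each sentence's true length; the batch is padded to the longest one.
lengths = [12, 7, 30, 5]
batch_size = len(lengths)
max_len = max(lengths)

padded_tokens = batch_size * max_len  # indices the GPU actually processes
real_tokens = sum(lengths)            # tokens excluding padding

print(padded_tokens)  # 120
print(real_tokens)    # 54

# The same elapsed time yields two different throughput numbers:
elapsed = 0.5  # seconds (assumed for the example)
print(padded_tokens / elapsed)  # 240.0 tok/sec, reflects hardware work
print(real_tokens / elapsed)    # 108.0 tok/sec, reflects data throughput
```

Counting padded indices (as described above) measures how busy the device is; counting real tokens measures how fast the actual training data is consumed.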

vince62s commented 2 years ago

Hmmm, not quite. A batch can contain 25 very long sentences or 1000 very short ones, so s/sec is not very useful either. u/sec is meaningful if and only if the batching method is similar. You can say "I use 5000-token batches" but in reality average only 3000 or 3500 real tokens, which is not the same depending on how you bucket the batches; both the speed and the actual amount of data processed will differ. In the end you may have to process more updates to get through the same amount of data. (Just my two cents on trying to compare apples to apples.)
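The bucketing effect described above can be sketched as follows (a hypothetical greedy batcher, not Sockeye's implementation): a batch built under a nominal 5000-token budget, with sentences padded up to bucket boundaries, typically carries noticeably fewer real tokens than the budget suggests.

```python
# Hypothetical sketch: a nominal "5000-token" batch can carry far fewer
# real tokens, depending on bucketing and the length distribution.
import random

random.seed(0)
NOMINAL_BATCH_TOKENS = 5000  # assumed token budget per batch

def fill_batch(lengths, bucket_width):
    """Greedily fill one padded batch up to the nominal token budget.
    Each sentence is padded up to the bucket ceiling of the batch's
    longest sentence."""
    batch, max_len = [], 0
    for n in lengths:
        padded = -(-n // bucket_width) * bucket_width  # round up to bucket
        new_max = max(max_len, padded)
        if (len(batch) + 1) * new_max > NOMINAL_BATCH_TOKENS:
            break  # adding this sentence would exceed the padded budget
        batch.append(n)
        max_len = new_max
    return batch, max_len

lengths = [random.randint(5, 60) for _ in range(1000)]
batch, max_len = fill_batch(lengths, bucket_width=10)
real = sum(batch)                 # tokens actually carrying data
padded = len(batch) * max_len     # indices the device processes

print(f"nominal budget: {NOMINAL_BATCH_TOKENS}")
print(f"padded tokens:  {padded}")
print(f"real tokens:    {real}")
```

With the gap between `real` and `padded` varying per batch, two systems reporting identical nominal batch sizes can be doing quite different amounts of useful work per update, which is the apples-to-apples problem raised above.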

mjdenkowski commented 1 year ago

Thanks again for your feedback.