castorini / castor

PyTorch deep learning models for text processing
http://castor.ai/
Apache License 2.0

Different results for different batch sizes when evaluating trained models #176

Open AxelMueller opened 5 years ago

AxelMueller commented 5 years ago

Hi, first of all, thanks for making your great code and models available. I am currently trying out two of your models (MP-CNN and VDPWI) and noticed that when evaluating trained models (via --skip-training), different batch sizes give different results. For example,

python -m mp_cnn ../Castor-models/mp_cnn/mpcnn.sick.model --dataset sick --batch-size 16 --skip-training

returns a different result than

python -m mp_cnn ../Castor-models/mp_cnn/mpcnn.sick.model --dataset sick --batch-size 64 --skip-training

Have you encountered this behavior before, and do you know what the reason might be? Which result would be the correct one?

daemon commented 5 years ago

Hi,

Thanks for your interest; I've confirmed this issue. My guess is that the amount of padding depends on the batch size (sentence lengths vary, and each batch is padded to its longest sentence), and that padding is not implemented as a no-op in the model. For now, using a batch size of 1 during inference should give the correct result.
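
To illustrate the suspected cause, here is a minimal sketch (not the actual MP-CNN code; it assumes an unmasked mean pool over a Conv1d) showing how zero-padding the same sentence to a batch's longest length changes its features when the padded positions are not masked out:

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# One fixed sentence: (batch, embedding dim, sentence length)
sentence = torch.randn(1, 300, 5)
conv = torch.nn.Conv1d(300, 8, kernel_size=3, padding=2)

def features(batch_max_len):
    # Zero-pad the same sentence to the batch's longest sentence length.
    padded = F.pad(sentence, (0, batch_max_len - sentence.size(2)))
    # Unmasked pooling over the full padded length: the zeros are not a no-op.
    return conv(padded).mean(dim=2)

print(features(5))   # batch where this sentence is already the longest
print(features(12))  # batch containing a longer sentence: more padding, different output

The two printed feature vectors differ, which is consistent with evaluation results depending on how sentences happen to be batched together.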

AxelMueller commented 5 years ago

Ok, thanks for your quick reply!