OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Apache License 2.0

Why does evaluating the same content give different results each time? #404

Open wangsy998 opened 1 year ago

wangsy998 commented 1 year ago

I used the model you provided, ofa_cn_ocr_base.pt, to evaluate the same content, but the results were different each time.

Environment: Linux, Python 3.7.0, torch 1.10.0+cu102

CUDA_VISIBLE_DEVICES=6 python3 evaluate.py \
    ${data} \
    --path=${path} \
    --user-dir=${user_dir} \
    --task=ocr \
    --bpe bert \
    --batch-size=1 \
    --log-format=simple --log-interval=10 \
    --seed=7 \
    --gen-subset=${split} \
    --results-path=${result_path} \
    --beam=5 \
    --unnormalized \
    --num-workers=0 \
    --model-overrides="{\"data\":\"${data}\",\"bpe_dir\":\"${bpe_dir}\"}"
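
In case it helps with debugging, here is a minimal sketch of how one could check whether any dropout-style modules are still active at inference time. The `model` variable is hypothetical and just stands in for the OFA model loaded by evaluate.py:

```python
import torch.nn as nn

def report_active_dropout(model: nn.Module) -> None:
    """List dropout-style modules that are still in training mode."""
    for name, module in model.named_modules():
        if isinstance(module, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)) and module.training:
            print(f"dropout still active: {name} (p={module.p})")

# model = ...  # hypothetical: the OFA model loaded for evaluation
# model.eval()                  # eval mode disables dropout
# report_active_dropout(model)  # should print nothing after model.eval()
```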

I noticed that the gen_out returned by eval_ocr is not consistent across runs, and the divergence seems to start with encoder_outs in models/sequence_generator.py. May I ask which part of the model causes the unstable inference? Can stable results be obtained for the OCR task, and if so, how should that be done? Thanks in advance.
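
For reference, the generic PyTorch settings usually involved in reproducible GPU inference look roughly like this; this is a general sketch, not something taken from the OFA code:

```python
import random
import numpy as np
import torch

def make_deterministic(seed: int = 7) -> None:
    """Pin every RNG and ask cuDNN for deterministic kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # cuDNN may otherwise select non-deterministic algorithms per run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

make_deterministic(7)
```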