I'm running the evaluation script on my en-de system, following the steps in the LREC2020 directory. The test set has 35315 examples, which is how many I've translated.
However, the eval script tells me that my lines mismatch:
$ cat eval.sh
#!/usr/bin/env bash
python3 evaluate.py \
--ref-testsuite en-de.test.txt.gz \
--sense-file senses.en-de.txt \
--dist-file distances.en-de.txt \
--src-segmented src_segmented.txt \
--tgt-segmented my_out.de.tok \
--tgt-lemmatized my_out.de.conllu
$
$ bash eval.sh
Number of sentences does not match
Reference file: 35315
Segmented source file: 43481
Lemmatized system output: 43481
Segmented system output: 43481
When I print line before this message, I see defaultdict(&lt;class 'int'&gt;, {'total': 43481, 'missing_ref': 8166}), but there don't seem to be any missing refs. Notably, 43481 - 35315 = 8166, exactly the missing_ref count, which suggests each extra sentence is being counted as missing a reference. Is there something obviously wrong here?
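For diagnosis, here's a rough sketch of how I'm counting examples per file on my side (paths as in eval.sh; I'm assuming the gzipped reference is one example per line, and that the CoNLL-U output separates sentences with blank lines):

```python
import gzip

def count_plain(path):
    # One example per line in the tokenized / plain-text files.
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

def count_gzip(path):
    # Same, but for the gzipped reference test suite.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)

def count_conllu(path):
    # CoNLL-U separates sentences with blank lines, so count
    # non-empty blocks rather than raw lines.
    with open(path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")
    return len([b for b in blocks if b.strip()])

# Usage (files from eval.sh):
#   count_gzip("en-de.test.txt.gz")
#   count_plain("my_out.de.tok")
#   count_conllu("my_out.de.conllu")
```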
OK, I found the issue. My CoNLL-U file was being sentence-split on some punctuation (namely ';'), so replacing \n\n with \n at the spurious boundaries to undo the over-splitting seems to have gotten things into better shape!
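For anyone hitting the same thing, here's a sketch of the repair, under the assumption that a spurious boundary can be recognized by the preceding block ending in a ';' token (the function names and the heuristic are mine, not part of the eval script):

```python
def last_token_form(block):
    # Return the FORM (column 2) of the last token line in a
    # CoNLL-U sentence block, skipping comment lines.
    for line in reversed(block.splitlines()):
        if line and not line.startswith("#"):
            return line.split("\t")[1]
    return ""

def merge_oversplit(conllu_text):
    # CoNLL-U separates sentences with a blank line. If a block's
    # last token is ';', the sentence splitter probably broke one
    # real sentence in two, so glue it to the following block.
    # Caveat: token IDs in the merged block restart at 1 and are
    # not renumbered here.
    blocks = conllu_text.strip().split("\n\n")
    merged = []
    for block in blocks:
        if merged and last_token_form(merged[-1]) == ";":
            merged[-1] = merged[-1] + "\n" + block
        else:
            merged.append(block)
    return "\n\n".join(merged) + "\n"
```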