🐛 Bug

Currently, the library does not support sentences that contain newline characters (i.e. '\n'): instead of scoring each such sentence as a whole, it splits it into sub-sentences and computes scores for those. This is caused by how the input sentences are read (e.g. see here for the scorer code). A better approach would be to read the files as binary and then decode the individual lines. I would be happy to contribute a small PR if you feel this might be useful to other users.
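The proposed change could be sketched as below. The helper names are hypothetical, and it assumes the current reader relies on Python's string-level line splitting (e.g. `splitlines()`), which also breaks on characters such as U+0085 (NEL) or U+2028, not just on '\n':

```python
def read_lines_text(path):
    # Hypothetical current behaviour: str.splitlines() treats U+0085,
    # U+2028, etc. as line boundaries, so one input line may become
    # several "sentences".
    with open(path, encoding="utf-8") as fh:
        return fh.read().splitlines()


def read_lines_binary(path):
    # Proposed sketch: read raw bytes, split strictly on b"\n", and
    # decode each line afterwards, so only true newlines separate
    # sentences. (Empty segments are dropped here for simplicity.)
    with open(path, "rb") as fh:
        return [line.decode("utf-8") for line in fh.read().split(b"\n") if line]
```

With a file containing `"Hello\x85world\n"`, the first helper yields two lines while the second yields one, matching the `wc -l` count.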
@ricardorei
To Reproduce
Simply execute COMET on the input files, either via scoring or comparison.
Expected behaviour
If I have a file that consists of 1000 lines (i.e., as reported by wc -l output_it/src.txt), I would expect exactly 1000 sentence-level scores.
Environment
OS: Ubuntu 20.04.5 LTS (Focal Fossa)
Python 3.8.16 via Conda