Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License
312 stars 39 forks source link

preprocess.sh [: ==: unary operator expected #93

Closed thammegowda closed 1 year ago

thammegowda commented 1 year ago

https://github.com/Helsinki-NLP/OPUS-MT-train/blob/3cd0bd3f75e4d2b4b0483f7b53be4722066b7d33/scripts/preprocess-bpe.sh#L9-L17

Some nodes don't have domain name set, and as a result we get empty string from hostname -d . This leads to [: ==: unary operator expected" [1].

Recommend fix: use quotes i.e., "$(hostname -d)" to avoid this error.

If anyone reached here while trying to run pretrained models, here is a quickfix via sed

 sed -i.bak 's/\[ `\([^`]*\)` /[ "\$(\1)" /g'  path/to/model-dir/preprocess.sh

And the diff

8c8
< if [ `hostname -d` == "bullx" ]; then
---
> if [ "$(hostname -d)" == "bullx" ]; then
12c12
< elif [ `hostname -d` == "csc.fi" ]; then
---
> elif [ "$(hostname -d)" == "csc.fi" ]; then

[1] https://stackoverflow.com/questions/13617843/unary-operator-expected-error-in-bash-if-condition

thammegowda commented 1 year ago

Another issue: https://github.com/Helsinki-NLP/OPUS-MT-train/blob/ce038869e152e8a08bcb11454399b0dc8f01adb2/scripts/preprocess-txt.sh#L43

$ echo -e "line1\nline2\nline3"
line1
line2
line3

$ echo -e "line1\nline2\nline3" | perl -C -pe 's/\p{C}/ /g;'
line1 line2 line3

I think we need to preserve line breaks in pre and post processing.

jorgtied commented 1 year ago

Good points. Fixed the hostname issue above (but the script will still appear in many of the released model packages, unfortunately). The second issue was already fixed in preprocess-spm.sh but now I also pushed a fix to the preprocess-txt.sh script.