Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

Should have a clean or clean-data Makefile target #23

Open Traubert opened 4 years ago

Traubert commented 4 years ago

Data directories can end up in a broken state, which then breaks building the data target. It would be useful to have a "clean" target that cleans data, models, or both.

E.g. if the files in /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/ end up empty, the build fails like this:

for d in /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/GNOME.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/KDE4.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/OpenSubtitles.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/QED.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/Ubuntu.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/wikimedia.br-en.clean.br.gz; do \
  l=`/projappl/project_2001194/bin/pigz -cd < $d  | wc -l`; \
  if [ $l -gt 0 ]; then \
    echo "$d" | xargs basename | \
    sed -e 's#.br.gz$##' \
    -e 's#.clean$##'\
    -e 's#.br-en$##' | tr "\n" ' '         >> /local_scratch/hardwick/br-en/train/README.md; \
    echo -n "($l) "                                  >> /local_scratch/hardwick/br-en/train/README.md; \
  fi \
done
pigz: skipping: <stdin> empty
pigz: skipping: <stdin> empty
pigz: skipping: <stdin> empty
pigz: skipping: <stdin> empty
pigz: skipping: <stdin> empty
pigz: skipping: <stdin> empty
echo ""                                               >> /local_scratch/hardwick/br-en/train/README.md
echo "only one target language"
only one target language
zcat /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/GNOME.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/KDE4.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/OpenSubtitles.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/QED.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/Ubuntu.br-en.clean.br.gz /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/wikimedia.br-en.clean.br.gz  > /local_scratch/hardwick/br-en/train/opus.src.br-en.src

gzip: /scratch/clarin/hardwick/OPUS-MT-train/work/data/simple/GNOME.br-en.clean.br.gz: unexpected end of file
make[1]: *** [add-to-local-train-data] Error 1
make[1]: Leaving directory `/scratch/clarin/hardwick/OPUS-MT-train'

I was able to fix this by removing the broken data directory:

rm -rf work/data/simple
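
Until a clean target exists, a narrower cleanup than deleting the whole directory might look roughly like the sketch below. It is only an illustration: `clean_broken_gz` is a hypothetical helper, not part of OPUS-MT-train, and it assumes that a "broken" file is a `.gz` file that is either zero bytes long or fails `gzip -t`. The default path `work/data/simple` is the one from the log above.

```shell
#!/bin/sh
# Hypothetical helper (not part of OPUS-MT-train): remove only the broken
# .gz files under a data directory instead of deleting the whole directory.

clean_broken_gz() {
    # Directory to scan; defaults to the path from the report above.
    dir=${1:-work/data/simple}

    # List regular files ending in .gz, one path per line.
    # Paths containing newlines are not supported by this loop.
    find "$dir" -type f -name '*.gz' | while IFS= read -r f; do
        # Treat a file as broken if it is empty (! -s) or if gzip's
        # integrity test (-t) fails, e.g. because it was truncated.
        if [ ! -s "$f" ] || ! gzip -t "$f" 2>/dev/null; then
            echo "removing broken file: $f"
            rm -f -- "$f"
        fi
    done
}
```

Calling `clean_broken_gz work/data/simple` and then re-running the build should cause only the removed files to be regenerated, assuming the Makefile rebuilds missing data files on demand.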