EdinburghNLP / nematus

Open-Source Neural Machine Translation in Tensorflow
BSD 3-Clause "New" or "Revised" License

DataLossError (see above for traceback): Unable to open table file /wmt16_systems/en-de/model.npz: Data loss: not an sstable #88

Closed simonefrancia closed 5 years ago

simonefrancia commented 5 years ago

Hi, I am trying to use the pretrained en-de model from http://data.statmt.org/rsennrich/wmt16_systems/ and translate English sentences with this script:

# this sample script translates a test set, including
# preprocessing (tokenization, truecasing, and subword segmentation),
# and postprocessing (merging subword units, detruecasing, detokenization).

# instructions: set paths to mosesdecoder, subword_nmt, and nematus,
# then run "./translate.sh < input_file > output_file"

# suffix of source language
SRC=en

# suffix of target language
TRG=de

# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=../../mosesdecoder

# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=../../subword-nmt

# path to nematus ( https://www.github.com/rsennrich/nematus )
nematus=../../nematus

# theano device
device=cpu

# preprocess
$mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $SRC | \
$mosesdecoder/scripts/tokenizer/tokenizer.perl -l $SRC -penn | \
$mosesdecoder/scripts/recaser/truecase.perl -model truecase-model.$SRC | \
$subword_nmt/apply_bpe.py -c $SRC$TRG.bpe | \
# translate
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=$device,on_unused_input=warn python $nematus/nematus/translate.py \
     -m model.npz \
     -k 12 -n | \
#-n -p 1 --suppress-unk | \
# postprocess
sed 's/\@\@ //g' | \
$mosesdecoder/scripts/recaser/detruecase.perl | \
$mosesdecoder/scripts/tokenizer/detokenizer.perl -l $TRG

When I execute ./translate.sh < en_text.txt > output.txt, I get this error:

DataLossError (see above for traceback): Unable to open table file /wmt16_systems/en-de/model.npz: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
     [[{{node model0/save/RestoreV2}} = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_model0/save/Const_0_0, model0/save/RestoreV2/tensor_names, model0/save/RestoreV2/shape_and_slices)]]

ERROR: Translate worker process 600 crashed with exitcode 1
Warning: No built-in rules for language de.
Detokenizer Version $Revision: 4134 $
Language: de

Could you give me any suggestions? Thanks

pjwilliams commented 5 years ago

Hi,

Earlier this year, Nematus switched from the Theano toolkit to TensorFlow. It looks like you're trying to use the TensorFlow version of Nematus (i.e. the current master) with a Theano model. You have a couple of options: one is to convert the model from Theano format to TensorFlow format using a conversion script that comes with Nematus. The command should look something like this:

CUDA_VISIBLE_DEVICES= python $nematus_home/nematus/theano_tf_convert.py \
     --from_theano \
     --in model.l2r.ens1.npz \
     --out tf-model.l2r.ens1

The other option is to use the Theano version of Nematus, which is on the 'theano' branch of the repository. Note that this code is no longer actively maintained.
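Putting the first option together, a minimal end-to-end sketch might look like this (the paths $nematus_home and $model_dir are assumptions here; adjust them to your checkout and model location):

```shell
# Assumed paths -- adjust to your setup.
nematus_home=../../nematus
model_dir=/wmt16_systems/en-de

# 1. Convert the Theano-format model.npz into TensorFlow checkpoint files.
CUDA_VISIBLE_DEVICES= python $nematus_home/nematus/theano_tf_convert.py \
    --from_theano \
    --in $model_dir/model.npz \
    --out $model_dir/tf-model

# 2. Point translate.py at the converted checkpoint instead of model.npz.
python $nematus_home/nematus/translate.py \
    -m $model_dir/tf-model \
    -k 12 -n
```

After conversion, the rest of the pipeline (BPE, truecasing, detokenization) stays exactly as in the original script; only the model argument changes.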

Best wishes, Phil


simonefrancia commented 5 years ago

Thanks for the response. Following this issue (https://github.com/marian-nmt/marian/issues/219), I am using Nematus models inside Marian-NMT with good results, as you can see:

Command:

./marian-decoder \
    --type nematus \
    --models /wmt17_systems/de-en/model.l2r.ens1.npz \
    --vocabs /wmt17_systems/de-en/vocab.de.json /wmt17_systems/de-en/vocab.en.json \
    --dim-vocabs 74383 51100 \
    --enc-depth 1 \
    --enc-cell-depth 4 \
    --enc-type bidirectional \
    --dec-depth 1 \
    --dec-cell-base-depth 8 \
    --dec-cell-high-depth 1 \
    --dec-cell gru-nematus \
    --enc-cell gru-nematus \
    --tied-embeddings true \
    --layer-normalization true

INPUT:

Verbrachte 24 Stunden
Ich brauche mehr Stunden mit dir
Du hast das Wochenende verbracht
Gleich werden, ooh ooh
Wir haben die späten Nächte verbracht
Dinge richtig machen, zwischen uns
Aber jetzt ist alles gut, Baby
Rollen Sie das Backwood-Baby
Und spiel mich in der Nähe

OUTPUT:

UK@@ 24 hours
I need more hours with you
you 've spent the weekend
Vilnius
we 've spent the late nights
things Right unify us
but now everything IS Baby
roll the Atlantic
and play me in the video@@ game me nearby

I have two questions:

  1. With how many of the Nematus pretrained models (different languages) can I use this approach? For example, could I apply the same command with the same parameters to the en-ru model (http://data.statmt.org/wmt17_systems/en-ru/)?

  2. Is there also a post-processing step that handles special parts of the output like "UK@@" and "video@@"?

Thanks

rsennrich commented 5 years ago
  1. This should work with all 11 language pairs on http://data.statmt.org/wmt17_systems/

  2. Each directory has the script postprocess.sh, which performs the necessary post-processing. For example, see http://data.statmt.org/wmt17_systems/de-en/postprocess.sh
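For illustration, the subword-merging part of that post-processing is the same `sed` substitution used in the translate script earlier in this thread: "@@ " marks a BPE split point, and deleting it rejoins the fragments (a minimal sketch; detruecasing and detokenization would normally follow):

```shell
# "@@ " is the separator left by BPE segmentation;
# removing it merges subword pieces back into full words.
echo 'UK@@ 24 hours of video@@ game fun' | sed 's/@@ //g'
# → UK24 hours of videogame fun
```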

simonefrancia commented 5 years ago

Thanks for the clear response. Another question: for some language pairs there is no pretrained model, so I would like to train my own. For this purpose, is it sufficient to follow these instructions (http://data.statmt.org/wmt17_systems/training/)? I will try to run my own training in the coming days.

Thanks in advance

rsennrich commented 5 years ago

Yes, these instructions should help you train your own model. You may want to change some things, e.g. the preprocessing, depending on the language pair.

simonefrancia commented 5 years ago

Thanks a lot!