UKPLab / coling2016-pcrf-seq2seq

An adaptation of MarMot morphological tagger for generic sequence-to-sequence tasks

Marmot exception "tag_index out of range" #1

Open mbollmann opened 7 years ago

mbollmann commented 7 years ago

When trying to use train_complex.sh on some of my own data, I'm getting this exception from Marmot which I have no idea how to debug:

Exception in thread "main" java.lang.RuntimeException: tag_index out of range: 1 : [feat_1_0=n#feat_1_1=d#feat_1_2=e#feat_1_3=n#feat_1_4=#feat_1_5=-#feat_1_6=-#feat_1_7=-#feat_1_8=-#feat_2_0=nd#feat_2_1=de#feat_2_2=en#feat_2_3=n#feat_2_4=-#feat_2_5=--#feat_2_6=--#feat_2_7=--#feat_3_0=nde#feat_3_1=den#feat_3_2=en#feat_3_3=n-#feat_3_4=--#feat_3_5=---#feat_3_6=---#feat_4_0=nden#feat_4_1=den#feat_4_2=en-#feat_4_3=n--#feat_4_4=---#feat_4_5=----]

        at marmot.morph.io.SentenceReader$1.check_index(SentenceReader.java:125)
        at marmot.morph.io.SentenceReader$1.next(SentenceReader.java:59)
        at marmot.morph.io.SentenceReader$1.next(SentenceReader.java:31)
        at marmot.morph.cmd.Trainer.train(Trainer.java:35)
        at marmot.morph.cmd.Trainer.main(Trainer.java:74)

Using marmot-2015-10-22.jar.

This doesn't happen with the supplied Twitter sample data, FWIW.

SteffenEger commented 7 years ago

"tag_index out of range: 1" indicates that column 1 is missing from an intermediate file that is generated during training (see below). Are you sure that your original data is in the correct format? The input x and the output y should be separated by a tab. Moreover, all characters in x and all characters in y should be separated by ordinary spaces.
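A quick way to rule out a format problem is to scan the training file before starting a run. The following is only a minimal sketch of such a check, not part of this repository: the function names are made up for illustration, and it assumes that blank lines can be skipped and that neither x nor y may be empty.

```python
def check_training_line(line):
    """Return a description of the first format problem found, or None."""
    line = line.rstrip("\n")
    parts = line.split("\t")
    if len(parts) != 2:
        # Exactly one tab should separate input x from output y.
        return "expected exactly one tab, found %d" % (len(parts) - 1)
    for name, side in zip(("x", "y"), parts):
        if not side.strip(" "):
            # Assumption: an empty side is treated as an error here.
            return "%s is empty" % name
        for char in side:
            # Only the ordinary space (U+0020) may separate characters.
            if char.isspace() and char != " ":
                return "%s contains non-ordinary whitespace %r" % (name, char)
    return None


def check_training_file(path):
    """Return (line number, problem) pairs for every suspicious line."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # Assumption: blank lines are harmless.
            problem = check_training_line(line)
            if problem:
                problems.append((number, problem))
    return problems
```

Calling `check_training_file` on the training file and printing the result should point at any line that lacks the tab or uses something like a non-breaking space between characters.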

The intermediate file that is generated for Marmot is stored in tmp/. This file should have 3 columns: column 0 is the input, column 1 is the output (this appears to be missing), and column 2 is the features.
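To see which rows of the intermediate file are short, something like the sketch below could help. It is an illustration under assumptions: the helper name is hypothetical, the columns are assumed to be tab-separated, and blank lines are assumed to be separators that can be ignored.

```python
def rows_missing_columns(lines, expected=3, sep="\t"):
    """Return (line number, columns found) for rows with too few columns."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # Assumption: blank lines only separate sequences.
        found = len(line.split(sep))
        if found < expected:
            bad.append((number, found))
    return bad
```

Passing an open file handle for the file in tmp/ as `lines` works, since the function only iterates over lines. A row that reports a single column would match the exception above, where only the feature string was present.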

mbollmann commented 7 years ago

Okay, apparently this was my fault: the training file was in the correct format, but I tried to restart the training process in the middle after a crash (the code silently assumes the existence of a tmp/ subdirectory and fails without it -- the same applies later on to a MODELS_cl/ subdirectory for saving the model). When I started it again from the beginning, I got no such error.

Unfortunately the training fails due to memory issues now. By default, you allocate 120GB heap space to the Java process, which is a dangerous default setting IMO (it bogged down my entire system the first time I ran it). When I changed it to allocate less, Marmot just fails with "OutOfMemoryError". Not sure how to proceed apart from upgrading to huge amounts of RAM...

SteffenEger commented 7 years ago

Good points. We should have documented the necessity of creating the respective subdirectories. I'll update the README in the upcoming days.
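Until the README is updated, the two directories can be created up front. This is a minimal sketch, assuming both tmp/ and MODELS_cl/ are expected relative to the directory the training script is started from; the helper name is made up for illustration.

```python
import os

# Directory names taken from this thread.
REQUIRED_DIRS = ("tmp", "MODELS_cl")


def ensure_dirs(base="."):
    """Create any missing directory under base and return the new paths."""
    created = []
    for name in REQUIRED_DIRS:
        path = os.path.join(base, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(path)
    return created
```

Running `ensure_dirs()` once in the working directory before calling train_complex.sh should avoid both failures; a second call creates nothing and returns an empty list.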

Concerning the memory: 120G was not a problem for our machines. Maybe you can find a smaller value that actually works for your machine and problem. Alternatively, you can use a smaller order and/or a smaller context size. Good orders for seq2seq problems seem to be up to 7, but you might already get a good system with orders 2-3.
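One way to pick a starting value is to derive the heap size from the machine's physical memory rather than hard-coding it. The sketch below is only an illustration: the function names and the 50% fraction are assumptions, and the memory lookup relies on `os.sysconf` names that exist on Linux but not on every platform. `-Xmx` is the standard JVM option for the maximum heap size.

```python
import os


def heap_flag(total_bytes, fraction=0.5):
    """Build a JVM -Xmx flag that uses a fraction of the given memory."""
    gib = int(total_bytes * fraction) // (1024 ** 3)
    # Never go below 1 GiB, otherwise the flag would be -Xmx0g.
    return "-Xmx%dg" % max(gib, 1)


def physical_memory_bytes():
    """Total physical memory; these sysconf names are Linux-specific."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
```

On a machine with 32 GiB of RAM, `heap_flag(physical_memory_bytes())` gives a flag of `-Xmx16g`, which leaves room for the rest of the system. Whether that is enough for a given order and context size still has to be found by trial.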

mbollmann commented 7 years ago

I'll try your suggestions, thanks! (Or maybe try letting it run overnight when I'm not actively using the machine.)