gchrupala / morfette

Supervised learning of morphology
BSD 2-Clause "Simplified" License

morfette train error #29

Closed Andres-Chandia closed 5 years ago

Andres-Chandia commented 5 years ago

bin/morfette train data/md/training-file.txt data/md/model/
morfette: GramLab.Morfette.Token.readDouble: "input does not start with a digit"
CallStack (from HasCallStack):
  error, called at src/GramLab/Morfette/Token.hs:68:22 in main:GramLab.Morfette.Token

What should I do, if anything?

gchrupala commented 5 years ago

Looks like your input is misformatted.

Andres-Chandia commented 5 years ago

Why? The error says that the input does not start with a digit, but the README says:

Morfette expects both training and testing data to be tokenized and split into sentences. The format of training data look like this:

Gómez Gómez np0000p
sostiene sostener vmip3s0

Nothing there starts with a digit...

And mine looks like this:

iñche iñche PP000
ñi ñi SP000
ñuke ñuke NN000
ngümay ngümay IV00000000000000000000000000000000+IND@üy4+3@Ø30000
. . PCT

gchrupala commented 5 years ago

I suspect that somewhere in your input there is a line with 4 fields instead of 3, and morfette is trying to parse the additional field as the optional format with a word embedding input.

Andres-Chandia commented 5 years ago

No, I've just checked and there is no fourth column anywhere. Here is some more context. Could it be a blank line?

iñche iñche PP000
ñi ñi SP000
ñuke ñuke NN000
ngümay ngümay IV00000000000000000000000000000000+IND@üy4+3@Ø30000
. . PCT

lelifin leli TV000000000000000000000000000000+EDO@fi600+IND1SG@ün30000
ñi ñi SP000
ñuke ñuke NN000
. . PCT

lelin leli TV000000000000000000000000000000000+IND1SG@ün30000
ñi ñi SP000
ñuke ñuke NN000
. . PCT

gchrupala commented 5 years ago

I can't reproduce the error with this sample; it works fine for me. If you can share the whole input file, I can have a look.

Andres-Chandia commented 5 years ago

Here it is:

mapudungun_training-file.txt

gchrupala commented 5 years ago

Line 5516 of this file contains four fields:

en pu. en LOC

Please note that any sequence of whitespace is treated as a field separator.
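
For anyone hitting the same error, a check along these lines may help locate the offending line. This is only a minimal sketch in Python, not part of Morfette: the function name `find_bad_lines` and the sample data are made up for illustration, and it assumes the plain three-field format (form, lemma, tag) with blank lines between sentences. Python's `str.split()` with no argument splits on any run of whitespace, which approximates the behaviour described above but is not guaranteed to match Morfette's tokenization exactly.

```python
def find_bad_lines(lines, expected_fields=3):
    """Return (line_number, fields) for every non-blank line whose
    whitespace-separated field count differs from expected_fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        # split() with no argument treats any run of spaces or tabs
        # as a single separator and ignores leading/trailing whitespace.
        fields = line.split()
        # Blank lines (sentence separators) yield no fields and are skipped.
        if fields and len(fields) != expected_fields:
            bad.append((number, fields))
    return bad


# Hypothetical sample: tab-separated fields, with a stray space in line 3.
sample = [
    "ñi\tñi\tSP000\n",
    "\n",
    "en pu.\ten\tLOC\n",
]

for number, fields in find_bad_lines(sample):
    print(f"line {number}: {len(fields)} fields: {fields}")
```

To run it on a real file, something like `find_bad_lines(open("data/md/training-file.txt", encoding="utf-8"))` should work, since the function accepts any iterable of lines. Files that deliberately use the optional four-field format would need `expected_fields` adjusted accordingly.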

Andres-Chandia commented 5 years ago

OK, thanks, and sorry. That dot shouldn't have been followed by a space; it should have been pu.en. You are right: I was only looking for tab separators and forgot about spaces. Sorry again.

gchrupala commented 5 years ago

No problem, happy to have helped.