Closed Andres-Chandia closed 5 years ago
Looks like your input is misformatted.
Why? The error says that the input does not start with a digit, but the readme says: Morfette expects both training and testing data to be tokenized and split into sentences. The format of training data look like this:
Gómez Gómez np0000p
sostiene sostener vmip3s0
No digit starting the file...
And mine looks like this:
iñche iñche PP000
ñi ñi SP000
ñuke ñuke NN000
ngümay ngümay IV00000000000000000000000000000000+IND@üy4+3@Ø30000
. . PCT
I suspect that somewhere in your input there is a line with 4 fields instead of 3, and morfette is trying to parse the additional field as the optional format with a word embedding input.
No, I've just checked and there is no fourth column anywhere. Here is some more context. Could it be a blank line?
iñche iñche PP000
ñi ñi SP000
ñuke ñuke NN000
ngümay ngümay IV00000000000000000000000000000000+IND@üy4+3@Ø30000
. . PCT

lelifin leli TV000000000000000000000000000000+EDO@fi600+IND1SG@ün30000
ñi ñi SP000
ñuke ñuke NN000
. . PCT

lelin leli TV000000000000000000000000000000000+IND1SG@ün30000
ñi ñi SP000
ñuke ñuke NN000
. . PCT
I can't reproduce the error with this sample; it works fine for me. If you can share the whole input file, I could have a look.
Here you have it:
Line 5516 of this file contains four fields:
en pu. en LOC
Please note that any sequence of whitespace is treated as a field separator.
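A quick way to catch lines like this is to split every line on whitespace and count the fields. The sketch below is not part of Morfette; it assumes one token per line with exactly three fields and blank lines between sentences, and the function name `find_bad_lines` is made up for illustration. Python's `str.split()` with no argument splits on any run of whitespace, which approximates the behaviour described above.

```python
def find_bad_lines(lines, expected=3):
    """Return (line_number, fields) for each non-blank line whose
    whitespace-separated field count differs from `expected`."""
    bad = []
    for number, line in enumerate(lines, start=1):
        # No argument: split on any run of spaces, tabs, etc.
        fields = line.split()
        # Blank lines (sentence separators) yield no fields and are skipped.
        if fields and len(fields) != expected:
            bad.append((number, fields))
    return bad


# Small in-memory sample; the third line has a stray space after "pu."
sample = [
    "iñche iñche PP000",
    "",
    "en pu. en LOC",
    "ñuke\tñuke\tNN000",
]

for number, fields in find_bad_lines(sample):
    print(f"line {number}: {len(fields)} fields: {fields}")
```

To check a real file, pass an open file handle instead of the list, for example `find_bad_lines(open("data/md/training-file.txt", encoding="utf-8"))`.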
OK, thanks, and sorry. That's a dot that shouldn't have had a space after it; it should have been pu.en. You are right, I was only looking for tab separators and forgot about spaces. Sorry again.
No problem, happy to have helped.
bin/morfette train data/md/training-file.txt data/md/model/
morfette: GramLab.Morfette.Token.readDouble: "input does not start with a digit"
CallStack (from HasCallStack):
  error, called at src/GramLab/Morfette/Token.hs:68:22 in main:GramLab.Morfette.Token
What should I do, if anything?