emanjavacas / pie

A fully-fledged PyTorch package for Morphological Analysis, tailored to morphologically rich and historical languages.
MIT License

Dimension out of range #7

Closed PonteIneptique closed 5 years ago

PonteIneptique commented 5 years ago

Hey there, when I train with the latest version on my old dataset, I quite quickly run into this issue:

2018-11-21 11:11:35,279 : Starting epoch [1]
2018-11-21 11:11:36,299 : Batch [10/227] || lemma:3.975   || 5027 w/s
2018-11-21 11:11:36,981 : Batch [20/227] || lemma:2.957   || 6973 w/s
2018-11-21 11:11:37,679 : Batch [30/227] || lemma:2.728   || 7075 w/s
2018-11-21 11:11:38,342 : Batch [40/227] || lemma:2.601   || 7246 w/s
Traceback (most recent call last):
  File "train.py", line 165, in <module>
    scores = trainer.train_epochs(settings.epochs, devset=devset)
  File "/home/thibault/dev/pie/pie/trainer.py", line 341, in train_epochs
    self.train_epoch(devset, epoch)
  File "/home/thibault/dev/pie/pie/trainer.py", line 292, in train_epoch
    for b, batch in enumerate(self.dataset.batch_generator()):
  File "/home/thibault/dev/pie/pie/data/dataset.py", line 509, in batch_generator
    yield from self.prepare_buffer(buf, return_raw=return_raw)
  File "/home/thibault/dev/pie/pie/data/dataset.py", line 486, in prepare_buffer
    packed = self.pack_batch(batch, **kwargs)
  File "/home/thibault/dev/pie/pie/data/dataset.py", line 466, in pack_batch
    return pack_batch(self.label_encoder, batch, device or self.device)
  File "/home/thibault/dev/pie/pie/data/dataset.py", line 522, in pack_batch
    word = torch_utils.pad_batch(word, label_encoder.word.get_pad(), device=device)
  File "/home/thibault/dev/pie/pie/torch_utils.py", line 169, in pad_batch
    output[0:lengths[i], i].copy_(
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)

I have had the same issue with morph, and I reduced the tasks as much as possible to make sure the issue was not in my attempt at configuring other tasks. Let me know if you need anything else.

emanjavacas commented 5 years ago

Hi. I am sure this is caused by an empty field in your data: probably a lemma has length 0. Make sure this is not the case (i.e. that all lines have the same number of fields).
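A quick way to check this (a hypothetical helper, not part of pie; the tab-separated layout is an assumption) is to scan the file and report any non-blank line with an empty field or an inconsistent number of columns:

```python
# Hypothetical sanity check (not part of pie): report lines with a
# missing field or an inconsistent number of tab-separated columns.
def find_bad_lines(lines, sep="\t"):
    bad, expected = [], None
    for num, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:                      # blank line = sentence boundary
            continue
        fields = line.split(sep)
        if expected is None:              # take first line as the reference
            expected = len(fields)
        if len(fields) != expected or any(not f for f in fields):
            bad.append((num, line))
    return bad

sample = ["token\tlemma\tpos", "uiua\t\tNOM", "", "alia\talia\tADJ"]
print(find_bad_lines(sample))  # flags line 2, which has an empty lemma field
```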

PonteIneptique commented 5 years ago

Yup, sorry about this. Indeed, some of the data were empty. Somehow this was not caught by previous versions, or I was running it incorrectly before.

PonteIneptique commented 5 years ago

For some reason, some of my files were corrupted at some point. Sorry for not spotting that before opening the issue.

PonteIneptique commented 5 years ago

So, I dug a little further: the LineParser seems to emit an empty sentence ([], {'lemma': [], 'pos': [], 'morph': []}), which I am currently actively trying to track down in my data (so far without success...)

PonteIneptique commented 5 years ago

So my regex failed, and I actually had double empty lines in some places.
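For reference, a small standalone sketch (not part of pie) that locates runs of consecutive blank lines, which are exactly what makes the parser emit empty sentences:

```python
# Hypothetical helper (not part of pie): report the line numbers of
# blank lines that directly follow another blank line.
def find_double_blanks(lines):
    hits, prev_blank = [], False
    for num, line in enumerate(lines, start=1):
        blank = not line.strip()
        if blank and prev_blank:          # second blank line in a row
            hits.append(num)
        prev_blank = blank
    return hits

lines = ["a\ta\tA", "", "", "b\tb\tB", "", "c\tc\tC"]
print(find_double_blanks(lines))  # line 3 is a second consecutive blank
```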

I wonder whether you'd be interested in warning people in this kind of situation, telling them that line Z is the problem when a sentence turns out empty? I did that with:


                # sentence break
                if not line:
                    if len(parser.inp) == 0:
                        print("Line {} is breaking everything".format(line_num))
                    yield parser.inp, parser.tasks
                    parser.reset()
                    continue

which was pretty useful. Some sanity check here would probably be a good idea (e.g. yield only if len(parser.inp) :) ). You know, for people with poorly formatted data... ;)
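That guard could be folded into the generator itself. A minimal sketch, assuming a simple token/lemma layout (the inp/tasks names are taken from the snippet above; everything else here is an assumption, not pie's actual code):

```python
# Sketch of the suggested sanity check: only yield a sentence if it is
# non-empty, and warn with the line number when a blank line would
# otherwise produce an empty sentence.
import warnings

def sentences(lines):
    inp, tasks = [], {"lemma": []}
    for line_num, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:                      # sentence break
            if inp:                       # guard: skip empty sentences
                yield inp, tasks
            else:
                warnings.warn("empty sentence ending at line %d" % line_num)
            inp, tasks = [], {"lemma": []}
            continue
        token, lemma = line.split("\t")[:2]
        inp.append(token)
        tasks["lemma"].append(lemma)
    if inp:                               # flush the last sentence
        yield inp, tasks

data = ["a\tA", "b\tB", "", "", "c\tC"]
out = list(sentences(data))               # two sentences; the double
                                          # blank line only triggers a warning
```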

emanjavacas commented 5 years ago

Try removing the first empty line. I should add a check on line breaks for whether the sentence is empty.


PonteIneptique commented 5 years ago

> Try removing the first empty line. I should add a check on line breaks for whether the sentence is empty.

In French: "Les grands esprits se rencontrent" ("great minds think alike", which is a pretty pedantic thing to say now that I think about it ;) )