TakeLab / spacy-udpipe

spaCy + UDPipe
MIT License

[E190] Token head out of range in `Doc.from_array()` for token index '1' with value '11' #23

Closed: andyweizhao closed this issue 3 years ago

andyweizhao commented 4 years ago

Hi, I encountered an error when running the following code:

    import spacy_udpipe

    s = "ou visitez la page mondiale d'accueil Wide Web du GAO à"
    udpipe = spacy_udpipe.load('fr')
    tokens, feats = udpipe(s)

My environment is:

    spacy==2.2.4
    spacy-udpipe==0.1.0

Thanks!

asajatovic commented 4 years ago

Hi!

Please upgrade spacy-udpipe to at least v0.2.1 (preferably the latest version). Also, note that `spacy_udpipe.load` returns an instance of the `UDPipeLanguage` class, which follows the `spacy.Language` API. When called on a text, that instance returns a single `Doc` object, not a tuple.

dimitarsh1 commented 4 years ago

Hi

I have the same issue sometimes. I have spacy 2.2.4 and spacy-udpipe 0.3.0.

Can it be related to the size of the data?

Cheers, Dimitar

asajatovic commented 4 years ago

Hi Dimitar,

It would be very helpful if you could provide a code snippet and the input text that cause this issue.

Cheers, Antonio

dimitarsh1 commented 4 years ago

Hi Antonio,

Following is a snippet.

    ...
    docs = list(nlpD.pipe(sentences, n_process=-1))
    with open(system_name + ".spacy_udpipe.model", "wb") as SpUpM:
        pickle.dump(docs, SpUpM)
    print("Model built from scratch")

    # Flatten all Docs into a single list of tokens.
    nlps = []
    for doc in docs:
        nlps.extend(doc)
    lemmas = {}

    for token in nlps:
        lemma = token.lemma_
        ...

Unfortunately I cannot share the text, as it is around 2M sentences and more than 250 MB (UTF-8 encoded).

Thanks a lot for looking into this. Kind regards, Dimitar

dimitarsh1 commented 4 years ago

Hi,

Sometimes it's annoying when bugs turn out not to be bugs at all... but when you combine multiple tools, large data sets, and multiprocessing, it is easy to lose track.

I ran spacy_udpipe without multiprocessing, printing every sentence and its index, and identified the problem: an empty line in the input, which causes spaCy to crash.

I should have thought of this earlier, but as I said, too many things were going on and I overlooked it.

Hope this helps. And maybe there should be a try/except somewhere.

I used:

    sentences = [s for s in ifh.readlines() if s.strip()]

Now it works just fine with multiprocessing too :+1:
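One subtlety with this kind of filter: `readlines()` keeps each line's trailing newline, so a blank line arrives as `"\n"`, which is a truthy string; stripping before testing emptiness is what reliably drops it. A minimal self-contained sketch (the sample lines are illustrative, not from the actual corpus):

```python
def non_empty_sentences(lines):
    """Drop blank and whitespace-only lines before feeding text to nlp.pipe().

    readlines() keeps the trailing newline, so an "empty" line arrives
    as "\n", a truthy string. Strip each line before testing emptiness.
    """
    return [s.strip() for s in lines if s.strip()]

# A blank line and a whitespace-only line are both removed.
lines = ["Bonjour tout le monde.\n", "\n", "   \n", "Deuxième phrase.\n"]
sentences = non_empty_sentences(lines)
# sentences == ["Bonjour tout le monde.", "Deuxième phrase."]
```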

@andyweizhao can you check if that is the case for you too?

Cheers, Dimitar