explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
30.21k stars 4.4k forks

Input array dimensions error when parsing doc with merged entities #2006

Closed kevinrosenberg21 closed 6 years ago

kevinrosenberg21 commented 6 years ago

Hi,

As I mentioned in other posts, I'm working on a dependency parser that expects to receive an entity-recognized Doc.

I'm able to create the training set and successfully train the model, but I get an error when I try to parse a new text.

I've defined functions for merging the entities and for treating the whole document as a single sentence:

def merge_entities(doc):
    # Collapse each recognized entity span into a single token
    for ent in doc.ents:
        ent.merge(tag=ent.root.tag_, lemma=ent.text, ent_type=ent.label_)
    return doc

def custom_sbd(doc):
    # Force the whole document to be a single sentence
    doc[0].sent_start = True
    for i in range(1, len(doc)):
        doc[i].sent_start = False
    #doc.is_parsed = True
    return doc
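On newer spaCy versions (v2.1+), `Span.merge` is deprecated in favor of the `Doc.retokenize` context manager; a rough equivalent of the `merge_entities` function above with that API would be (a sketch, not tested against the custom models in this issue):

```python
def merge_entities_retok(doc):
    # Same effect as merge_entities above, using the v2.1+ retokenizer API:
    # each entity span becomes one token carrying the span-level attributes.
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent, attrs={
                "TAG": ent.root.tag_,
                "LEMMA": ent.text,
                "ENT_TYPE": ent.label_,
            })
    return doc
```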

I insert them into the pipeline via the function I use to load the model:

def load_model(model_dir):
    import spacy
    from utils import clean_text
    nlp = spacy.load(model_dir+'/ner')
    nlp.tokenizer = custom_tokenizer(nlp)
    nlp_parser = spacy.load(model_dir+'/parser')
    nlp_parser.tokenizer = custom_tokenizer(nlp_parser)
    def make_doc(txt):
        txt = clean_text(txt)
        doc = nlp(txt)
        doc = merge_entities(doc)
        doc = custom_sbd(doc)
        return doc
    nlp_parser.make_doc = make_doc
    return nlp_parser

This is because I couldn't get everything working in a single model, so I ended up creating two, one for NER and one for parsing, and joining them with that function. Not the most elegant solution, but it seems to work.

When I parse the doc

nlp = load_model(model_dir)
#nlp.disable_pipes('parser')
txt = u"can't post the text I'm using because of privacy issues"
doc1 = nlp(txt)

I get the following error

ValueError: all the input array dimensions except for the concatenation axis must match exactly

With the following stack trace:

File "/home/kevin/anaconda3/lib/python3.5/site-packages/spacy/language.py", line 333, in __call__
    doc = proc(doc)
File "nn_parser.pyx", line 341, in spacy.syntax.nn_parser.Parser.__call__
File "nn_parser.pyx", line 786, in spacy.syntax.nn_parser.Parser.set_annotations
File "doc.pyx", line 851, in spacy.tokens.doc.Doc.extend_tensor
File "/home/kevin/anaconda3/lib/python3.5/site-packages/numpy/core/shape_base.py", line 288, in hstack
    return _nx.concatenate(arrs, 1)

If I uncomment the #nlp.disable_pipes('parser') line, it works. Similarly, if I comment out the entity merging line (doc = merge_entities(doc)) it also works. Is there something I'm doing wrong or is this a bug?

Thank you

kevinrosenberg21 commented 6 years ago

Hi everyone.

Has anyone been able to check this out? Is it a bug or something I'm doing wrong?

Thanks

kevinrosenberg21 commented 6 years ago

I've spent all afternoon looking at this and the only explanation I can come up with is that when it merges the entities it somehow doesn't re-calculate the doc's tensor, causing inconsistencies. @ines @honnibal is there a method to re-calculate it manually?

kevinrosenberg21 commented 6 years ago

Sorry for the constant commenting, but I checked it out.

doc1 = nlp(txt)
print("Before merging entities the len of the doc is: " + str(len(doc1)))
print("Before merging entities the shape of the tensor is: " + str(doc1.tensor.shape))
doc1 = merge_entities(doc1)
print("After merging entities the len of the doc is: " + str(len(doc1)))
print("After merging entities the shape of the tensor is: " + str(doc1.tensor.shape))

And the result was

Before merging entities the len of the doc is: 697
Before merging entities the shape of the tensor is: (697, 128)
After merging entities the len of the doc is: 648
After merging entities the shape of the tensor is: (697, 128)

According to the comments in the code:

    The doc.tensor attribute holds dense feature vectors
    computed by the models in the pipeline. Let's say a
    document with 30 words has a tensor with 128 dimensions
    per word. doc.tensor.shape will be (30, 128). After
    calling doc.extend_tensor with an array of shape (30, 64),
    doc.tensor.shape will be (30, 192).
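The mismatch can be reproduced with plain NumPy: per the stack trace above, `extend_tensor` ends up calling `numpy.hstack` on the old tensor and the parser's new columns, which fails when the row counts differ (the shapes below mirror the 697-vs-648 output above):

```python
import numpy as np

old = np.zeros((697, 128), dtype="float32")  # tensor computed before merging
new = np.zeros((648, 64), dtype="float32")   # parser output for the merged (shorter) doc

try:
    np.hstack([old, new])
except ValueError as err:
    # "all the input array dimensions except for the concatenation
    # axis must match exactly" on the numpy version in the trace
    print("hstack failed:", err)
```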

kevinrosenberg21 commented 6 years ago

Hi everyone.

I was able to fix the problem by resetting the tensor to the empty initial value it has in the Doc class, before running the parser.

It now works with this code:

def load_model(model_dir):
    import spacy, numpy
    from utils import clean_text
    nlp = spacy.load(model_dir+'/ner')
    nlp.tokenizer = custom_tokenizer(nlp)
    nlp_parser = spacy.load(model_dir+'/parser')
    def make_doc(txt):
        txt = clean_text(txt)
        doc = nlp(txt)
        doc = merge_entities(doc)
        doc.tensor = numpy.zeros((0,), dtype='float32')  # reset to the Doc's empty initial tensor
        doc = custom_sbd(doc)
        return doc
    nlp_parser.make_doc = make_doc
    return nlp_parser
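My reading of why the reset works, sketched with plain NumPy (an assumption from the stack trace, not the actual spaCy source): when the stored tensor is empty, `extend_tensor` can simply adopt the parser's new array instead of hstacking two arrays with mismatched row counts.

```python
import numpy as np

def extend_tensor_sketch(tensor, new_cols):
    # Rough model of Doc.extend_tensor: adopt the new array outright
    # when the stored tensor is empty, otherwise concatenate columns.
    if tensor.size == 0:
        return new_cols
    return np.hstack([tensor, new_cols])

empty = np.zeros((0,), dtype="float32")        # the reset value set in make_doc
parser_out = np.zeros((648, 64), dtype="float32")
print(extend_tensor_sketch(empty, parser_out).shape)  # (648, 64)
```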

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.