NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks
MIT License

No attribute spans or no attribute ents #51

Closed surferfelix closed 2 years ago

surferfelix commented 2 years ago

I am working on a project where I am trying to resolve conflicts in named entities. One of my steps involves using skweak.

I am experiencing the following problem.

# List of spacy doc objects; each doc object represents a sentence
docs = prepare_spacy_extensions(sents, labels, headers)
# Applying skweak on each iteration
for doc in docs:
    piped_doc = list(first_name_detector.pipe(doc))
    skweak.utils.display_entities(piped_doc)

This results in the following error:

  File "/Volumes/modules/ML_OVERLAP/skweak_implementation.py", line 116, in <module>
    perform_labeling_functions(tokens, labels, headers)
  File "/Volumes/modules/ML_OVERLAP/skweak_implementation.py", line 109, in perform_labeling_functions
    piped_doc = list(first_name_detector.pipe(doc))
  File "/Users/myuser/opt/anaconda3/envs/bolesian/lib/python3.9/site-packages/skweak/base.py", line 37, in pipe
    yield self(doc)
  File "/Users/myuser/opt/anaconda3/envs/bolesian/lib/python3.9/site-packages/skweak/base.py", line 89, in __call__
    doc.spans[self.name] = []
AttributeError: 'spacy.tokens.token.Token' object has no attribute 'spans'

When I try to perform the same action with the entire doc object, I get a different error:

line 110, in perform_labeling_functions
    skweak.utils.display_entities(piped_doc)
  File "/Users/egelm1/opt/anaconda3/envs/bolesian/lib/python3.9/site-packages/skweak/utils.py", line 734, in display_entities
    spans = doc.ents
AttributeError: 'list' object has no attribute 'ents'

Presumably you want the larger doc object, and sentence-level docs are not supported? With my current system, though, I am unable to aggregate everything into one large doc object. Are there any solutions to this problem?

plison commented 2 years ago

The pipe method is meant to take streams of documents, not a single document. So you need to use it like this:

# List of spacy doc objects; each doc object represents a sentence
docs = prepare_spacy_extensions(sents, labels, headers)
piped_docs = list(first_name_detector.pipe(docs))
# Displaying the documents one by one:
for doc in piped_docs:
    skweak.utils.display_entities(doc)

Alternatively, if you want to apply the detector one document at a time, you can do this:

# List of spacy doc objects; each doc object represents a sentence
docs = prepare_spacy_extensions(sents, labels, headers)
# Applying skweak on each iteration
for doc in docs:
    piped_doc = first_name_detector(doc)
    skweak.utils.display_entities(piped_doc)
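
To see why the first traceback complains about a Token rather than a Doc, here is a minimal sketch that reproduces the pattern without skweak or spaCy installed. ToyToken, ToyDoc and ToyAnnotator are hypothetical stand-ins, not real skweak or spaCy classes; they only mimic the two lines of skweak/base.py visible in the traceback (`yield self(doc)` and `doc.spans[self.name] = []`), and the loop inside `pipe` is an assumption about how those lines are reached.

```python
# Hypothetical stand-ins for illustration only; these are NOT skweak or spaCy classes.
class ToyToken:
    def __init__(self, text):
        self.text = text


class ToyDoc:
    """Mimics a spaCy Doc: it has a `spans` dict and iterating it yields tokens."""

    def __init__(self, words):
        self.tokens = [ToyToken(w) for w in words]
        self.spans = {}

    def __iter__(self):
        return iter(self.tokens)


class ToyAnnotator:
    """Mimics the call pattern seen in the traceback (assumed structure)."""

    def __init__(self, name):
        self.name = name

    def __call__(self, doc):
        # Raises AttributeError if `doc` is really a token, which has no `spans`.
        doc.spans[self.name] = []
        return doc

    def pipe(self, docs):
        # Iterates over whatever it is given and annotates each item.
        for doc in docs:
            yield self(doc)


annotator = ToyAnnotator("first_names")
docs = [ToyDoc(["Hello", "Alice"]), ToyDoc(["Bye", "Bob"])]

# Correct: pipe() receives a collection of docs and yields annotated docs.
piped_docs = list(annotator.pipe(docs))

# Correct: calling the annotator directly on a single doc returns that doc.
single = annotator(docs[0])

# Incorrect: pipe() receives a single doc, so the loop walks over its tokens.
try:
    list(annotator.pipe(docs[0]))
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

The second error in the question follows the same logic in reverse: `list(...pipe(docs))` produces a list of docs, so each element, not the list itself, has to be handed to a function that expects a single doc.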