explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

entity id is not retained when using multiple processes #4849

Closed AlJohri closed 4 years ago

AlJohri commented 4 years ago

The ent_id and ent_id_ attributes are not retained when nlp.pipe is run with multiple processes. Presumably they are not being serialized properly.

How to reproduce the behaviour

This example should be fully reproducible. The output looks like this (note that nothing is printed when using more than one process):

-----------------------
USING 1 PROCESS
-----------------------
Joe Biden 61 70 PERSON 12900564667790333946 joe-biden
Bernie Sanders 500 514 PERSON 17935300311999639713 bernie-sanders
-----------------------
USING >1 PROCESS
-----------------------

Code:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en')

ruler = EntityRuler(
    nlp, patterns=[
        {"label": "PERSON", "pattern": 'joe biden', "id": 'joe-biden'},
        {"label": "PERSON", "pattern": 'bernie sanders', "id": 'bernie-sanders'},
    ],
    phrase_matcher_attr="LOWER"
)

nlp.add_pipe(ruler, before="ner")

text = """
The left is starting to take aim at Democratic front-runner Joe Biden.
At a conference this week, liberal activists repeatedly booed when told that Mr. Biden wanted to find "middle ground" on climate policy. When an audience member shouted "No middle ground!" Rep. Alexandria Ocasio-Cortez, D-N.Y., replied, "No middle ground is right!" and declared: "I will be damned if the same politicians who refused to act come back today and say we need a middle-of-the-road approach to save our lives." Sen. Bernie Sanders, I-Vt., joined in her criticism: "There is no 'middle ground' when it comes to climate policy."
The left's issues with the former vice president go far beyond his position on climate policy. To the neo-socialists now driving the debate in the Democratic primary campaign, Mr. Biden's entire approach to politics - reaching across the aisle and forging compromise built on consensus - is anathema.
Mr. Biden's supposed heresy is that he believes in working with Republicans. He says on the stump that Donald Trump is an "aberration" and predicts that if the president is defeated, Republicans will work toward bipartisan reform, which Mr. Biden insists is the only way to get anything worthwhile done. "This nation cannot function without generating consensus," he said in New Hampshire this week.
Well, generating consensus is not what the left wants. It is not simply opposed to Mr. Trump. Many liberals believe, as Ms. Ocasio-Cortez has put it, that "capitalism is irredeemable." So for many Democrats, the Obama-Biden approach to governing is now considered too moderate. On climate, they don't want the government to simply invest in green energy, like President Barack Obama did. They want to spend tens of trillions of dollars to replace every vehicle that uses a combustion engine, bring high-speed rail to every corner of the country, upgrade or replace every building in the United States and eliminate all fossil-fuel energy.
On health care, they no longer make a pretense of promising voters they can keep their health plans, like Mr. Obama did. They openly advocate abolishing private insurance altogether. Mr. Biden's support for a "public option" that would give Americans a choice of buying into a Medicare-like health plan is seen on the left as capitulation. There will be no choices in the brave new world of democratic socialism. We will have government-run health care for all, whether we want it or not.
Of course, Mr. Biden is no moderate. He is an old-fashioned, liberal Democrat. But to the Sanders and Ocasio-Cortez wing of the party, that makes him too far to the right - and too willing to compromise with the far-right. I saw Mr. Biden's willingness to do so up close when I worked on the staff of the Senate Foreign Relations Committee during the 1990s. As the ranking Democrat, Mr. Biden prided himself on his ability to compromise with committee chairman Jesse Helms, R-N.C., arguably the most uncompromising conservative in the Senate. Together, they passed legislation - the so-called Helms-Biden Act - to reform the United Nations and cut deals to restructure the State Department.
"As chairman and ranking member, we passed some of the most significant legislation passed in the last 40 years," Mr. Biden explained during a 2015 speech. He continues to tout his relationship with Helms (who died more than a decade ago) on the stump as an example of how he can work with die-hard conservatives to get things done.
Is this what Democratic primary voters want? Mr. Biden's lead in the national polls suggests it may be. But it is early. After all, at this time in 2015, Scott Walker, the Wisconsin governor at the time, appeared to be the front-runner for the Republican nomination and no one was taking Mr. Trump seriously. Mr. Biden may be ahead for now, but all the energy inside the Democratic Party seems to be with the uncompromising left. It sees Mr. Biden standing in the way of its takeover of the Democratic Party. So as his lead in the polls expands, their efforts to stop him - and his heretical calls for compromise - will escalate.
"We have to unify this country," Mr. Biden said at a speech in Iowa earlier this month. "The other side is not my enemy, it's my opposition." How sad that has become a controversial statement.
"""

print('-----------------------')
print('USING 1 PROCESS')
print('-----------------------')

for doc in nlp.pipe([text], n_process=1):
    for x in doc.ents:
        if x.ent_id > 0:
            print(x, x.start_char, x.end_char, x.label_, x.ent_id, x.ent_id_)

print('-----------------------')
print('USING >1 PROCESS')
print('-----------------------')

for doc in nlp.pipe([text], n_process=2):
    for x in doc.ents:
        if x.ent_id > 0:
            print(x, x.start_char, x.end_char, x.label_, x.ent_id, x.ent_id_)

Info about spaCy

I'm using master branch as of commit 3431ac42de470a4bb73f1c6852a5ccffc07da7b1.

AlJohri commented 4 years ago

I just want to add that custom attributes are also not retained when using multiple processes. I tried running:

from spacy.tokens import Span

# register the custom attribute once, before running the pipeline
Span.set_extension('entity_id', default=None)

# include this at the end of my custom component
for ent in doc.ents:
    ent._.entity_id = ent.ent_id_

The custom attribute entity_id is also not retained when n_process=2.

AlJohri commented 4 years ago

Found a temporary workaround:

# include this at the end of my custom pipeline component
doc.user_data['labels'] = [(x.start_char, x.end_char, x.label, x.ent_id_) for x in doc.ents]

doc.user_data is retained when using multiple processes.
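
Reading the workaround data back in the parent process could look roughly like the sketch below. It uses only the standard library: `sample_user_data` is a plain dict standing in for `doc.user_data`, and `EntityRecord` and `records_from_user_data` are hypothetical names introduced for illustration, not part of spaCy. The sample text, offsets, and label value are made up. Since the workaround stores `x.label` rather than `x.label_`, the label is kept as an integer here.

```python
from typing import Any, Dict, List, NamedTuple


class EntityRecord(NamedTuple):
    """Hypothetical container for one entry of user_data['labels']."""
    text: str
    start_char: int
    end_char: int
    label: int   # integer hash, because the workaround stores x.label
    ent_id: str  # string ID, because the workaround stores x.ent_id_


def records_from_user_data(text: str, user_data: Dict[str, Any]) -> List[EntityRecord]:
    """Rebuild entity records from (start_char, end_char, label, ent_id) tuples."""
    records = []
    for start_char, end_char, label, ent_id in user_data.get("labels", []):
        if not ent_id:
            # Skip entities that were not matched by a pattern with an "id".
            continue
        span_text = text[start_char:end_char]
        records.append(EntityRecord(span_text, start_char, end_char, label, ent_id))
    return records


# Made-up sample data; 380 is a placeholder, not a real spaCy label hash.
sample_text = "Front-runner Joe Biden spoke; Sen. Bernie Sanders replied."
sample_user_data = {
    "labels": [
        (13, 22, 380, "joe-biden"),
        (35, 49, 380, "bernie-sanders"),
        (30, 34, 0, ""),  # an entity without an ID, which gets filtered out
    ]
}

for record in records_from_user_data(sample_text, sample_user_data):
    print(record.text, record.start_char, record.end_char, record.ent_id)
```

Storing plain tuples of ints and strings keeps the payload trivially serializable, which is presumably why it survives the trip between processes.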

svlandeg commented 4 years ago

Hi @AlJohri, thanks for the detailed analysis and report!

The token.ent_id attribute was indeed not being serialized. PR #4852 should fix that.
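
As a toy illustration of this kind of failure (a stdlib-only sketch, not spaCy's actual serialization code): `ToyToken`, `to_bytes`, `from_bytes`, and the two attribute lists below are all hypothetical names. The sketch shows how a field that is missing from the list of serialized attributes silently falls back to its default after a round trip, which is roughly what happens when a worker process hands its results back to the parent.

```python
import json
from dataclasses import asdict, dataclass
from typing import Sequence


@dataclass
class ToyToken:
    """Hypothetical stand-in for a token carrying entity annotations."""
    text: str
    ent_type: str = ""
    ent_id: str = ""


def to_bytes(token: ToyToken, attrs: Sequence[str]) -> bytes:
    """Serialize only the attributes named in `attrs`."""
    data = asdict(token)
    return json.dumps({name: data[name] for name in attrs}).encode("utf8")


def from_bytes(payload: bytes) -> ToyToken:
    """Rebuild a token; attributes missing from the payload keep their defaults."""
    return ToyToken(**json.loads(payload.decode("utf8")))


INCOMPLETE_ATTRS = ("text", "ent_type")          # ent_id forgotten
COMPLETE_ATTRS = ("text", "ent_type", "ent_id")  # ent_id included

token = ToyToken("Biden", ent_type="PERSON", ent_id="joe-biden")
lost = from_bytes(to_bytes(token, INCOMPLETE_ATTRS))
kept = from_bytes(to_bytes(token, COMPLETE_ATTRS))

print(repr(lost.ent_id))  # '' because ent_id never made it into the payload
print(repr(kept.ent_id))  # 'joe-biden'
```

No error is raised in the incomplete case, which matches the symptom in the report: the multi-process run simply prints nothing.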

With respect to the custom attributes, I'm a little puzzled, because we have a serialization test that checks just that. I even expanded it to test the token level in the same PR and got no errors. If the PR gets merged and problems persist afterwards, I'd suggest opening a new issue to address that specific problem.
