explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

PhraseMatcher Silently Fails After Pickling #4333

Closed · aaronkub closed 4 years ago

aaronkub commented 4 years ago

How to reproduce the behaviour

Hello and thank you very much for this beautiful library. :)

I am using spaCy in a PySpark application and am finding that the PhraseMatcher behaves differently than when I test locally. I believe the root cause is that I'm pickling the Language (nlp) object and deserializing on the workers.
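
Roughly, the pickling pattern looks like this (a simplified sketch of my setup, not the actual job; the texts and the per-partition work are illustrative):

import pickle

import spacy
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
sc = spark.sparkContext

nlp = spacy.load("en_core_web_sm")
nlp_bytes = sc.broadcast(pickle.dumps(nlp))  # pickled once on the driver

def process_partition(texts):
    # each worker deserializes its own copy of the Language object
    nlp_local = pickle.loads(nlp_bytes.value)
    for text in texts:
        yield text, len(nlp_local(text))

rdd = sc.parallelize(["This is a Random phrase.", "Another one."])
print(rdd.mapPartitions(process_partition).collect())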

From what I can tell this is the same issue raised in #1939 and #3248.

The code below fails with Python 3.6.9 and spaCy 2.1.8:

import pickle
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("test_words", None, *[nlp.make_doc("random")])

test_doc = nlp("This is a Random phrase.")

print("Original Matcher Found {} matches."
    .format(len(matcher(test_doc)))) # 1 match

with open("matcher.pickle", 'wb') as tmp:
    pickle.dump(matcher, tmp)

with open("matcher.pickle", 'rb') as tmp:
    unpickled_matcher = pickle.load(tmp)

print("Unpickled Matcher Found {} matches."
    .format(len(unpickled_matcher(test_doc)))) # 0 matches

ines commented 4 years ago

Thanks for the report! We actually just merged #4309, which includes a complete rewrite of the PhraseMatcher algorithm and also reimplements pickling. So ideally, it should all work as expected again.

If you want to try it out in your codebase, you can already install the upcoming version via pip install spacy==2.2.0.dev11. (It's not a stable release yet, so you'll probably want to use a separate virtual environment. You'll also need to download new models for compatibility.) If you end up trying it, let us know how it goes! The PhraseMatcher should now also be around 10× faster than before.

adrianeboyd commented 4 years ago

Do you have a really large number of phrases to match? It might be related to #4308: when you unpickle, the phrases are loaded in a different order, so it ends up missing a different subset of the matches than before. (I admit that I don't understand exactly what's going wrong with the docs/vocab when it starts missing matches, though.)

I tested the new PhraseMatcher with pickling and unpickling up to ~450,000 phrases and didn't run into any errors in very straightforward tests, but it's always good to see it tested in more realistic scenarios, too.
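
The tests were roughly along these lines (a condensed sketch; the generated phrases here are made up for illustration):

import pickle

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# add a large number of generated two-token phrases
patterns = [nlp.make_doc("phrase {}".format(i)) for i in range(100000)]
matcher.add("MANY", None, *patterns)

doc = nlp.make_doc("text with phrase 123 and phrase 99999 in it")
before = len(matcher(doc))

# round-trip through pickle and compare the match counts
unpickled = pickle.loads(pickle.dumps(matcher))
after = len(unpickled(doc))

print(before, after)  # both should be 2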

aaronkub commented 4 years ago

Thanks to both of you for the quick reply! My list of phrases is not huge (~2000), but it does sound similar to #4308 because only certain phrases are missed after unpickling.

As far as testing the new PhraseMatcher, I'm having an issue loading the model in the latest development release.

python3.6 -m venv env
source env/bin/activate
python -m pip install spacy==2.2.0.dev11
python -m spacy download en_core_web_sm --no-deps  # without this flag: "ERROR: No matching distribution found for spacy>=2.2.0"

Then, in Python:

import spacy
nlp = spacy.load("en_core_web_sm")

Outputs: "ValueError: [E160] Can't find language data file: /path/to/env/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/lemma_lookup.json.gz"

I'm assuming this is some kind of user error on my part, and it's off topic from the reported bug, but if anyone knows how to fix this error, that would be great.

ines commented 4 years ago

Ah, sorry, the development sdist seems to be a bit broken 😞 If it's not too much of a hassle, could you try building from source?

git clone https://github.com/explosion/spaCy
cd spaCy
export PYTHONPATH=`pwd`
pip install -r requirements.txt
python setup.py build_ext --inplace
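
Afterwards, a quick sanity check (just a suggestion) that the checkout on PYTHONPATH is what actually gets imported:

# with PYTHONPATH still pointing at the checkout
import spacy
print(spacy.__version__)  # should be a 2.2.0.dev version
print(spacy.__file__)     # should point into the cloned spaCy directory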

adrianeboyd commented 4 years ago

It just occurred to me that the PhraseMatcher forgets the attr on pickling. This could also be the problem?
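
If that's it, one possible interim workaround (an untested sketch, not an official API) is to pickle the pattern texts and the attr yourself and rebuild the matcher after loading:

import pickle

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# pickle the raw ingredients instead of the matcher itself
state = pickle.dumps({"attr": "LOWER", "texts": ["random"]})

# on the other side, rebuild the matcher from the loaded state
loaded = pickle.loads(state)
matcher = PhraseMatcher(nlp.vocab, attr=loaded["attr"])
matcher.add("test_words", None, *[nlp.make_doc(t) for t in loaded["texts"]])

print(len(matcher(nlp("This is a Random phrase."))))  # 1 match again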

aaronkub commented 4 years ago

Okay, I think at least part of my problem is that the PhraseMatcher is forgetting attr="LOWER" after unpickling. It looks like this is still a problem in the latest build.

Using this script:

import pickle
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')

matcher_no_attr = PhraseMatcher(nlp.vocab)
matcher_no_attr.add("test_words", None, *[nlp("word")])

matcher_lower = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher_lower.add("test_words", None, *[nlp("word")])

test_doc = nlp("Word, WORD, word.")

with open("matcher_no_attr.pickle", 'wb') as tmp:
    pickle.dump(matcher_no_attr, tmp)
with open("matcher_lower.pickle", 'wb') as tmp:
    pickle.dump(matcher_lower, tmp)
with open("matcher_no_attr.pickle", 'rb') as tmp:
    matcher_no_attr_unpickled = pickle.load(tmp)
with open("matcher_lower.pickle", 'rb') as tmp:
    matcher_lower_unpickled = pickle.load(tmp)

print("No attr matcher found {} matches."
    .format(len(matcher_no_attr(test_doc))))
print("(Unpickled) No attr matcher found {} matches."
    .format(len(matcher_no_attr_unpickled(test_doc))))
print("LOWER attr matcher found {} matches."
    .format(len(matcher_lower(test_doc))))
print("(Unpickled) LOWER attr matcher found {} matches."
    .format(len(matcher_lower_unpickled(test_doc))))

The output is:

No attr matcher found 1 matches.
(Unpickled) No attr matcher found 1 matches.
LOWER attr matcher found 3 matches.
(Unpickled) LOWER attr matcher found 1 matches.

ines commented 4 years ago

Okay, I just merged @adrianeboyd's PR in #4336, which should address the attr pickling! Could you try again using the version on master?

aaronkub commented 4 years ago

Looks like that resolved it! It's working for me with the master build. Thanks for your quick responses, especially on Sunday. 😁

adrianeboyd commented 4 years ago

It was a little unpredictable, but I didn't usually have the vocab problems with <10000 phrases, so I bet the attr was the main issue. I'm glad to hear it's working!

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.