explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
30.21k stars 4.4k forks source link

Unexpected behavior of nlp.has_pipe("sentencizer") #2693

Closed j-rausch closed 5 years ago

j-rausch commented 6 years ago

How to reproduce the behaviour

Basically following the tutorial in https://spacy.io/usage/linguistic-features#sbd-component with the addition of calling has_pipe() in a loop:

import spacy
from spacy.lang.en import English

nlp = English()  # just the language with no model
for i in range(2):
    if not nlp.has_pipe('sentencizer'):
        sbd = nlp.create_pipe('sentencizer')   # or: nlp.create_pipe('sbd')
        nlp.add_pipe(sbd)
    doc = nlp(u"This is a sentence. This is another sentence.")
    for sent in doc.sents:
        print(sent.text)

Leads to following error: ValueError: [E007] 'sbd' already exists in pipeline. Existing names: ['sbd']

I noticed that, while it is possible to create a pipeline with the "sentencizer" string, calling has_pipe("sentencizer") will not return True for the pipeline afterwards.

Replacing "sentencizer" with "sbd" in the code resolves the error.

Your Environment

python 3.6, spacy 2.0.12

gavrieltal commented 6 years ago

I was able to recreate the issue on my computer. It looks like pipe_create is the only pipe method for which the alias 'sentencizer' works, effectively just substituting the input 'sentencizer' for the input 'sbd' and expecting all future method calls to use 'sbd'. Until the issue is fixed, it probably makes the most sense just to stick with 'sbd' in your code.

It's feasible though non-trivial to make small code changes to most of the pipeline methods (pipe_names, has_pipe, replace_pipe, rename_pipe, remove_pipe, etc.) to accommodate arbitrary aliases for pipes.

To contributors more familiar, is having multiple possible names for each pipe necessary or worth the effort and loss of code elegance? Or should the code just accept 'sbd' and reject 'sentencizer'?

ines commented 6 years ago

Thanks for the detailed analysis – I'm pretty sure that's correct!

To contributors more familiar, is having multiple possible names for each pipe necessary or worth the effort and loss of code elegance? Or should the code just accept 'sbd' and reject 'sentencizer'?

To be honest, I think we should have just gone with sbd or at least standardised on one name. The original idea behind sentencizer was to make it more explicit and immediately obvious what the component does from looking at the pipe names (sbd is kinda cryptic if you don't know that it means "sentence boundary detection"). sbd seemed more convenient, though, so we added that as well... and now we're stuck with both 😝

We can still change this, possibly already for 2.1.0 (standardise on one name, output a warning if the other is used?). Also open to naming suggestions!

gavrieltal commented 6 years ago

how about sents? Otherwise, sentencizer seems fine. sbd seems cryptic to me.

I can make changes in the code/docs in favor of one name if you think it's a good idea.

gavrieltal commented 6 years ago

I can just follow through with sentencizer if that makes things simple going forward. I suppose I'll have to look for docs referring to sbd as well.

gavrieltal commented 5 years ago

bump

ines commented 5 years ago

Sorry, totally missed your comments! Thanks for offering to take care of this 👍

Let's go for sentencizer then – but we should probably only implement this on develop, since it's a breaking change. We don't need to worry much about existing models that include the sbd component, since users will have to retrain their models with v2.1.x anyways. However, spaCy should probably raise a more specific error on nlp.create_pipe('sbd') informing the user that this component name has been deprecated and to use 'sentencizer' instead.

gavrieltal commented 5 years ago

So there seem to be a lot of references to sbd internally (e.g. names of tests like in test_sbd.py) as well as references for the website (e.g. sbd-parser, sbd-manual, sbd-component, etc.) in addition to documentation. Would you want me to try and change both, or just the website, or stick to the source code? I'm happy to try, but I am new to spacy contribution.

ines commented 5 years ago

Would you want me to try and change both, or just the website, or stick to the source code?

I'd say sbd is totally fine internally, because it is the common abbreviation for "sentence boundary detection" (whereas "sentencizer" is really something we made up because it's a cool component name 😃).

So you'd only have to focus on changing string name of the built-in component (and references to it).

gavrieltal commented 5 years ago

I'm getting ready to submit my first PR, but even before my changes the currently existing tests seem to fail a lot. In particular, I'm getting "spacy/tests/lang/tr/test_lemmatization.py Killed" when I run py.test spacy. Is there something wrong with my developer environment?

ines commented 5 years ago

@gavrieltal First, are you on the develop branch? (The tests on there should be better and faster, and we want the changes on the develop branch anyways.)

And did you check that everything compiled correctly? Here's how I usually set up my development environment (macOS / Linux) and run tests:

python -m venv .env                  # create virtual environment
source .env/bin/activate             # activate virtual environment
pip install -r requirements.txt      # install requirements (including dev dependencies)
python setup.py build_ext --inplace  # compile spaCy
python -m pytest spacy               # run tests

Also, is there a traceback or do you see more info when the tests die?

gavrieltal commented 5 years ago

I switched to the develop branch (I was on master). I followed your instructions precisely, except the first line was python3 -m venv .env. (I ran pip install.. and pip3 install.., as well as python3 -m pytest spacy and python -m pytest spacy, and also I tried using python2's virtualenv just in case). I have the same problem. The compilation stage featured a number of warnings about deprecation and flags working for C/ObjC but not C++. Regarding errors, my test suite performance is the same as previously:


(.env) gavriel@stockholm:~/Desktop/python/spaCy$ python3 -m pytest spacy
================================================================== test session starts ==================================================================
platform linux -- Python 3.6.7, pytest-4.0.1, py-1.7.0, pluggy-0.8.0
rootdir: /home/gavriel/Desktop/python/spaCy, inifile:
plugins: timeout-1.3.3
collected 1633 items                                                                                                                                    

spacy/tests/test_align.py ............                                                                                                            [  0%]
spacy/tests/test_cli.py .                                                                                                                         [  0%]
spacy/tests/test_gold.py ......                                                                                                                   [  1%]
spacy/tests/test_misc.py ...........                                                                                                              [  1%]
spacy/tests/test_pickles.py ..                                                                                                                    [  1%]
spacy/tests/doc/test_add_entities.py ..                                                                                                           [  2%]
spacy/tests/doc/test_array.py .....                                                                                                               [  2%]
spacy/tests/doc/test_creation.py ...                                                                                                              [  2%]
spacy/tests/doc/test_doc_api.py ..........x.......                                                                                                [  3%]
spacy/tests/doc/test_pickle_doc.py .....                                                                                                          [  3%]
spacy/tests/doc/test_span.py ..............                                                                                                       [  4%]
spacy/tests/doc/test_span_merge.py ........                                                                                                       [  5%]
spacy/tests/doc/test_token_api.py ...........                                                                                                     [  6%]
spacy/tests/doc/test_underscore.py .....................                                                                                          [  7%]
spacy/tests/lang/test_attrs.py ...................................                                                                                [  9%]
spacy/tests/lang/ar/test_exceptions.py ......                                                                                                     [  9%]
spacy/tests/lang/ar/test_text.py .                                                                                                                [  9%]
spacy/tests/lang/bn/test_tokenizer.py ..........                                                                                                  [ 10%]
spacy/tests/lang/ca/test_exception.py ....                                                                                                        [ 10%]
spacy/tests/lang/ca/test_text.py .................                                                                                                [ 11%]
spacy/tests/lang/da/test_exceptions.py .................                                                                                          [ 12%]
spacy/tests/lang/da/test_lemma.py ....                                                                                                            [ 13%]
spacy/tests/lang/da/test_prefix_suffix_infix.py ................................                                                                  [ 15%]
spacy/tests/lang/da/test_text.py ..............                                                                                                   [ 15%]
spacy/tests/lang/de/test_exceptions.py .............                                                                                              [ 16%]
spacy/tests/lang/de/test_lemma.py ......                                                                                                          [ 17%]
spacy/tests/lang/de/test_parser.py ..                                                                                                             [ 17%]
spacy/tests/lang/de/test_prefix_suffix_infix.py ........................                                                                          [ 18%]
spacy/tests/lang/de/test_text.py .......                                                                                                          [ 19%]
spacy/tests/lang/el/test_exception.py .....                                                                                                       [ 19%]
spacy/tests/lang/el/test_text.py ......                                                                                                           [ 19%]
spacy/tests/lang/en/test_customized_tokenizer.py .                                                                                                [ 19%]
spacy/tests/lang/en/test_exceptions.py ....................................................                                                       [ 22%]
spacy/tests/lang/en/test_indices.py ..                                                                                                            [ 23%]
spacy/tests/lang/en/test_noun_chunks.py .                                                                                                         [ 23%]
spacy/tests/lang/en/test_parser.py .....                                                                                                          [ 23%]
spacy/tests/lang/en/test_prefix_suffix_infix.py .......................xx                                                                         [ 24%]
spacy/tests/lang/en/test_punct.py .......................................                                                                         [ 27%]
spacy/tests/lang/en/test_sbd.py ....x                                                                                                             [ 27%]
spacy/tests/lang/en/test_tagger.py .                                                                                                              [ 27%]
spacy/tests/lang/en/test_text.py .......x............                                                                                             [ 28%]
spacy/tests/lang/es/test_exception.py .....                                                                                                       [ 29%]
spacy/tests/lang/es/test_text.py ......                                                                                                           [ 29%]
spacy/tests/lang/fi/test_tokenizer.py ..                                                                                                          [ 29%]
spacy/tests/lang/fr/test_exceptions.py ............x.                                                                                             [ 30%]
spacy/tests/lang/fr/test_lemmatization.py ..x.                                                                                                    [ 30%]
spacy/tests/lang/fr/test_prefix_suffix_infix.py ..                                                                                                [ 30%]
spacy/tests/lang/fr/test_text.py ...                                                                                                              [ 31%]
spacy/tests/lang/ga/test_tokenizer.py ..                                                                                                          [ 31%]
spacy/tests/lang/he/test_tokenizer.py ......                                                                                                      [ 31%]
spacy/tests/lang/hu/test_tokenizer.py .x...x............x...x.................................................................................... [ 38%]
........................................................................................................................................          [ 46%] 
spacy/tests/lang/id/test_prefix_suffix_infix.py .........................                                                                         [ 48%]
spacy/tests/lang/id/test_text.py .                                                                                                                [ 48%]
spacy/tests/lang/ja/test_lemmatization.py sssss                                                                                                   [ 48%]
spacy/tests/lang/ja/test_tokenizer.py sssssssssssssss                                                                                             [ 49%]
spacy/tests/lang/nb/test_tokenizer.py ..                                                                                                          [ 49%]
spacy/tests/lang/nl/test_text.py ..                                                                                                               [ 49%]
spacy/tests/lang/pt/test_text.py ..                                                                                                               [ 49%]
spacy/tests/lang/ro/test_lemmatizer.py ....                                                                                                       [ 49%]
spacy/tests/lang/ro/test_tokenizer.py ......                                                                                                      [ 50%]
spacy/tests/lang/ru/test_exceptions.py sss                                                                                                        [ 50%]
spacy/tests/lang/ru/test_lemmatizer.py sssssssssssssssssssss                                                                                      [ 51%]
spacy/tests/lang/ru/test_text.py .                                                                                                                [ 51%]
spacy/tests/lang/ru/test_tokenizer.py sssssssssssssssssssssssssssssssssssssss                                                                     [ 54%]
spacy/tests/lang/sv/test_tokenizer.py .......                                                                                                     [ 54%]
spacy/tests/lang/th/test_tokenizer.py s                                                                                                           [ 54%]
spacy/tests/lang/tr/test_lemmatization.py Killed
(.env) gavriel@stockholm:~/Desktop/python/spaCy$```
ines commented 5 years ago

Perfect, the steps all sound correct and the output during compilation is normal 👍

How much memory does your machine have? Maybe you're running out of memory!

gavrieltal commented 5 years ago

wow, the one test is actually blowing my 3.5 GB of memory. I hit like 200% RAM usage in htop before the test was killed. My old-school hard-drive access light turns on. I'm not changing anything related to Turkish; maybe I'll just submit the request and depend on Travis.

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.