j-rausch closed this issue 5 years ago
I was able to recreate the issue on my computer. It looks like create_pipe is the only pipeline method for which the alias 'sentencizer' works: it effectively just substitutes 'sbd' for the input 'sentencizer' and expects all future method calls to use 'sbd'. Until the issue is fixed, it probably makes the most sense to just stick with 'sbd' in your code.
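The inconsistency described above can be illustrated with a small plain-Python model of the pipeline. This is a hypothetical sketch for illustration only, not spaCy's actual code: the alias is resolved in create_pipe but nowhere else, so lookups by the alias fail.

```python
# Minimal plain-Python model of the behaviour described above
# (hypothetical names, not spaCy's actual implementation).

ALIASES = {"sentencizer": "sbd"}  # create_pipe silently maps alias -> canonical


class Pipeline:
    def __init__(self):
        self.pipeline = []  # list of (name, component) pairs

    def create_pipe(self, name):
        name = ALIASES.get(name, name)  # the only place the alias is resolved
        return (name, lambda doc: doc)

    def add_pipe(self, component):
        self.pipeline.append(component)

    def has_pipe(self, name):
        # no alias resolution here -- this is the inconsistency
        return name in (n for n, _ in self.pipeline)


nlp = Pipeline()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
print(nlp.has_pipe("sentencizer"))  # False -- the component was stored as 'sbd'
print(nlp.has_pipe("sbd"))          # True
```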
It's feasible though non-trivial to make small code changes to most of the pipeline methods (pipe_names, has_pipe, replace_pipe, rename_pipe, remove_pipe, etc.) to accommodate arbitrary aliases for pipes.
To contributors more familiar, is having multiple possible names for each pipe necessary or worth the effort and loss of code elegance? Or should the code just accept 'sbd' and reject 'sentencizer'?
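For what it's worth, accommodating aliases wouldn't necessarily require touching every method individually: all name lookups could go through one resolution helper. A sketch of that pattern, with made-up helper and class names (not spaCy's actual API):

```python
# Sketch of centralised alias resolution (hypothetical names,
# not spaCy's actual implementation).

ALIASES = {"sentencizer": "sbd"}


def resolve_pipe_name(name):
    """Map any known alias to the canonical component name."""
    return ALIASES.get(name, name)


class Pipeline:
    def __init__(self):
        self.pipeline = []  # list of (name, component) pairs

    def add_pipe(self, component, name):
        # every method resolves the name through the same helper
        self.pipeline.append((resolve_pipe_name(name), component))

    def has_pipe(self, name):
        return resolve_pipe_name(name) in (n for n, _ in self.pipeline)

    def remove_pipe(self, name):
        name = resolve_pipe_name(name)
        self.pipeline = [(n, c) for n, c in self.pipeline if n != name]


nlp = Pipeline()
nlp.add_pipe(lambda doc: doc, name="sentencizer")
print(nlp.has_pipe("sbd"))          # True
print(nlp.has_pipe("sentencizer"))  # True -- alias and canonical name both work
```

With this layout the aliases live in a single dict, so adding or retiring one is a one-line change rather than a sweep through every pipeline method.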
Thanks for the detailed analysis – I'm pretty sure that's correct!
To contributors more familiar, is having multiple possible names for each pipe necessary or worth the effort and loss of code elegance? Or should the code just accept 'sbd' and reject 'sentencizer'?
To be honest, I think we should have just gone with sbd, or at least standardised on one name. The original idea behind sentencizer was to make it more explicit and immediately obvious what the component does from looking at the pipe names (sbd is kinda cryptic if you don't know that it means "sentence boundary detection"). sbd seemed more convenient, though, so we added that as well... and now we're stuck with both 😝
We can still change this, possibly already for 2.1.0 (standardise on one name, output a warning if the other is used?). Also open to naming suggestions!
How about sents? Otherwise, sentencizer seems fine. sbd seems cryptic to me.
I can make changes in the code/docs in favor of one name if you think it's a good idea.
I can just follow through with sentencizer if that makes things simple going forward. I suppose I'll have to look for docs referring to sbd as well.
bump
Sorry, totally missed your comments! Thanks for offering to take care of this 👍
Let's go for sentencizer then – but we should probably only implement this on develop, since it's a breaking change. We don't need to worry much about existing models that include the sbd component, since users will have to retrain their models with v2.1.x anyways. However, spaCy should probably raise a more specific error on nlp.create_pipe('sbd') informing the user that this component name has been deprecated and to use 'sentencizer' instead.
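A deprecation guard like that could be as simple as a check at the top of the factory lookup. The following is a hedged sketch with hypothetical names (DEPRECATED_NAMES, FACTORIES, and the error text are all made up for illustration), not the actual spaCy implementation:

```python
# Hypothetical sketch of a deprecation guard in a create_pipe-style factory.
DEPRECATED_NAMES = {"sbd": "sentencizer"}
FACTORIES = {"sentencizer": lambda: (lambda doc: doc)}


def create_pipe(name):
    """Instantiate a pipeline component, rejecting deprecated names
    with an error that points at the replacement."""
    if name in DEPRECATED_NAMES:
        raise ValueError(
            "Component name '{}' is deprecated; use '{}' instead.".format(
                name, DEPRECATED_NAMES[name]
            )
        )
    return FACTORIES[name]()
```

With this guard, create_pipe('sbd') fails loudly with a message naming 'sentencizer', while create_pipe('sentencizer') still returns a component as before.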
So there seem to be a lot of references to sbd internally (e.g. names of tests like test_sbd.py) as well as references on the website (e.g. sbd-parser, sbd-manual, sbd-component, etc.) in addition to the documentation. Would you want me to try and change both, or just the website, or stick to the source code? I'm happy to try, but I am new to contributing to spaCy.
Would you want me to try and change both, or just the website, or stick to the source code?
I'd say sbd is totally fine internally, because it is the common abbreviation for "sentence boundary detection" (whereas "sentencizer" is really something we made up because it's a cool component name 😃). So you'd only have to focus on changing the string name of the built-in component (and references to it).
I'm getting ready to submit my first PR, but even before my changes the currently existing tests seem to fail a lot. In particular, I'm getting "spacy/tests/lang/tr/test_lemmatization.py Killed" when I run py.test spacy. Is there something wrong with my developer environment?
@gavrieltal First, are you on the develop branch? (The tests on there should be better and faster, and we want the changes on the develop branch anyways.)
And did you check that everything compiled correctly? Here's how I usually set up my development environment (macOS / Linux) and run tests:
python -m venv .env # create virtual environment
source .env/bin/activate # activate virtual environment
pip install -r requirements.txt # install requirements (including dev dependencies)
python setup.py build_ext --inplace # compile spaCy
python -m pytest spacy # run tests
Also, is there a traceback or do you see more info when the tests die?
I switched to the develop branch (I was on master). I followed your instructions precisely, except the first line was python3 -m venv .env. (I ran pip install.. and pip3 install.., as well as python3 -m pytest spacy and python -m pytest spacy, and I also tried using python2's virtualenv just in case.) I have the same problem. The compilation stage featured a number of warnings about deprecation and flags working for C/ObjC but not C++. Regarding errors, my test suite performance is the same as previously:
(.env) gavriel@stockholm:~/Desktop/python/spaCy$ python3 -m pytest spacy
================================================================== test session starts ==================================================================
platform linux -- Python 3.6.7, pytest-4.0.1, py-1.7.0, pluggy-0.8.0
rootdir: /home/gavriel/Desktop/python/spaCy, inifile:
plugins: timeout-1.3.3
collected 1633 items
spacy/tests/test_align.py ............ [ 0%]
spacy/tests/test_cli.py . [ 0%]
spacy/tests/test_gold.py ...... [ 1%]
spacy/tests/test_misc.py ........... [ 1%]
spacy/tests/test_pickles.py .. [ 1%]
spacy/tests/doc/test_add_entities.py .. [ 2%]
spacy/tests/doc/test_array.py ..... [ 2%]
spacy/tests/doc/test_creation.py ... [ 2%]
spacy/tests/doc/test_doc_api.py ..........x....... [ 3%]
spacy/tests/doc/test_pickle_doc.py ..... [ 3%]
spacy/tests/doc/test_span.py .............. [ 4%]
spacy/tests/doc/test_span_merge.py ........ [ 5%]
spacy/tests/doc/test_token_api.py ........... [ 6%]
spacy/tests/doc/test_underscore.py ..................... [ 7%]
spacy/tests/lang/test_attrs.py ................................... [ 9%]
spacy/tests/lang/ar/test_exceptions.py ...... [ 9%]
spacy/tests/lang/ar/test_text.py . [ 9%]
spacy/tests/lang/bn/test_tokenizer.py .......... [ 10%]
spacy/tests/lang/ca/test_exception.py .... [ 10%]
spacy/tests/lang/ca/test_text.py ................. [ 11%]
spacy/tests/lang/da/test_exceptions.py ................. [ 12%]
spacy/tests/lang/da/test_lemma.py .... [ 13%]
spacy/tests/lang/da/test_prefix_suffix_infix.py ................................ [ 15%]
spacy/tests/lang/da/test_text.py .............. [ 15%]
spacy/tests/lang/de/test_exceptions.py ............. [ 16%]
spacy/tests/lang/de/test_lemma.py ...... [ 17%]
spacy/tests/lang/de/test_parser.py .. [ 17%]
spacy/tests/lang/de/test_prefix_suffix_infix.py ........................ [ 18%]
spacy/tests/lang/de/test_text.py ....... [ 19%]
spacy/tests/lang/el/test_exception.py ..... [ 19%]
spacy/tests/lang/el/test_text.py ...... [ 19%]
spacy/tests/lang/en/test_customized_tokenizer.py . [ 19%]
spacy/tests/lang/en/test_exceptions.py .................................................... [ 22%]
spacy/tests/lang/en/test_indices.py .. [ 23%]
spacy/tests/lang/en/test_noun_chunks.py . [ 23%]
spacy/tests/lang/en/test_parser.py ..... [ 23%]
spacy/tests/lang/en/test_prefix_suffix_infix.py .......................xx [ 24%]
spacy/tests/lang/en/test_punct.py ....................................... [ 27%]
spacy/tests/lang/en/test_sbd.py ....x [ 27%]
spacy/tests/lang/en/test_tagger.py . [ 27%]
spacy/tests/lang/en/test_text.py .......x............ [ 28%]
spacy/tests/lang/es/test_exception.py ..... [ 29%]
spacy/tests/lang/es/test_text.py ...... [ 29%]
spacy/tests/lang/fi/test_tokenizer.py .. [ 29%]
spacy/tests/lang/fr/test_exceptions.py ............x. [ 30%]
spacy/tests/lang/fr/test_lemmatization.py ..x. [ 30%]
spacy/tests/lang/fr/test_prefix_suffix_infix.py .. [ 30%]
spacy/tests/lang/fr/test_text.py ... [ 31%]
spacy/tests/lang/ga/test_tokenizer.py .. [ 31%]
spacy/tests/lang/he/test_tokenizer.py ...... [ 31%]
spacy/tests/lang/hu/test_tokenizer.py .x...x............x...x.................................................................................... [ 38%]
........................................................................................................................................ [ 46%]
spacy/tests/lang/id/test_prefix_suffix_infix.py ......................... [ 48%]
spacy/tests/lang/id/test_text.py . [ 48%]
spacy/tests/lang/ja/test_lemmatization.py sssss [ 48%]
spacy/tests/lang/ja/test_tokenizer.py sssssssssssssss [ 49%]
spacy/tests/lang/nb/test_tokenizer.py .. [ 49%]
spacy/tests/lang/nl/test_text.py .. [ 49%]
spacy/tests/lang/pt/test_text.py .. [ 49%]
spacy/tests/lang/ro/test_lemmatizer.py .... [ 49%]
spacy/tests/lang/ro/test_tokenizer.py ...... [ 50%]
spacy/tests/lang/ru/test_exceptions.py sss [ 50%]
spacy/tests/lang/ru/test_lemmatizer.py sssssssssssssssssssss [ 51%]
spacy/tests/lang/ru/test_text.py . [ 51%]
spacy/tests/lang/ru/test_tokenizer.py sssssssssssssssssssssssssssssssssssssss [ 54%]
spacy/tests/lang/sv/test_tokenizer.py ....... [ 54%]
spacy/tests/lang/th/test_tokenizer.py s [ 54%]
spacy/tests/lang/tr/test_lemmatization.py Killed
(.env) gavriel@stockholm:~/Desktop/python/spaCy$
Perfect, the steps all sound correct and the output during compilation is normal 👍
How much memory does your machine have? Maybe you're running out of memory!
Wow, the one test is actually blowing my 3.5 GB of memory. I hit like 200% RAM usage in htop before the test was killed. My old-school hard-drive access light turns on. I'm not changing anything related to Turkish; maybe I'll just submit the request and depend on Travis.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
Basically following the tutorial in https://spacy.io/usage/linguistic-features#sbd-component, with the addition of calling has_pipe() in a loop, which leads to the following error:
ValueError: [E007] 'sbd' already exists in pipeline. Existing names: ['sbd']
I noticed that, while it is possible to create a pipeline with the "sentencizer" string, calling has_pipe("sentencizer") will not return True for the pipeline afterwards.
Replacing "sentencizer" with "sbd" in the code resolves the error.
Your Environment
python 3.6, spacy 2.0.12