explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

💫 spaCy v2.0.0 alpha – details, feedback & questions (plus stickers!) #1105

Closed ines closed 6 years ago

ines commented 7 years ago

We're very excited to finally publish the first alpha pre-release of spaCy v2.0. It's still an early release and (obviously) not intended for production use. You might come across a NotImplementedError – see the release notes for the implementation details that are still missing.

This thread is intended for general discussion, feedback and all questions related to v2.0. If you come across more complex bugs, feel free to open a separate issue.

Quickstart & overview

The most important new features

Installation

spaCy v2.0.0-alpha is available on pip as spacy-nightly. If you want to test the new version, we recommend setting up a clean environment first. To install the new model, you'll have to download it with its full name, using the --direct flag.

pip install spacy-nightly
python -m spacy download en_core_web_sm-2.0.0-alpha --direct   # English
python -m spacy download xx_ent_wiki_sm-2.0.0-alpha --direct   # Multi-language NER

# Load the model either via spacy.load or by importing the package directly:
import spacy
nlp = spacy.load('en_core_web_sm')

import en_core_web_sm
nlp = en_core_web_sm.load()
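
For a quick check that the install worked, you can run the model over a short text (a minimal example, not from the original announcement):

doc = nlp(u"This is a sentence.")
print([(w.text, w.pos_) for w in doc])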

Alpha models for German, French and Spanish are coming soon!

Now on to the fun part – stickers!


We just got our first delivery of spaCy stickers and want to share them with you! There's only one small favour we'd like to ask. The part we're currently behind on is the tests – this includes our test suite as well as in-depth testing of the new features and usage examples. So here's the idea:

Submit a PR with your test to the develop branch – if the test covers a bug and currently fails, mark it with @pytest.mark.xfail. For more info, see the test suite docs. Once your pull request is accepted, send us your address via email or private message on Gitter and we'll mail you stickers.
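
For instance, a placeholder xfail test could look like the sketch below (hypothetical; the en_tokenizer fixture is assumed from the existing test suite, and the assertion is just a stand-in for the behaviour your bug report expects):

import pytest

@pytest.mark.xfail
def test_issue_placeholder(en_tokenizer):
    # Marked xfail because the expected behaviour doesn't hold yet.
    doc = en_tokenizer("The text that triggers the bug goes here.")
    assert doc[0].text == "expected-token"   # placeholder expectation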

If you can't find anything, don't have time or can't be bothered, that's fine too. Posting your feedback on spaCy v2.0 here counts as well. To be honest, we really just want to mail out stickers 😉

honnibal commented 7 years ago

How's the speed relative to CPU for you?

vishnune-kore commented 7 years ago

@honnibal
entities: 5
train data size: 30 sentences

processor: Intel® Core™ i7-6500U CPU @ 2.50GHz × 4
time: ~30 sec

processor: Intel® Core™ i7-5500U CPU
time: ~43 sec

processor: Intel® Core™ i7-5500U CPU + GeForce 840M (compute capability 5.0)
time: ~19 sec
GPU utilization: avg 70%

tindzk commented 6 years ago

Thanks for the new release! The API changes are much appreciated.

Would it be possible to provide a pre-trained German NER model for v2.0.0-alpha? I tried xx_ent_wiki_sm, but it yielded too many false positives in my tests. I wrote up a small script to train my own model on the GermEval dataset (using German.begin_training), but found that even with a GPU, training took considerable time. It doesn't seem to perform any batching, so I had to combine several sentences into a long text. Unfortunately, this may have created a bias towards longer inputs: when trying to classify single sentences, my custom model never recognised any entities.

Is there any code from previous versions that I could adapt?

AthiraMstack commented 6 years ago

I created an NER model using spaCy v2.0 by adding some new entity types and saved the model using the to_disk function, as per the documentation at https://alpha.spacy.io/docs/usage/training-ner . When I test the model, it only identifies the new entities. All the other entities that spaCy's original model identifies are tagged incorrectly or not tagged at all. What could be the reason? How can I add the new entities to the original spaCy model so that it identifies all the original tags plus my new NER tags?

Can anyone please help? It's really urgent! @honnibal

JoshMcclung commented 6 years ago

@rajeee any luck with this? I'm having the same problem.

Thanks!

asyd commented 6 years ago

@AthiraMstack I agree, I think more effort should be put into this part; I have exactly the same issue. See #1182

oroszgy commented 6 years ago

@honnibal @ines I am trying to port my Hungarian model to v2, but it looks like the python -m spacy model command no longer works on the dev branch. Also, the documentation doesn't say anything about assembling the raw files into a model. Could you point me to where I should start?

honnibal commented 6 years ago

@oroszgy Try

import spacy
nlp = spacy.blank('hu')
nlp.to_disk('/path/to/export')

This should give you the model skeleton. I guess we should have another command like this on the command line?
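
The exported directory should then load back the usual way, e.g. (a minimal sketch):

import spacy
nlp = spacy.load('/path/to/export')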

honnibal commented 6 years ago

Btw you should now be able to just do spacy from the command line, instead of python -m spacy.

I had been worried about adding scripts, because I hate it when I install stuff into the system and I run stuff that's not in my virtualenv. We found a nice solution to this though: the script just runs python -m spacy $@ --- so you always know that when you run spacy, you're getting the one that matches your Python import.

oroszgy commented 6 years ago

@honnibal It looks like I've managed to port the model cli from the 1.x branch. Do you want me to do a PR?

viksit commented 6 years ago

@honnibal is token.prob not yet implemented in the default model?

a = nlp(u"this is a sentence")
print("d: ", a[0].dep_)
print("p: ", a[0].prob)

d:  nsubj
p:  0.0

niklas88 commented 6 years ago

I'm having trouble getting the pickle support to work. In our system we pickle query objects for a QA system so as not to have to rerun all queries when retraining. These include spans of tokens. I switched to spaCy recently and currently it works great, except for this caching mechanism being broken.

So now I tried with spaCy v2.0 alpha and am getting the following error while unpickling:

  File "/app/query_translator/learner.py", line 128, in get_cached_evaluated_queries
    queries = pickle.load(open(cached_filename, 'rb'))
  File "spacy/tokens/token.pyx", line 27, in spacy.tokens.token.Token.__cinit__     (spacy/tokens/token.cpp:3912)
TypeError: __cinit__() takes exactly 3 positional arguments (0 given)

Edit: I had a small bug where I was storing lists of tokens instead of spans; fixing this gives me basically the same error, but on the span:

  File "spacy/tokens/span.pyx", line 24, in spacy.tokens.span.Span.__cinit__ (spacy/tokens/span.cpp:3375)
TypeError: __cinit__() takes at least 3 positional arguments (0 given)
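
(A possible interim workaround, not from this thread: rather than pickling Span or Token objects directly, store the Doc's bytes plus plain offsets and rebuild the spans after loading. This assumes v2's Doc.to_bytes()/from_bytes().)

import pickle

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The answer lives in this sentence.")
spans = [(ent.start, ent.end) for ent in doc.ents]   # offsets, not Span objects

blob = pickle.dumps({'doc_bytes': doc.to_bytes(), 'spans': spans})

data = pickle.loads(blob)
doc2 = Doc(nlp.vocab).from_bytes(data['doc_bytes'])
rebuilt_spans = [doc2[start:end] for start, end in data['spans']]
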
Nicholas-Autio-Mitchell commented 6 years ago

Can anyone explain exactly how the models are so much smaller in memory now? Are there any pros/cons other than greater available memory/slower to train?

I posted it here on StackOverflow as well.

andychisholm commented 6 years ago

I'm seeing a slow down on the order of ~10x in single-threaded performance between 1.9.0 and spacy-nightly.

Setup looks like this:

nlp = spacy.load('en_core_web_sm', parser=False)

And we're measuring the time of:

for d in docs:
    sd = nlp(d['text'])
    d['mentions'] = [{
        'start': ent.start_char,
        'stop': ent.end_char,
        'label': ent.label_
    } for ent in sd.ents]

Given a sample of 1000 docs:
1.9.0 w/ en_core_web_sm ~= 103 doc/s
2.0.0a11 w/ en_core_web_sm ~= 11 doc/s

Running on Ubuntu 14.04. The box has 32 cores, but I've set OPENBLAS_NUM_THREADS=1 (both tests appear to utilise only a single core in htop).

honnibal commented 6 years ago

Use nlp.pipe() -- spaCy 2 really needs to minibatch to get performance atm. Even with only 1 thread the performance is much better with the minibatching, because there are fewer total function calls, and there's per-vocabulary-item caching for the word vector calculations.

If you use .pipe(), the single-thread performance should be within 2-3x of the 1.9 model. The multi-thread performance is I think 3-4x behind. Note that you want to be setting OPENBLAS_NUM_THREADS to something like 6. Currently if I don't constrain the threads, the stupid thing runs like 20 threads and they all contend uselessly :(. I haven't figured out a run-time fix for this yet (help wanted).

If you use a GPU you should be able to get performance roughly on par with the 1.9 model.

In future I'll write more of the network in Cython, to cut some of the Python function calls. This will improve performance on low batch sizes. For now it's best to just use nlp.pipe() to process the text in minibatches.
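
A minimal sketch of that, continuing the snippet above (the batch size here is illustrative, not a tuned recommendation):

# With OPENBLAS_NUM_THREADS constrained in the environment, per the advice above.
texts = (d['text'] for d in docs)
for d, sd in zip(docs, nlp.pipe(texts, batch_size=64)):
    d['mentions'] = [{'start': ent.start_char,
                      'stop': ent.end_char,
                      'label': ent.label_}
                     for ent in sd.ents]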

@Nicholas-Mitchell It's a long story, but basically "Because neural networks".

andychisholm commented 6 years ago

@honnibal that makes a lot of sense. Unfortunately I haven't been able to get much of an improvement going through .pipe. A little random walking around (batch_size, n_threads, OPENBLAS_NUM_THREADS) doesn't seem to help things much either.

Interestingly, the machine is configured with a GPU (GTX980) which does seem to get used (a process shows up in nvidia-smi when running the test). However, the performance is the same as an identically configured box without the GPU - which all seems a little suspicious to me, I wonder if something else is nuking performance?

RenatoOAAguiar commented 6 years ago

Hi everyone. I was trying to train a new model in spaCy v2 using the same approach I used for spaCy 1.8. The approach comes from spacy-dev-resources: I was trying to generate a new vocab using a Wikipedia dump file. In the beginning everything looked fine, but when execution reaches line 65 of wiki2text.py, an error occurs. That line is: nlp = spacy.load(lang, parser=None, tagger=None)

It can't find the model. In this case my lang is pt, and sure, the model really doesn't exist yet, because I'm creating it now. But in my experience with spaCy 1.8, this call would still load the tokenizer even if the model doesn't exist.

The pt language exists in the spaCy folders, like the other languages. Am I wrong, or should it load the tokenizer?

If the tokenizer isn't loaded, I won't be able to execute this sequence of code from wiki2text.py:

def main(dump_path, out_dir, lang, cleaned=True):
    reader = WikiReader(dump_path)
    nlp = spacy.load(lang, parser=None, tagger=None)
    for id, title, content in tqdm(reader):
        text_content = extract_text(content, nlp, cleaned)
        if text_content:
            write_file(id, out_dir, text_content, title)

Thanks for your attention and your help.

honnibal commented 6 years ago

There's a new function spacy.blank() that does what you want -- it was always confusing that spacy.load() wouldn't fail if it couldn't load the model.
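
So the load call in wiki2text.py could become something like:

nlp = spacy.blank('pt')   # language data and tokenizer only, no trained model needed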

RenatoOAAguiar commented 6 years ago

Thanks @honnibal I'll try that!

christian-storm commented 6 years ago

I've been trying to debug spaCy to see why it sets additional sent_start values to True even though sentence segmentation was previously done, e.g. as step 0 of the pipeline, as illustrated in the example.

I was hoping to get a pointer for how you/anyone debugs python/cython code, e.g., within pycharm on macos. I've combed the web and tried countless things and so far am coming up short.

honnibal commented 6 years ago

Mostly I just add print statements and write tests. I've only used an actual debugger once or twice.

What's the problem, specifically? Whatever you're doing for Python should normally work. If you're in a nogil block you can use with gil: to reacquire the GIL so you can print etc.
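
For example (an illustrative Cython snippet, not from spaCy's code base):

cdef void report_progress(int n_done) nogil:
    # Reacquire the GIL just long enough to call into Python's print.
    with gil:
        print("processed %d items" % n_done)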

christian-storm commented 6 years ago

Despite my efforts I was unable to get a working combo of python/cython/gdb for debugging. Ugh, what a time suck. I was hoping to do so before responding. Alas, I'm stuck with adding print statements (thanks for the with gil: trick!) to try and figure out what is going on. The drag with this method is that you have to recompile each time, which really slows debugging down. I digress...

I'm using 2.0.0a14 and had thought that once sentences are segmented, the sent_starts would be set in stone. However, it appears that they aren't. I wrote a script to show what I mean. I added print statements everywhere sent_start is changed and found the culprit. Is this the intended behavior?

import spacy

def main():
    text = "Of course, I'm older but I would treat the occasion with the same passion! I really would."

    nlp = spacy.load('en_core_web_sm')
    nlp.pipeline.insert(0, endpunct_segmentation)
    doc = nlp(text)
    for tok in doc:
        print("{}) {} sent_start={}".format(tok.i, tok.text, tok.sent_start))
    print([sent.text_with_ws for sent in doc.sents])

def endpunct_segmentation(doc):
    for token in doc[:-1]:
        if token.text in ['.', '?', '!']:
            print("Set {} or token[{}].sent_start = True".format(doc[token.i +1].text, token.i + 1))
            doc[token.i + 1].sent_start = True
    return doc

if __name__ == '__main__':
    main()

The output is

> Set I or token[17].sent_start = True
> _state.set_break set token[6].sent_start to True
> _state.set_break set token[17].sent_start to True
> 0) Of sent_start=False
> 1) course sent_start=False
> 2) , sent_start=False
> 3) I sent_start=False
> 4) 'm sent_start=False
> 5) older sent_start=False
> 6) but sent_start=True   *** Set by [arc_eager.transition](https://github.com/explosion/spaCy/blob/develop/spacy/syntax/arc_eager.pyx#L256) -> [syntax/_state.pxd](https://github.com/explosion/spaCy/blob/develop/spacy/syntax/_state.pxd#L357)
> 7) I sent_start=False
> 8) would sent_start=False
> 9) treat sent_start=False
> 10) the sent_start=False
> 11) occasion sent_start=False
> 12) with sent_start=False
> 13) the sent_start=False
> 14) same sent_start=False
> 15) passion sent_start=False
> 16) ! sent_start=False
> 17) I sent_start=True
> 18) really sent_start=False
> 19) would sent_start=False
> 20) . sent_start=False
> ["Of course, I'm older ", 'but I would treat the occasion with the same passion! ', 'I really would.']

samrensenhouse commented 6 years ago

I swapped to spacy2 to see if I could get multiprocessing to work but I still get the same error: TypeError: can't pickle spacy.tokens.doc.Doc objects

Is this not supported now?

honnibal commented 6 years ago

@christian-storm How long is compiling taking for you? It should only compile the module affected.

I'm not sure this is the information you're looking for, but:

spaCy predicts the sentence boundaries jointly with the dependency parser. There should be a mode to pre-set the sentence boundaries and disable the Break transition, but currently there's not. The following should prevent the parser from ever inserting sentence boundaries:


# Make the Break transition permanently invalid, so the parser never
# inserts sentence boundaries itself.
cdef bint never_valid(const StateC* st, attr_t label) nogil:
    return 0

cdef class ArcEager:
    def disable_break(self):
        for i in range(self.n_moves):
            if self.c[i].move == BREAK:
                self.c[i].is_valid = never_valid

Does that help?

christian-storm commented 6 years ago

I just tested it and pip install -e . takes 2+ minutes from a fresh checkout, 2 secs for a rerun without any changes, and 8 secs for a one line change to doc.pyx. So, to your point, not that bad. I was more referring to having to cycle through the process of adding print statements -> recompiling -> rerun when one wants to know what a certain variable is set to. Also, for those new to the code base it really helps to be able to step through to see what the code is doing. If/when I get an env working for debugging cython I'll post how I did it.

Regarding the sentence segmentation...thanks for the patch! I'm probably being thick but where should disable_break be called? I was hoping to make it an argument since I'm allowing for one to choose tokenizers and sentence segmenters in my code....spacy's as the defaults of course! :) The use case being tests that rely on certain tokenizers (e.g., punkt) and sentence segmentations (e.g., newline segmented) to reproduce published results.

FYI In reading the docs "Let's say you want to implement custom logic to improve spaCy's sentence boundary detection." it first led me to believe the external segmentation would replace spacy's sbd. Now I can also see how this sentence might be interpreted as an external sbd would augment/improve spacy's sbd, i.e., provide an external cost signal for spacy to use when determining sbd and possibly help guide its parse. For instance, in my example it would've signaled to the parser that splitting on the lowercase but might not be a great idea. Currently, all it is doing is adding boundaries to the existing set of boundaries found by spacy. Semantics...

Along the same lines, what is SentenceSegmenter used for? I can't find any references to it in the code.

honnibal commented 6 years ago

Regarding the sentence segmentation...thanks for the patch! I'm probably being thick but where should disable_break be called? I was hoping to make it an argument since I'm allowing for one to choose tokenizers and sentence segmenters in my code....

That's a difficult question, really! I've been struggling with this sort of thing a bit. I'm trying to avoid adding too many arbitrary methods to the Language object, because it shouldn't have to know too much about how the pipeline works.

Alternate solutions feel pretty complicated though. What would you like the calling code to look like?

FYI In reading the docs

Apologies here...I think this is a case where the docs describe the intended, but not current behaviour :(

christian-storm commented 6 years ago

Apologies for the delay in responding. I've spent the last couple of days poring over the code to arrive at a more informed answer to your question: "What would you like the calling code to look like?"

IMHO sentence boundary detection (SBD) suffers from having been conflated with parsing: the two are inseparable by design. I think SBD should be allowed to be its own pipeline component alongside tokenization. SBD keeps cropping up (#235, #1032, #453, etc.) with a wide range of valid use cases. As further fuel for the fire, from my own tests comparing the segmentations of spaCy, CoreNLP, segtok, pragmatic segmenter, and punkt on Wikipedia dumps, spaCy disagrees the most with the others. That being said, co-training for SBD/dependencies is really neat and useful in many use cases, e.g., speech data.

As you well know sentence boundaries are immutable because if they're changed they'd leave the parse in an inconsistent state. A number of quick and dirty solutions have been offered but they all fall short for those of us who want the dependency parse to encapsulate spacy's notion of a sentence.

My naive question is why not have two separate parsers: One trained with predicting BREAK (backward compatibility) and another "normal one" without? Out of curiosity, I wonder if you've tested whether the parsing accuracy is higher without the BREAK transition?

So, to answer your question, one could add an SBD component that sets the sent_starts before parsing and sets doc.is_sbd = True when finished. Parser selection would then be based on that.

honnibal commented 6 years ago

A few points quickly on this -- apologies for not engaging deeply yet. Edit: Okay this turned out to be long, but it's still quite stream-of-conscious.

  1. There was a bug in the Break oracle that was fixed pretty recently. So if you tested before 2.0.0a4 the results may be different.

  2. Wikipedia's not a great test, because it's pretty reliably edited. That's not so true of a lot of web text.

  3. I'd like to lean towards a design where you load up a pipeline, and that pipeline is fully configured with few switches or toggles. To get different behaviours, you load a different pipeline.

  4. We can make the sentence boundaries mutable after parsing. We just need some logic to cut the tree.

A couple of specific replies:

My naive question is why not have two separate parsers: One trained with predicting BREAK (backward compatibility) and another "normal one" without?

Sure --- and actually we'd like to have these trained on different text types. This has always been the plan, but we found we really wanted to do annotations to train some of these models, so we decided to get Prodigy finished first.

What I don't think we want is to have two parses shipped as part of a single pipeline, and then decide between them at runtime based on the document state.

If you do really want this switching strategy, I think a pretty good way to implement it would be to write a component that wrapped N parsers, and delegated to one of them based on whatever logic. The switcher component would be added to the pipeline.

Out of curiosity, I wonder if you've tested whether the parsing accuracy is higher without the BREAK transition?

My tests were quite some time ago, so -- not recently. Here's the general finding, which I suspect still holds true: the most important thing is that the pipeline is trained end-to-end, so that the unsegmented text is seen during training. If you train with gold-standard sentence boundaries, segmentation errors will really throw the parser off. Even if your segmenter is 99.5% accurate on a per-token basis, the errors will be frequent enough to move parse accuracy by quite a lot. The transition-based parser is quite sensitive to this. If it ends up in an unexpected state, it can make poor choices and start diverging from the correct analysis.

We don't necessarily need to use the Break transition to train the parser on text with incorrect segmentation. Instead, we could figure out how the oracles should look for the other actions, if the segment boundaries are incorrect. I thought about this problem a bit when I was doing joint speech parsing and segmentation. I keep forgetting that my work on this with Mark Johnson was never published[1] --- I've attached the paper, although it's not that helpful on this.

It will take a bit of thinking to get the oracle for the incorrectly segmented text correct. But once we have this we can train parsers that condition on pre-processed text, which should be helpful.

There are some other advantages to getting this right going forward. Not all treebanks make the source text available, making the joint training strategy hard. Parsing is now both slower and relatively less important than it was, making it important to support good sentence boundary detection.

[1] A bit of backstory for why this draft was never published... When we finished this work the next logical conference was EMNLP 2014, which was being held in Qatar. I didn't speak up before the decision was made, so I can only complain so much --- but I decided not to participate in the event. The paper was therefore submitted to the new journal TACL. The reviews were good, but the editor said TACL wanted to only publish quite notable work. By the time we got the reviews back, I'd left my post-doc, had released spaCy, and had no time to revise the paper. I also predicted interest would be low, as the model would look out-dated given the lack of deep learning.

segment_tacl_submission.pdf

benhachey commented 6 years ago

Hi guys - Very excited about spaCy v2. Can you share a rough prediction for a stable release?

vishnune-kore commented 6 years ago

@ines, @honnibal I don't see any support for the xx NER model after a9. Does this mean it's going to be discontinued?

christian-storm commented 6 years ago

There was a bug in the Break oracle that was fixed pretty recently. So if you tested before 2.0.0a4 the results may be different.

Fair enough, I gave it another whirl and report the more current results.

Wikipedia's not a great test, because it's pretty reliably edited. That's not so true of a lot of web text.

I agree and disagree. Wikipedia is a great test for reasonably edited but difficult-to-segment text because, as Wikipedia itself puts it:

...sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address – not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang.

Wikipedia is rife with citation marks[4], (parentheticals), unbalanced quotes, quotes within quotes, etc. A lot of these situations are codified in pragmatic segmenter's golden rules test set. These tend to be the cases where the various segmenters disagree.

By "edited" I'm not sure if you are referring to text that is grammatical, is comprised of complete sentences (aka not lists, twitterese, tables of contents, etc.), properly sentence cased, and/or accurately punctuated. I'm sure spaCy would be superior in certain circumstances and inferior in others. The point I was trying to make is that for a given corpus and task there will likely be one SBD algorithm that is the Right Tool For The Job. Sometimes even the notion of a "sentence" isn't well defined: She turned to him and said, "This is great (thinking to herself just the opposite)." and then walked away. or The results were state of the art. [1] As depicted in ... Should a segmenter break up all spans that have a subject and a predicate, or honor quotes and allow for sentences within sentences? Is [1] part of the first sentence or a sentence unto itself? The notion of what constitutes a sentence is very task dependent.

I'd like to lean towards a design where you load up a pipeline, and that pipeline is fully configured with few switches or toggles. To get different behaviors, you load a different pipeline.

Hah! I was going to mention this last time but didn't want to muddy the waters any further. I hear where you are coming from...it is clean that way. As I've been building these I've had the need to pass parameters though. For example, pass a param that toggles whether a token is split when sbd finds a boundary mid-token, e.g., "... from reference.[1] The..." Spacy treats reference.[1] as one token but sbd says the boundary is after the period. I suppose one can create two wrappers with one passing T and the other F. However, what happens when you have a number of parameters...things get ugly quick. The other solution that isn't allowed right now is to run the sbd before tokenization which would be nice so that the doc doesn't have to be retokenized when a token needs splitting (I wish there was a doc.split to complement the doc.merge). That being said, I think you might say I have to tweak the tokenizer so that the citation is cleaved off as its own token and that would be a fair argument. Tokenization and sbd are very intertwined...for instance spacy will break up "...test.Please" but not "...test.please".

We can make the sentence boundaries mutable after parsing. We just need some logic to cut the tree.

Cool! There would need to be merge logic as well, no?

we'd like to have these trained on different text types. This has always been the plan, but we found we really wanted to do annotations to train some of these models, so we decided to get Prodigy finished first.

Makes total sense!

What I don't think we want is to have two parses shipped as part of a single pipeline, and then decide between them at runtime based on the document state.

If you do really want this switching strategy, I think a pretty good way to implement it would be to write a component that wrapped N parsers, and delegated to one of them based on whatever logic. The switcher component would be added to the pipeline.

I hadn't thought of that approach and agree that it is the better way to go. Without the ability to pass params for switching, I guess one would have to rely on storing these variables in user_data?

It will take a bit of thinking to get the oracle for the incorrectly segmented text correct. But once we have this we can train parsers that condition on pre-processed text, which should be helpful.

This would be very powerful if I'm understanding it correctly.

Thanks for passing along your article. I need to spend a little time with it to really grok it.

honnibal commented 6 years ago

@benhachey You can see the current TODOs here: https://github.com/explosion/spaCy/projects . The "Stable" board is the easiest to look at, because it has the per-class items.

The most difficult ticket is the Matcher operators one. We might drop that ticket from the release, because the same bug exists in 1.x, and I don't think it changes the API.

The sentence boundary stuff being discussed here is another tricky ticket.

@vishnunekkanti It's not being discontinued. We just didn't have enough automation around the model training, and the nightlies were often breaking model compatibility, so we fell behind on training all the models.

@christian-storm

The point I was trying to make is that for a given corpus and task there will likely be one sbd algorithm that is the Right Tool For The Job

Fair enough. You don't need to resell me on the need for the SBD component :). I'm on board with this.

For example, pass a param that toggles whether a token is split when sbd finds a boundary mid-token, e.g., "... from reference.[1] The..." Spacy treats reference.[1] as one token but sbd says the boundary is after the period. I suppose one can create two wrappers with one passing T and the other F. However, what happens when you have a number of parameters...things get ugly quick.

Why pass these on a per-document basis, though? If the parameters are per component there's no problem: you assemble a pipeline, save it to disk fully configured, and wrap it as a pip package. If you need to have code to assemble the pieces, you can put the logic in the package's load() function, along with any necessary parameters. Then you can do nlp = my_pipeline.load().

If we need to pass a lot of per-document arguments, I think we should have Language.pipe() and Language.__call__ take a pipe_kwargs argument, keyed by the component name. This would let us namespace the settings for each component. I think this is important because passing a flat ball of params to each component will surely get ugly. When adding another component, you'll have to care about how all the other parameters are named, and you'll find "all the good names are taken".
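
Purely as a hypothetical sketch of that proposal (pipe_kwargs doesn't exist; the component names and settings here are made up):

doc = nlp(text, pipe_kwargs={
    'sbd': {'split_tokens': True},
    'parser': {'respect_sent_starts': True},
})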

I hadn't thought of that approach and agree that is the better way to go. Without the ability to pass params for switching I guess one would have to rely storing these variables in user_data?

If you want one pipe to send a message to a future pipe, then yes you could set the flag in .user_data. If it were me, I'd sleep better if I could base the downstream logic on the actual annotations --- but that's up to you.

Thanks for passing along your article. I need to spend a little time with it to really grok it.

It's unlikely to be a good use of your time :). I just attached it for completeness.

christian-storm commented 6 years ago

Apologies if I came across as overselling sbd ;) I guess I wanted to air some of my thinking on the matter.

Why pass these on a per-document basis, though? If the parameters are per component there's no problem: you assemble a pipeline, save it to disk fully configured, and wrap it as a pip package. If you need to have code to assemble the pieces, you can put the logic in the package's load() function, along with any necessary parameters. Then you can do nlp = my_pipeline.load().

I agree that would work. It does seem like a lot of effort if, in the hypothetical, the only difference is one boolean flag. Not very DRY. For scripts this might be acceptable but not for servers. An industrial perspective might be helpful here.

My reaction is based in part on what I encountered while CTO of Turnitin, a company I co-founded. We processed 10-100's of millions of documents a day ranging from web pages, periodicals, books, student papers, ... in 30+ languages. An architectural choice I often had to face was choosing between stables of task specific NLP servers with an intelligent request router in front of it or a pool of generalized servers that can handle any request. The former is easier to tune and optimize/provision for at the expense of complexity and being brittle. The latter is a lot easier to scale, manage, and maintain high availability for in order to meet the almighty SLA (uptime, request turnaround time, etc.). Unless there are serious budget constraints, over-provisioning solves for the latter and more than makes up for itself in a reduced sys-admin/devops costs and increased uptime.
As a golden rule, once a server is spun up it should never have to touch disk, e.g., load a new model. As of right now, with spaCy, one already has to load a new model for each language. So that would already require 30+ unique instances. Now multiply that by the variations in SBD, tokenization, etc. The combinatorics are scary.

Not to oversell or anything :)

If we need to pass a lot of per-document arguments, I think we should have Language.pipe() and Language.__call__ take a pipe_kwargs argument, keyed by the component name. This would let us namespace the settings for each component. I think this is important because passing a flat ball of params to each component will surely get ugly. When adding another component, you'll have to care about how all the other parameters are named, and you'll find "all the good names are taken".

A big fat yes to all that.

If you want one pipe to send a message to a future pipe, then yes you could set the flag in .user_data. If it were me, I'd sleep better if I could base the downstream logic on the actual annotations --- but that's up to you.

As usual, you are absolutely correct. There should only be one method of maintaining state in the pipeline and that should be the doc annotations.

It's unlikely to be a good use of your time :). I just attached it for completeness.

The more I know the more I (hopefully) can be useful. I'm really keen on spacy, respect all the work you guys have done, and hope to contribute to its success.

christian-storm commented 6 years ago

I've been thinking about the pipeline/factories/etc. a lot and wanted to throw an outsiders idea at you to see what you think. I tried my best to draw from my experience of having used oodles of terrible to fantastic 3rd party libraries/services/etc. and architecting and using a team to build, configure, deploy, and maintain a large distributed NLP system.

In short, I was wondering if you considered making the pipeline explicitly pluggable? It seems the current system is half way there but, I feel, suffers from growing out of making v1 more extensible rather than starting from a fresh redesign. The good news is that I think most of the ingredients are already in place but need to be further codified around one pipeline factory pattern. I'm sure I'm missing some bits and pieces, overlooked some things, and have baked in a few inconsistencies but this is the general idea and motivation for it.

The high level idea is that each factory would explicitly define what it expects as input in order to perform its task, and define the output it'll generate: the makings of a well defined API! Factories are coded to explicitly state the exact variables (and variable types for good measure) they require from the other process(es). A pipeline is then instantiated as an unordered list of factories that is compiled into a runnable pipeline that takes text as its input. A sort of makefile, if you will. By design the 'computation graph' needn't be linear (branches/separate cores to speed up processing) but that's getting ahead of ourselves. First let me whet your appetite with what I think a great pluggable system looks like in Python: Luigi. Just s/workflows/pipelines/ and s/tasks/factories/. A lot more grokable for a newbie like me to get up and running.

An object, e.g., doc, would still be the lingua franca of the pipeline, although I'd open up the pipeline before tokenization to include pre-processing, i.e., text -> text, so that factories could precede tokenization. What a pipeline expects as its object type would be defined, i.e., bytes, text, doc object, ... To allow for factories that run pre-tokenization, I'd associate the vocab with a run of the pipeline and have it referenced by the doc. Similar to the current case but semantically different. The same is true when a factory has a parameterized model: it too would be associated with a run of the pipeline and referenced by the doc. If a new vocab and/or factory model is required for a new style, content area, language or dialect of writing, speech-to-text output, etc., it would be lazy loaded on the first request/cache miss and put on an LRU for later use so memory can be managed. To appease the industrial titans of the world who shudder to think that any request would suffer from a lazy load (what about their SLA!), there would be an option to pre-warm the cache. Document level annotations/data would be stored/accessed from the factory namespace, and a namespaced pipe_kwargs could be passed to a pipeline run to alter its default behavior, switch out models, vocabs, flip some switches, etc.

Each factory would have a variable namespace split into public and private parts to delineate the two. I'd further delineate the public namespace into input and output too, but that may be me being too anal. __init__ would require the initial vocab and model path (if it's a parameter driven model), though I could see those being defined at run time as well; requires would define all the factory.vars that are required; __call__/__run__ would run the factory on the input; and output would define the public output variables and return the object (doc, text, ...).

There are a number of ways to list and document the factory requirements. Luigi does it one way, but seemingly annotations would work as well. Docstrings would naturally live there too.

Upon defining a pipeline, it would be compiled at run time to figure out the ordering of components and whether all the requirements are met. If ill-defined, it would Fail Fast instead of dying somewhere down the line. Well, it could still be ill-defined, e.g., using a previous factory's private variable that was removed in a new version instead of asking the factory's maintainer to add that variable to the public interface. However, at least there is a clear contract, and a way to define and refine the documentation of that contract. To ferret out the rule benders, I suppose one could get draconian in 'test' mode and clear or rename all but the public variables to ensure compliance.
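
A purely hypothetical sketch of the idea (none of these names exist in spaCy):

class SentenceSegmenterFactory(object):
    name = 'sbd'
    requires = {'doc.tokens'}          # annotations this factory consumes
    provides = {'token.sent_start'}    # annotations this factory produces

    def __init__(self, vocab, model_path=None):
        self.vocab = vocab
        self.model_path = model_path   # versioned model data, if any

    def __call__(self, doc):
        for token in doc[:-1]:
            if token.text in ('.', '?', '!'):
                doc[token.i + 1].sent_start = True
        return doc


def compile_pipeline(factories):
    # Fail fast: every factory's requirements must be provided upstream.
    provided = {'doc.tokens'}          # the tokenizer's output is taken as given
    for factory in factories:
        missing = factory.requires - provided
        if missing:
            raise ValueError("%s is missing %s" % (factory.name, sorted(missing)))
        provided |= factory.provides
    return factories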

It would make the code a lot easier to understand (fewer questions and more contributors), properly silo each component, and allow for shareable factories. Getting a little ahead of myself, but one could imagine the output method routing requests to different machines via protocol buffers/JSON. spaCy would be able to be deployed in a microservices based architecture like this, but run just as easily as a monolithic application on a developer's laptop with the flip of a devel_mode switch.

I hope this was somewhat intelligible and helpful.

Thanks for listening.

honnibal commented 6 years ago

I think this is interesting. A quick question.

What would you do about the "soft dependency" you get from adapting a machine learning model to its input? If I train the dependency parser on tokenized and tagged text, I don't just want "tokenized and tagged text". I ideally want exactly the output of exactly that tokenizer and tagger. I might settle for something similar, but in other cases I might not.

kootenpv commented 6 years ago

I believe it would warrant a new issue.

christian-storm commented 6 years ago

I probably haven't had enough coffee yet, but I'm not 100% sure I get the issue you are raising. When you say "adapting a ML model to its input", do you mean further tuning an established model to adapt it to a new data set/domain, thereby creating a new model? When it comes to not getting exactly the output you were expecting, I'm failing to conjure up a situation where that would happen.

honnibal commented 6 years ago

Well, let's take the case of the dependency parser in v1. If we were to list out the input state it wants naively, it'd be something like:

Now, we can obviously ask for the POS tags to reference a particular scheme, lexemes from the right vocabulary, etc. But actually our needs are much more specific. If I go and train the parser with tags produced by process A, and then send you the weights, and you go and produce tags using process B, you might get unexpectedly bad parsing results, even though you gave it inputs that met the specifications the component declared. It could even be that your Process B tagger was much more accurate than the Process A tagger the parser was trained with. So long as Process B is just different, we could get train/test skew that makes the model perform poorly.

Another example, maybe simpler: Let's say you've got a movie review analysis model, that tries to assign star ratings to individual components, e.g. acting, plot, etc. You use NER labels as features, and the model ends up learning that reviews which mention people a lot are more likely talking about the acting. You see your NER model is making lots of mistakes, so you replace it with a better one --- but the better NER model produces really terrible analysis results. Why? Well, the better NER model might be detecting twice as many person entities in the reviews. You can't just plug the new entity recogniser into the pipeline without retraining everything downstream of it --- if you do, your results will be terrible.

Another example: If you want two tokenizers, one which has "don't", "isn't", etc as one token and another which has it as two tokens, you probably want two copies of the pipeline. If the downstream models haven't been trained with "don't" as one word, well, that word will be completely unseen --- so the model will probably guess it's a proper noun. All the next steps will go badly from there.

The problem is more acute with neural networks, if you're composing models that should communicate by tensor. If you train a tagger with the GloVe common crawl vectors, and then you swap out those vectors for some other set of vectors, your results will probably be around the random chance baseline.

So if you chain together pretrained statistical models, there's not really any way to declare the "required input" of some step in a way that gives you a pluggable architecture. The "required input" is "The result of applying exactly these components, and no others".

That's also why we've been trying to get this update() workflow across, and trying to explain the catastrophic forgetting problem etc. The pipeline doesn't have to be entirely static, but you might have to make updates after modifying the pipeline. For instance, it could be okay to change the tokenization of "don't" --- but only if you fine-tune the pipeline after doing so.
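
A rough sketch of that update() loop, following the v2 training docs (exact helpers and signatures may differ in the alpha; TRAIN_DATA stands for your own (text, annotations) pairs):

import random

from spacy.util import minibatch

optimizer = nlp.begin_training()   # (re)initialises the models; fine-tuning existing weights differs
for i in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, drop=0.5, sgd=optimizer, losses=losses)
    print(i, losses)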

christian-storm commented 6 years ago

I appreciate the examples; they surfaced some very valid issues. The issues you raise fall into a few camps: preventing folks from shooting themselves in the foot, defining expected behavior/model selection, and explicitly capturing static dependencies. Let's see if I can motivate some solutions.

What I defined before was an api between components. Much like how a bunch of microservices would all agree on what protobuf definitions and versions they'll use to communicate with each other. It is a communication contract full stop. You are absolutely correct, it does nothing to define the expected behavior of the computational graph in turning inputs to outputs. Nor does it define what dependencies are required by each component. We have the technology. We can rebuild him.

What's the diff between pure software packages and ML based packages? In programming it is incumbent on the developer to pick the appropriate package for a given task by looking at the documented behavior of that component. Furthermore, unit and integration tests are written to ensure the expected behavior, or a representative sample of it, remains intact as package versions are bumped and underlying code is modified. If you are publishing a package like spacy you ensure proper behavior for the user by explicitly listing each required package and version number(s) in requirements.txt.

Riffing off your example, if a developer is selecting an appropriate tokenizer from, say, the available NLTK tokenizers, they could look at the docs to see if it splits contractions or not. Even if it was unspecified, they could figure out which one is best through some testing, or by looking at the source if it is open source. If a developer chooses the wrong tokenizer for their task and doesn't have a test suite to alert them to this fact, I would have to say the bug is on them, no? What if the tokenizer is closed source? Isn't this essentially the same black box as a ML based tokenizer? Well, not exactly. As an example, when I used SAP's NLP software the docs detailed the rule set used for tokenization. If the tokenization is learned, the rules are inferred from the data and can't be written down. With the "don't" tokenization example, how would one know that "don't" is going to be properly handled without explicitly testing it?

Expected behavior

So how does one fully specify the expected behavior of a machine learning component? As you well know, I don't think anyone has a good answer for this. In academia one details the algorithm, releases the code, specifies the hyper-parameters, the data set used to train and validate, and the summary metric scores found with the test set. This information allows one to intuit how well it may do on another data set, but there is no substitute for trying it out.

Imagine if one had access to an inventory of trained models. To select the best model for a given data set/task, one would compare the summary statistics of each model run against the test set. Likely one might even have a human inspect individual predictions to ensure the right model is being selected (for example). If the model seems like it would benefit from domain adaptation, further training in a manner that avoids catastrophic forgetting might prove effective.

As alluded to by your examples, what if the developer doesn't have a labeled test set to aid in model selection? My knee jerk reaction is that they are ill equipped to stray from the default setup. They should use Prodigy to create a test set first. To me it is equivalent to someone picking NLTK's moses over the casual tokenizer for twitter data without running tests to see which is better. This may be a bit far afield, but a solution could be to ship a model with a set of sufficient statistics that describes the distribution of the training corpus, a program to generate the statistics for a new corpus, and a way of comparing the statistics for fit/transferability (KL divergence and outliers?). For tokenization and tagging, a first approximation would be the distributions of tokens and POS tags. So if the training set didn't have "didn't" but the user's corpus does, it would alert them to that fact, and they could build a test to make sure it behaves as expected, and possibly have a motivation to further train the model. It might prevent some from shooting themselves in the foot by aiding them in the model selection process.

Versioning and dependencies

In devops one has to specify the required libraries, packages, configuration files, OS services, etc. needed to turn a bare metal box into the working environment for a certain piece of software. This is notoriously hard to do, as evidenced by the sheer number of configuration tools that exist (Puppet, cfengine, chef, etc.) and next generation tools (Docker VE, VM, ...) that give up on trying to turn the full configuration of an environment into source code. I've been in dependency hell and it sucks.

So how do we ensure that each ML model is reproducible? Much in the same way versions of spaCy depend on certain versions of the models, as defined in compatibility.json. A model would specify which GloVe vectors were used, the corpus it was trained on (or a pointer to it, e.g., OntoNotes 5.0), the hyper-parameter settings, etc. Anything and everything needed to recreate that particular version of the model from scratch. Something along the lines of dataversioncontrol. To cordon off model specific data in spaCy, the data would be stored in a private namespace for that model/factory. Better yet, to allow for shared data amongst models, like GloVe vectors, vocab, etc., the data would be stored in a private global namespace with the model instances having pointers to it from their private namespaces. Much like how doc.vocab points to a global vocab instance. The difference being that everything would be versioned (a hash over the data, with a human readable version number for good measure).
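
A purely hypothetical sketch of such a manifest (spaCy's actual meta.json is much smaller; every key beyond name/lang/version is invented for illustration):

model_manifest = {
    'name': 'en_core_web_sm',
    'lang': 'en',
    'version': '2.0.0a14',
    'vectors': {'source': 'glove.840B.300d', 'sha256': '<hash of the vectors file>'},
    'corpora': [{'name': 'OntoNotes 5.0', 'sha256': '<hash of the training corpus>'}],
    'hyper_params': {'dropout': 0.2, 'batch_size': 32},
    'pipeline': ['tagger', 'parser', 'ner'],
}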

Now let me walk through each of your examples to see how this further refined concept might address each situation.

Now, we can obviously ask for the POS tags to reference a particular scheme.

Yes, exactly. It would point to a versioned file.

But actually our needs are much more specific. If I go and train the parser with tags produced by process A, and then send you the weights, and you go and produce tags using process B, you might get unexpectedly bad results. It doesn't have to be a simple story of "process A was more accurate than process B".

I'm not entirely sure what you mean by process A and B. Corpus A and corpus B? You shouldn't ever pass weights around. You'd load model A's weights, tags, etc. and never change them. If you did, it would be a new model.

Another example: If you want two tokenizers, one which has "don't", "isn't", etc as one token and another which has it as two tokens, you probably want two copies of the pipeline.

Maybe I'm playing antics with semantics, but I wouldn't say "copies" here. One may have two separate tokenizers, or possibly one with a switch if it is rule based. With learned tokenizers, the run of the pipeline would specify which tokenizer to use. The pipeline is defined, compiled, and run. If a different pipeline is required, e.g., a different tokenizer, the pipeline is redefined and recompiled before running. The components would be cached so it would be fast and, to your point, perhaps there is a cache of predefined pipelines as well if compilation proves expensive.

If the models haven't been trained with "don't" as one word, well, that word will be completely unseen --- so the model will probably guess it's a proper noun. All the next steps will go badly from there.

Agreed. That is why model selection is so important and needs to be surfaced as a step in the development process. The situation you describe is applicable to current spaCy users. Without access to OntoNotes, how does one know if it is close enough to their domain to be effective at, say, parsing? Even if one did have access to OntoNotes, how does one judge how transferable the models are? One could compare vocabulary, tag, and dependency overlap and their frequencies. But nothing trumps a run against the test set, right?

The problem is more acute with neural networks, if you're composing models that should communicate by tensor. If you train a tagger with the GloVe common crawl vectors, and then you swap out those vectors for some other set of vectors, your results will probably be around the random chance baseline.

Yes, that would be disastrous and shouldn't be allowed or, with the limitations of python, discouraged. The vectors are part of the model definition and when loaded would reside in a private namespace. Of course, nothing can be made private in python so someone could blow through a few stop signs and still shoot themselves in the foot.

So if you chain together pretrained statistical models, there's not really any way to declare the "required input" of some step in a way that gives you a pluggable architecture. The "expected input" is "What I saw during training, as exactly as possible".

I believe what you are saying is that there is no way to ensure each model is trained on the same data set, no? In other words, to get the reported results, the "expected input" needs to be distributionally similar to the training data. If this is what you mean, one could have an optional check in the compilation step that makes sure the data sources are the same across the pipeline. This would prevent some noob, only looking at reported accuracies when creating a pipeline, from chaining together a twitter based model with a model trained on arxiv.

That's also why we've been trying to get this update() workflow across, and trying to explain the catastrophic forgetting problem etc. The pipeline doesn't have to be entirely static, but you might have to make updates after modifying the pipeline. For instance, it could be okay to change the tokenization of "don't" --- but only if you fine-tune the pipeline after doing so.

Agreed. However, if you decide to domain adapt the model, i.e., online learning with Prodigy, this should produce a new version of the model, with a model definition that points to new parameters and a new data source listing that includes the original data source and the new data source.

Despite the length of this response, what I'm talking about really isn't that complicated in concept and from what I can tell not too far afield from where spacy 2.0 is now. I'd be willing to chip in if that is helpful. It'll be much more difficult once the ship leaves port.

I'm curious to hear what you think?

honnibal commented 6 years ago

What's the diff between pure software packages and ML based packages?

The difference I'm pointing to is there's no API abstraction possible with ML. We're in a continuous space of better/closer, instead of a discrete space of match/no match.

If you imagine each component as versioned, there's no room for a range of versions --- you have to specify an exact version to get the right results, every time. Once the weights are trained nothing is interchangeable and ideally nothing should be reconfigured.

This also means you can't really usefully cache and compose the pipeline components. There's no point in registering a component like "spaCy parsing model v1.0.0a4" on its own. The minimum versionable unit is something like "spaCy pipeline v1.0.0a4", because to get the dependency parse, you should run exactly the fully-configured tokenizer and tagger used during training, with no change of configuration whatsoever.

We can version and release a component that provides a single function nlp(text) -> doc with tags and deps. We can also version and release a component that provides a function train(examples, config) -> nlp. But we can't version and release a component that provides functions like parse(doc_with_tags) -> doc_with_deps.

I've been trying to keep the pipelines shorter in v2 to mitigate this issue, so things are more composable. The v2 parser doesn't use POS tag features anymore, and the next release will also do away with the multi-task CNN, instead giving each component its own CNN pre-process. This might all change though. If results become better with longer pipelines, maybe we want longer pipelines.

christian-storm commented 6 years ago

Keeping the conversation going... I really hope this isn't coming across as adversarial or grating in any way. I actually think we are getting somewhere and agree on most things.

The difference I'm pointing to is there's no API abstraction possible with ML. We're in a continuous space of better/closer, instead of a discrete space of match/no match.

Well put, and I agree in principle, with the caveat that code is rarely so boolean. Take a complex signal processing algorithm where the function F is either learned or programmed with an analytic/closed form solution. How is the test and verify process of either really any different? Sure, each component of the latter can be tested individually. That certainly makes it easier to debug when things go south. However, a test set of Y = F(X) is as important in either case, right?

If you imagine each component as versioned, there's no room for a range of versions --- you have to specify an exact version to get the right results, every time. Once the weights are trained nothing is interchangeable and ideally nothing should be reconfigured.

Once again, on the same page, although I think there is another way to look at it. The key is defining exactly what the "right results" are. In building a ML model one uses the validation set to make sure the model is learning but not overfitting the training set. Then the test set is used as the ultimate check that the model is transferable. If one were to pull two models off the shelf and plug them together as I've been suggesting, you'd judge the effectiveness of each using a task-specific test set, and the two together using a test set that encompasses the whole pipeline, no? This happens all the time in ML, e.g., a speech to text system that uses spectrograms, KenLM, and DL. Even though the first two aren't learned, though they could be, there are a bunch of hyper-parameters that need to be "learned."

This also means you can't really usefully cache and compose the pipeline components. There's no point in registering a component like "spaCy parsing model v1.0.0a4" on its own. The minimum versionable unit is something like "spaCy pipeline v1.0.0a4", because to get the dependency parse, you should run exactly the fully-configured tokenizer and tagger used during training, with no change of configuration whatsoever.

I would agree that training end-to-end and freezing the models in the pipeline afterwards leads to the most reproducible results. If this is the intended design, one will only ever be able to disable components or append purely additive ones, e.g., sentiment.

Just to play a little devil's advocate here: spaCy promotes the ability to swap out the tokenizer that feeds the pipeline without a warning or a mention that one should retrain. Isn't this contrary to your end-to-end abstraction? What if someone blindly decides to implement that whitespace tokenizer described in the docs? To use your example, spaCy might start labeling "don't" as a proper noun, no? The same could be said about adding special cases for tokenization. You are performing operations that weren't performed on the training data!
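(For concreteness, the whitespace tokenizer swap being referred to looks roughly like the following; a sketch, assuming nlp is an already-loaded pipeline:)

from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # assume every token is followed by a single space
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)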

If the dependencies remain constant across the pipeline, I still think plugging trained models into the pipeline makes sense if one knows what one is doing, i.e., has an appropriate test harness at each step of the pipeline. On the other hand, I agree it is easy to go off the rails when components are tightly coupled, e.g., setting sent_start and making the trained parser obey those boundaries even though it wasn't trained with them. However, there are many valid cases where it makes sense, e.g., training an sbd, freezing it, and then training the remaining pipeline.

Another idea: with the pipeline versioning idea in mind, why not at least allow for pluggable un-trained models that, once trained, get frozen into a versioned pipeline? Ultimately, I'm looking for a tool that plays well with both experimentation, e.g., a new parser design from the literature, and devops. The difference is spaCy being part of the NLP pipeline versus running the entire pipeline.

We can version and release a component that provides a single function nlp(text) -> doc with tags and deps. We can also version and release a component that provides a function train(examples, config) -> nlp. But we can't version and release a component that provides functions like parse(doc_with_tags) -> doc_with_deps.

Okay, I got it :) Too much configurability can lead to bad things. But, really, why can't one version and release a component like parse(doc_with_tags) -> doc_with_deps? How is it any different from training each stage of the pipeline, freezing it, and then training the next stage of the pipeline using the same dependencies: data, tag sets, GloVe vectors, etc.? If trained end-to-end with errors back-propagated from component to component, then yes, I would agree these tightly coupled components should be thought of as one unit and domain-adapted as one unit.

Right now, if I wanted to change anything beyond the tokenizer in the pipeline, it is non-trivial. However, I'm starting to realize that I may be barking up the wrong tree here. Looking at Prodigy and the usage docs for spaCy, only downstream classification models (sentiment, intent, ...) are ever referenced. What if I want to add a semantic role labeler that requires a constituency parse? Or better yet what if someone publishes a parser that is much more accurate and I really could use that extra accuracy? I guess I'm back to building my own NLP pipeline.

I've been trying to keep the pipelines shorter in v2 to mitigate this issue, so things are more composable.

Yes! That is the word- composable. "A highly composable system provides components that can be selected and assembled in various combinations to satisfy specific user requirements."

That's it! I would love a world where I can truly compose a NLP pipeline. Analogous to how Keras allows you easily build, train, and use a NN; just one level of abstraction higher.

I don't see how "shorter" pipelines are more composable, though. Forgive me if I'm wrong, but I don't really see any composability in spaCy at the moment. Maybe configurability? Though one gets the impression from the docs ("you can mix and match pipeline components") that the vision is to be able to compose pipelines that deliver different behaviors (specific user requirements).

The v2 parser doesn't use POS tag features anymore, and the next release will also do away with the multi-task CNN, instead giving each component its own CNN pre-process. This might all change though. If results become better with longer pipelines, maybe we want longer pipelines.

I wish I knew the code well enough to react.

I'm already in a fairly precarious position, needing different tokenizers and sentence boundary detectors, and there isn't a clear way to add these components. With your previously proposed solution of breaking and merging the dependency tree to allow for new sentence boundaries, what would that do to accuracy? Isn't this exactly the kind of tinkering with a trained model you are trying to avoid?

Once again, thanks for engaging Matthew.

honnibal commented 6 years ago

Keeping the conversation going...I really hope this isn't coming across as adversarial or grating in any way. I actually think we are getting somewhere and agree on most things.

No, not at all -- I hope I'm not coming across as intransigent :)

Just to play a little devil's advocate here: spaCy promotes the ability to swap out the tokenizer that feeds the pipeline without a warning or a mention that one should retrain. Isn't this contrary to your end-to-end abstraction? What if someone blindly decides to implement that whitespace tokenizer described in the docs? To use your example, spaCy might start labeling "don't" as a proper noun, no? The same could be said about adding special cases for tokenization. You are performing operations that weren't performed on the training data!

I do think this is a potential problem, and maybe we should be clearer about the problem in the docs. The trade-off is sort of like having internals prefixed with an underscore in Python: it can be useful to play with these things, but you don't really get safety guarantees.

Right now, if I wanted to change anything beyond the tokenizer in the pipeline, it is non-trivial. However, I'm starting to realize that I may be barking up the wrong tree here. Looking at Prodigy and the usage docs for spaCy, only downstream classification models (sentiment, intent, ...) are ever referenced. What if I want to add a semantic role labeler that requires a constituency parse?

We don't really have a data structure for constituency parses at the moment, or for semantic roles. You could add the data into user_data. More generally though:

Or better yet what if someone publishes a parser that is much more accurate and I really could use that extra accuracy? I guess I'm back to building my own NLP pipeline.

Well, not really? You could subclass NeuralDependencyParser and overwrite the predict or set_annotations methods. Or you could do neither, and add some function like this to the pipeline:


from spacy.attrs import HEAD, DEP

def my_dependency_parser(doc):
    parse = doc.to_array([HEAD, DEP])
    # Set every word to depend on the next word. HEAD values in this array are
    # offsets relative to each token, so +1 means "the next token"; the last
    # token is left pointing at itself, which makes it the root.
    for i in range(len(doc)):
        parse[i, 0] = 1 if i < len(doc) - 1 else 0
    doc.from_array([HEAD, DEP], parse)
    return doc
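A hedged sketch of how such a function could then be dropped in (assuming the small English model; the custom component runs last and simply overwrites the heads the built-in parser produced):

import spacy

nlp = spacy.load('en_core_web_sm')
nlp.pipeline.append(my_dependency_parser)   # the pipeline is just a list of callables
doc = nlp(u"This is a test.")
print([(t.text, t.head.text) for t in doc])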

nlp.pipeline is literally just a list. Currently the only assumption is that the list entries are callable. You can set up your own list however you like, with any or all of your own components. You could have a pipeline component that predicts the sentence boundaries, creates a sequence of Doc objects using slices of the Doc.c pointer for the sentences, and parses each sentence independently:


# From within Cython

class SentenceParser(object):
    def __init__(self, segmenter, parser):
        self.segment = segmenter
        self.parse = parser

    def __call__(self, doc):
        sentences = self.segment(doc)
        cdef Doc subdoc
        for sent in sentences:
            subdoc = Doc(doc.vocab)
            subdoc.c = &doc.c[sent.start]
            subdoc.length = sent.end-sent.start
            self.parse(subdoc)
        return doc

I haven't tested this, but in theory it should work?

christian-storm commented 6 years ago

nlp.pipeline is literally just a list. Currently the only assumption is that the list entries are callable. You can set up your own list however you like, with any or all of your own components.

I totally get how pipelines work under the hood now. But it isn't as simple as that, right? Which brings me back to what started all this for me. If it were that easy, set_factory would be as trivial as adding a callable function to the pipeline list (#1357) and I would be able to set sentence boundaries without new ones "magically" being created.

I appreciate you sharing the recipes for how you would do it. However, this is exactly what I was trying to avoid. As part of this exercise I am now more familiar with the code, so it is a more tenable solution. I fear you are going to leave behind a lot of talented people who could contribute to spaCy, and box out people who find spaCy unfit for their task. Most researchers won't crack the hood open and take the time to learn Cython and the inner workings of the spaCy engine just so they can add or modify a part. I think there is an opportunity for spaCy to create an ecosystem much like scikit-learn's, which currently has 932 contributors and a clear path to becoming one.

At any rate, I'll get off my soapbox now. I'm anxiously awaiting how, or whether, you'll solve the sbd issue. As of right now I'm dead in the water with spaCy because of it. Trying to decide whether to move on or hang tight.

honnibal commented 6 years ago

Well, I think there's a mix of a couple of issues here. One is that the SBD stuff is legit broken at the moment --- it's one of the tickets blocking spaCy 2 stable. Similarly the set_factory thing doesn't work as advertised at the moment either.

But the more interesting things are these deeper design questions: how the pipeline works, to what extent we should expect components to be "hot swappable", how versioning should work, whether we can have a pluggable architecture, etc.

I agree that having me suggest Cython code isn't a scalable approach to community development :p. On the other hand, some of the problems aren't scalable/general here --- there are specific bugs, for which I'm trying to give specific mitigations.

About the more general questions: I think we should probably switch to using entry points to give a more explicit plugin infrastructure, for both the languages and the components. We also plan to have wrapper components for the common machine learning libraries, to make it easy to write a model with say PyTorch and use it to power a POS tagger. The next release of the spaCy 2 docs will also have more details about the Pipe abstract base class.
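For what it's worth, the generic setuptools entry-points mechanism being alluded to looks something like this; the group name spacy_plugins and the module/function names below are purely hypothetical, not an existing spaCy API:

# setup.py for a hypothetical third-party component package
from setuptools import setup

setup(
    name='my-spacy-sentencizer',
    version='0.1.0',
    py_modules=['my_spacy_sentencizer'],
    entry_points={
        'spacy_plugins': [
            'my_sentencizer = my_spacy_sentencizer:create_component',
        ],
    },
)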

I don't think I want something like the declarative approach to pipelines that you mentioned above, though. I think if you want that sort of workflow, the best thing to do would be to wrap each spaCy component you're interested in as a pip package, and then use Luigi or Airflow as the data pipeline layer.

The components you wrap this way can take a Doc object instead of text if you like --- you just have to supply a different tokenizer or make_doc function. So, you don't need to repeat any work this way. You can make the steps you're presenting as spaCy pipelines as small or as big as you like. I think this will be better than designing our own pipeline management solution.
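A minimal sketch of the make_doc idea (assuming nlp is a loaded v2 pipeline): let nlp() accept a Doc produced by an earlier stage instead of raw text, so a downstream step doesn't repeat the tokenization.

from spacy.tokens import Doc

def doc_or_text_to_doc(doc_or_text):
    # pass pre-built Docs straight through; fall back to the normal tokenizer for strings
    if isinstance(doc_or_text, Doc):
        return doc_or_text
    return nlp.tokenizer(doc_or_text)

nlp.make_doc = doc_or_text_to_doc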

honnibal commented 6 years ago

There's also some relevant discussion about extensibility in #1085 that might be interesting.

christian-storm commented 6 years ago

Yeah, I had read #1085 as part of my due diligence trying to wrap my head around all this.

I'm heartened to hear sbd is on the radar and some thought is being given to entry points/pluggable architecture and a pipe abstract class. It is hard to arrive at the right abstraction but it'll be well worth it in the long run. On the same page with respect to the vision and using the right tool for the job, e.g., pipeline management. I'll stop bugging you so you and I can get back to being productive. :)

cbrew commented 6 years ago

I think this little fragment ought to work. But it doesn't. Something seems to be wrong with the saving of the added pipeline component.

I have spacy 2.0.0a16 installed in a fresh conda environment with Python 3.6.2 from conda-forge.

import spacy
import spacy.lang.en
from spacy.pipeline import TextCategorizer

nlp = spacy.lang.en.English()
tokenizer = nlp.tokenizer
textcat = TextCategorizer(tokenizer.vocab, labels=['ENTITY', 'ACTION', 'MODIFIER'])
nlp.pipeline.append(textcat)
nlp.to_disk('matter')

The error is:

Traceback (most recent call last):
  File "loadsave.py", line 10, in <module>
    nlp.to_disk('matter')
  File "/Users/cbrew/anaconda3/envs/spacy2/lib/python3.6/site-packages/spacy/language.py", line 507, in to_disk
    util.to_disk(path, serializers, {p: False for p in disable})
  File "/Users/cbrew/anaconda3/envs/spacy2/lib/python3.6/site-packages/spacy/util.py", line 478, in to_disk
    writer(path / key)
  File "/Users/cbrew/anaconda3/envs/spacy2/lib/python3.6/site-packages/spacy/language.py", line 505, in <lambda>
    serializers[proc.name] = lambda p, proc=proc: proc.to_disk(p, vocab=False)
  File "pipeline.pyx", line 190, in spacy.pipeline.BaseThincComponent.to_disk
  File "/Users/cbrew/anaconda3/envs/spacy2/lib/python3.6/site-packages/spacy/util.py", line 478, in to_disk
    writer(path / key)
  File "pipeline.pyx", line 188, in spacy.pipeline.BaseThincComponent.to_disk.lambda7
TypeError: Required argument 'length' (pos 1) not found

honnibal commented 6 years ago

@cbrew Thanks. Seems to be a bug in .to_bytes() --- the same happens even without adding the model to the pipeline.

Edit: Okay, I think I see the issue. After __init__() the component's .model attribute won't have been created yet. It's added in a second step, after you call either begin_training() or load the component with from_bytes() or from_disk().

I think this is leading to incorrect behaviour when you immediately try to serialize the class.
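Based on that, a hedged workaround sketch for the snippet above would be to make sure the component's model exists before serializing; the exact begin_training() signature in the alpha may differ, so treat this as the idea rather than a verified fix:

# assumption: begin_training() initializes each component's .model,
# after which to_disk() has real weights to serialize
nlp.begin_training(lambda: [])
nlp.to_disk('matter')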

Edit2:


>>> a = True
>>> a.to_bytes()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Required argument 'length' (pos 1) not found

So to_bytes() happens to clash with a method on the bool type. Sometimes dynamic typing feels like a terrible bad no good idea...

jamesrharwood commented 6 years ago

Is it possible to run spaCy functions on a redis-backed worker? I'm finding that my jobs disappear as soon as they reach the nlp() command. For instance:

#### worker.py

import redis
from rq import Worker, Queue, Connection

conn = redis.from_url("redis://localhost:6379")
with Connection(conn):
    worker = Worker(list(map(Queue, ['default'])))
    worker.work()
#### test.py

import spacy
nlp = spacy.load('en_core_web_sm')

def test_nlp():
    print("before NLP call")
    r = nlp(u"this is a test")
    print("after NLP call")
    return r

Running python worker.py and then the following:

from rq import Queue
from worker import conn
from test import test_nlp

q = Queue(connection=conn)
q.enqueue(test_nlp)

Results in the worker printing:

21:58:45 *** Listening on default...
21:59:07 default: test_nlp() (61b65f56-4a02-42b2-bdfd-07a5bc7bceb6)
before NLP call
21:59:08
21:59:08 *** Listening on default...

The second print statement never appears, and if I query the job status it confirms that it's started, but not finished and not failed.

Am I missing something obvious?

spacy-nightly: 2.0.0a16, rq: 0.6.0, redis: 2.10.5

UPDATE USING CELERY INSTEAD OF RQ

Using Celery instead of RQ, I now get this error:

[2017-10-12 11:12:18,412: INFO/MainProcess] Received task: test_nlp[1a48e949-b1ba-4820-aed9-b7b44a1fed1f]
[2017-10-12 11:12:18,417: WARNING/ForkPoolWorker-3] before NLP call
[2017-10-12 11:12:19,701: ERROR/MainProcess] Process 'ForkPoolWorker-3' pid:11820 exited with 'signal 11 (SIGSEGV)'
[2017-10-12 11:12:19,718: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)

This Celery thread suggests it may be a problem with Spacy not being fork safe: https://github.com/celery/celery/issues/2964#issuecomment-165569834

I tried the workaround suggested in the linked comment (importing the spacy model inside the function) but the import causes the same error.

PROBLEM SOLVED?

I tried pip install eventlet and then running the celery worker with -P eventlet -c 1000 and now the task runs successfully!

I'm not sure whether this means it's a bug within prefork or Spacy, so I'm leaving this comment here in the hope that it helps someone!

nathanathan commented 6 years ago

Sentence span similarity isn't working for me in spacy-nightly 2.0.0a16:

import en_core_web_sm as spacy_model
spacy_nlp = spacy_model.load()
sent_list = list(spacy_nlp(u'I saw a duck at the park. Duck under the limbo stick.').sents)
sent_list[0].similarity(sent_list[1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "span.pyx", line 134, in spacy.tokens.span.Span.similarity
  File "span.pyx", line 231, in spacy.tokens.span.Span.vector_norm.__get__
  File "span.pyx", line 216, in spacy.tokens.span.Span.vector.__get__
  File "span.pyx", line 112, in __iter__
  File "token.pyx", line 259, in spacy.tokens.token.Token.vector.__get__
IndexError: index 0 is out of bounds for axis 0 with size 0