explosion / spaCy

πŸ’« Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

πŸ’« spaCy v2.0.0 alpha – details, feedback & questions (plus stickers!) #1105

Closed ines closed 7 years ago

ines commented 7 years ago

We're very excited to finally publish the first alpha pre-release of spaCy v2.0. It's still an early release and (obviously) not intended for production use. You might come across a NotImplementedError – see the release notes for the implementation details that are still missing.

This thread is intended for general discussion, feedback and all questions related to v2.0. If you come across more complex bugs, feel free to open a separate issue.

Quickstart & overview

The most important new features

Installation

spaCy v2.0.0-alpha is available on pip as spacy-nightly. If you want to test the new version, we recommend setting up a clean environment first. To install the new model, you'll have to download it with its full name, using the --direct flag.

pip install spacy-nightly
python -m spacy download en_core_web_sm-2.0.0-alpha --direct   # English
python -m spacy download xx_ent_wiki_sm-2.0.0-alpha --direct   # Multi-language NER
To load the model after installation, you can either use spacy.load() with the model name, or import the model package directly:

import spacy
nlp = spacy.load('en_core_web_sm')

import en_core_web_sm
nlp = en_core_web_sm.load()

Alpha models for German, French and Spanish are coming soon!

Now on to the fun part – stickers!


We just got our first delivery of spaCy stickers and want to share them with you! There's only one small favour we'd like to ask. The part we're currently behind on is the tests – this includes our test suite as well as in-depth testing of the new features and usage examples. So here's the idea:

Submit a PR with your test to the develop branch – if the test covers a bug and currently fails, mark it with @pytest.mark.xfail. For more info, see the test suite docs. Once your pull request is accepted, send us your address via email or private message on Gitter and we'll mail you stickers.
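For illustration, a contributed test for a known-broken case might look roughly like this (a minimal sketch; the test name and the exact behaviour being checked are made up for this example, not taken from the test suite):

import pytest
import spacy

@pytest.mark.xfail(reason="known bug in the v2.0.0 alpha")
def test_empty_string_does_not_raise():
    # Hypothetical regression test: an empty string should produce an empty Doc
    # rather than raising an exception.
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u'')
    assert len(doc) == 0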

If you can't find anything, don't have time or can't be bothered, that's fine too. Posting your feedback on spaCy v2.0 here counts as well. To be honest, we really just want to mail out stickers πŸ˜‰

kootenpv commented 7 years ago

I honestly just wanted it to work :-)

I had spaCy 1.5 installed on my other machine; I removed it and installed spacy-nightly v2.0.0a0. The import works, but then I tried to download a model with both:

python -m spacy download en
python -m spacy download en_core_web_sm

Both give:

Compatibility error
No compatible models found for v2.0.0a0 of spaCy.

It's most likely that I'm missing something?

EDIT: Indeed, on the release page it says: en_core_web_sm-2.0.0-alpha. You also need to give the --direct flag.

python -m spacy download en_core_web_sm-2.0.0-alpha --direct

Perhaps it is possible to temporarily update the docs for it?

Otherwise: it works!

ines commented 7 years ago

Sorry about that! We originally decided against adding the alpha models to the compatibility table and shortcuts just yet to avoid confusion – but maybe it actually ended up causing more confusion. Just added the models and shortcuts, so in about 5 minutes (which is roughly how long it takes GitHub to clear its cache for raw files), the following commands should work as well:

python -m spacy download en
python -m spacy download xx
kootenpv commented 7 years ago

Another update: I tried parsing 16k headlines. I can parse all of them* and access some common attributes of each of them, including vectors :)

I did notice that on an empty string (one of the 16k headlines*), it now throws an exception; this was not the case in v1.8.2. Probably better to fix that :)

I wanted to do a benchmark against v1.8.2, but the machines are not comparable :( It did feel a lot slower though...

honnibal commented 7 years ago

Thanks!

Try the doc.similarity() if you have a use-case for it? I'm not sure how well this works yet. It's using the tensors learned for the parser, NER and tagger (but no external data). It seems to have some interesting context sensitivity, and in theory it might give useful results --- but it hasn't been optimised for that. So, I'm curious to hear how it does.

http://alpha.spacy.io/docs/usage/word-vectors-similarities
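A minimal sketch of trying it out, assuming the alpha en_core_web_sm model is installed (the example sentences are arbitrary):

import spacy

nlp = spacy.load('en_core_web_sm')
doc1 = nlp(u'The fries were gross.')
doc2 = nlp(u'Worst fries ever.')

# In this alpha, similarity is driven by the tensors learned for the parser,
# tagger and NER rather than by external word vectors.
print(doc1.similarity(doc2))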

buhrmann commented 7 years ago

Hi, very excited about the new, better, smaller and potentially faster(?) spaCy 2.0. I hope to give it a try in the next days. Just one question. According to the (new) docs the embeddings seem to work just as they did before, i.e. external word vectors and their averages for spans and docs. But you also mention the use of tensors for similarity calculations. Is it correct that the vectors are essentially the same, but are not used as such in the similarity calculations anymore? Or are they somehow combined with the internal tensor representations of the documents? In any case thanks for the great work, and I hope to be able to give some useful feedback soon about the Spanish model etc.

honnibal commented 7 years ago

@buhrmann This is inherently a bit confusing, because there are two types of vector representations:

  1. You can import word vectors, as before. The assumption is you'll want to leave these static, with perhaps a trainable projection layer to reduce dimension

  2. The parser, NER, tagger etc learns a small embedding table and a depth-4 convolutional layer, to assign the document a tensor with a row for each token in context.

We're calling type 1 "vectors" and type 2 "tensors". I've designed the neural network models to use a very small embedding table, shared between the parser, tagger and NER. I've also avoided using pre-trained vectors as features. I didn't want the models to depend on, say, the GloVe vectors, because I want to make sure you can load in any arbitrary word vectors without messing up the pipeline.
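To make the distinction concrete, here is a rough sketch of how the two representations show up on the API (assuming the alpha model; the exact dimensions are illustrative):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apples and oranges are similar.')

# Type 2: the tensor computed by the shared pipeline, one row per token in context.
print(doc.tensor.shape)

# Type 1: word vectors attached to the vocab. These aren't fully wired up in the
# alpha yet, so this may be empty or all zeros for now.
print(doc[0].vector)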

buhrmann commented 7 years ago

Thanks, that's clear now. I still had doubts about how the (type 1) vectors and (type 2) tensors are used in similarity calculations, since you mention above the tensors could have interesting properties in this context (something I'm keen to try). I've cleared this up looking at the code and it seems that the tensors for now are only used in similarity calculations when there are no word vectors available (which of course could easily be changed with user hooks).
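For anyone who wants to force tensor-based similarity even when word vectors are present, a rough sketch using user hooks (the averaging-plus-cosine scheme here is just an illustration, not how spaCy computes similarity internally):

import numpy
import spacy

nlp = spacy.load('en_core_web_sm')

def tensor_similarity(doc1, doc2):
    # Average each doc's per-token tensor rows and compare by cosine similarity.
    v1 = doc1.tensor.mean(axis=0)
    v2 = doc2.tensor.mean(axis=0)
    return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))

doc1 = nlp(u'I like apples')
doc2 = nlp(u'I like oranges')
doc1.user_hooks['similarity'] = tensor_similarity
print(doc1.similarity(doc2))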

unography commented 7 years ago

Hi,

I wanted to do a quick benchmark between spaCy v1.8.2 and v2.0.0. First of all, the memory usage is amazingly lower in the new version! The old version's model took approximately 1GB of memory, while the new one takes about 200MB.

However, I noticed that the latest release is using all 8 cores of my machine (100% usage), but it is very, very slow!

I made two separate virtualenvs to make sure the installation was clean. Here is a small script I wrote to test its speed:

import time
import spacy
nlp = spacy.load('en')

def do_lemma(text):
    doc = nlp(text.decode('utf-8'))
    lemma = []
    for token in doc:
        lemma.append(token.lemma_)
    return ' '.join(lemma)

def time_lemma():
    text = 'mangoes bought were nice this time'  # just a stupid sentence
    start = time.time()
    for i in range(1000):
        do_lemma(text)
    end = time.time()
    print end - start

time_lemma()

And for the latest release, the same code with only the model load changed:


import time
import spacy
nlp = spacy.load('en_core_web_sm')

def do_lemma(text):
    doc = nlp(text.decode('utf-8'))
    lemma = []
    for token in doc:
        lemma.append(token.lemma_)
    return ' '.join(lemma)

def time_lemma():
    text = 'mangoes bought were nice this time'  # just a stupid sentence
    start = time.time()
    for i in range(1000):
        do_lemma(text)
    end = time.time()
    print end - start

time_lemma()

The first (v1.8.2) runs in 0.15 seconds while the latest (v2.0.0) took 11.77 seconds to run! Is there something I'm doing wrong in the way I'm using the new model?

honnibal commented 7 years ago

Hm! That's a lot worse than my tests, but in my tests I used the .pipe() method, which lets the model minibatch. This helps to mask the Python overhead a bit. I still think the result you're seeing is much slower than I expect though.

A thought: Could you try setting export OPENBLAS_NUM_THREADS=1 and trying again? If your machine has lots of cores, it could be that the stupid thing tries to load up like 40 threads to do this tiny amount of work per document, and that kills the performance.

unography commented 7 years ago

@honnibal Hi, setting export OPENBLAS_NUM_THREADS=1 surely helped! It avoided that 100% usage but it is still slower than the old guy. Now it takes about 4 seconds to run, way faster than before but still slow.

slavaGanzin commented 7 years ago

I just finished reading documentation for v2.0 and it's way better than for v1.*.

But this export OPENBLAS_NUM_THREADS=1 is new to me. I thought BLAS was only used by numpy to train vectors. Could this be documented?

honnibal commented 7 years ago

@slavaGanzin The neural network model makes lots of calls to numpy.tensordot, which uses blas -- both for training and runtime. I'd like to have set this within the code --- even for my own usage I don't want to micromanage this stupid environment variable! The behaviour of "Spin up 40 threads to compute this tiny matrix multiplication" is one that nobody could want. So, we should figure out how to stop it from happening.
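One way to approximate setting it within the code is to pin the variable from Python before numpy is imported anywhere in the process; a sketch, assuming an OpenBLAS-linked numpy:

import os
# Must run before numpy (and therefore spacy) is imported for the first time.
os.environ['OPENBLAS_NUM_THREADS'] = '1'

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'mangoes bought were nice this time')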

@eldor4do What happens if you use .pipe() as well?

unography commented 7 years ago

@honnibal I'll try with .pipe() once, but in my actual use case I won't be able to use pipe(), it would be more like repeated calls.

honnibal commented 7 years ago

Hopefully the ability to hold more workers in memory compensates a bit.

Btw, the changes to the StringStore are also very useful for multi-processing. The annotations from each worker are now easy to reconcile, because they're stored as hash IDs -- so the annotation encoding no longer depends on the worker's state.

unography commented 7 years ago

@honnibal Yes, that is a plus. Also, I tested the OPENBLAS value on 2 machines: on one it was able to reduce the threads; on the other, a 4-core machine, it failed to do so, with all cores still at 100% usage. Any idea what could be the problem?

honnibal commented 7 years ago

@eldor4do That's annoying. I think it depends on what BLAS numpy is linked to. Is the second machine a mac? If so the relevant library will be Accelerate, not openblas. Maybe there's a numpy API for limiting the thread count?

Appreciate the feedback -- this is good alpha testing :)

kootenpv commented 7 years ago

@honnibal What are the expected differences for your test cases?

anna-hope commented 7 years ago

#1021 is still an issue with this alpha release -- all of the sentences I gave as examples fail to be parsed correctly.

alfonsomhc commented 7 years ago

I get some errors when I run this example from the documentation (https://alpha.spacy.io/docs/usage/lightning-tour#examples-tokens-sentences):

doc = nlp(u"Peach emoji is where it has always been. Peach is the superior "
          u"emoji. It's outranking eggplant πŸ‘ ")

assert doc[0].text == u'Peach'
assert doc[1].text == u'emoji'
assert doc[-1].text == u'πŸ‘'
assert doc[17:19].text == u'outranking eggplant'
assert doc.noun_chunks[0].text == u'Peach emoji'

sentences = list(doc.sents)
assert len(sentences) == 3
assert sentences[0].text == u'Peach is the superior emoji.'

There are two problems: 1) The expression doc.noun_chunks[0].text raises TypeError: 'generator' object is not subscriptable

2) The expression sentences[0].text returns 'Peach emoji is where it has always been.' and therefore the last assertion fails

I'm using python 3.6 (and spacy 2.0 alpha)
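For reference, the generator error can be worked around by materialising the iterator first (a sketch of the docs example with that one change):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Peach emoji is where it has always been. Peach is the superior emoji.')

# noun_chunks is a generator in v2, so it can't be indexed directly.
chunks = list(doc.noun_chunks)
assert chunks[0].text == u'Peach emoji'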

alfonsomhc commented 7 years ago

I also have a problem with the example https://alpha.spacy.io/docs/usage/lightning-tour#examples-pos-tags. This statement fails: assert [apple.pos_, apple.pos] == [u'PROPN', 17049293600679659579], because [apple.pos_, apple.pos] returns ['PROPN', 95]. The rest of the assertions are fine.

alfonsomhc commented 7 years ago

I also have a problem with the example https://alpha.spacy.io/docs/usage/lightning-tour#displacy. The line displacy.serve(doc_ent, style='ent') gives the error: OSError: [Errno 98] Address already in use

I'm running it from Jupyter. I have read the documentation (https://alpha.spacy.io/docs/usage/visualizers#jupyter) and I understand that Jupyter mode should be detected automatically. I tried setting jupyter=True but I got the same error.

If it helps, I'm using Jupyter 5.0.

v3t3a commented 7 years ago

Hi,

I don't know if you consider the following a bug or not, but there is a difference between v2 and v1 in how the matcher is created:

With v1

import spacy

nlp = spacy.load('en_core_web_sm')
matcher = spacy.matcher.Matcher(nlp.vocab)

With v2 and the same code, I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'spacy' has no attribute 'matcher'

Following the new 101 docs, I changed my code to:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = spacy.matcher.Matcher(nlp.vocab)

I think it's a bug because I should be able to access Matcher with just spacy imported. What do you think?

alfonsomhc commented 7 years ago

I also have a problem with the example https://alpha.spacy.io/docs/usage/lightning-tour#examples-word-vectors. The first assert fails because it is false. The second assert line has an error:

File "<ipython-input-5-4d61871a144f>", line 9
    assert apple.has_vector, banana.has_vector, pasta.has_vector, hippo.has_vector
                                              ^
SyntaxError: invalid syntax

In any case, if I run apple.has_vector, banana.has_vector, pasta.has_vector, hippo.has_vector I get (False, False, False, False). This makes me think that the model has no word vectors and therefore the similarities are wrong? I have installed the model that you describe on this page, that is, with this command: python -m spacy download en_core_web_sm-2.0.0-alpha --direct. In the documentation you say:

The default English model installs 300-dimensional vectors trained on the Common Crawl corpus.

So I assume I should have word vectors?

alfonsomhc commented 7 years ago

Another problem, in https://alpha.spacy.io/docs/usage/lightning-tour#multi-threaded: when I run the example I get this error:

...
/home/user/anaconda3/lib/python3.6/site-packages/spacy/_ml.py in forward(docs, drop)
    248         feats = []
    249         for doc in docs:
--> 250             feats.append(doc.to_array(cols))
    251         return feats, None
    252     model = layerize(forward)

AttributeError: 'str' object has no attribute 'to_array'
v3t3a commented 7 years ago

Another one, but in this case it's a difference between the docs and the actual behaviour (as a result there's a difference between v1 and v2, but it may be a choice rather than a bug).

This code

import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = spacy.matcher.Matcher(nlp.vocab)
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
matcher = spacy.matcher.Matcher(nlp.vocab, pattern)

Gives me the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/matcher.pyx", line 184, in spacy.matcher.Matcher.__init__ (spacy/matcher.cpp:5534)
TypeError: __init__() takes exactly 1 positional argument (2 given)

In v1, and in the docs too, it is specified that Matcher.__init__ accepts two arguments: vocab and/or patterns.

Piece of code (Matcher.__init__):

V1

def __init__(self, vocab, patterns={}):
        """
        Create the Matcher.

        Arguments:
            vocab (Vocab):
                The vocabulary object, which must be shared with the documents
                the matcher will operate on.
            patterns (dict): Patterns to add to the matcher.
        Returns:
            The newly constructed object.
        """
        self._patterns = {}
        self._entities = {}
        self._acceptors = {}
        self._callbacks = {}
        self.vocab = vocab
        self.mem = Pool()
        for entity_key, (etype, attrs, specs) in sorted(patterns.items()):
            self.add_entity(entity_key, attrs)
            for spec in specs:
                self.add_pattern(entity_key, spec, label=etype)

V2

def __init__(self, vocab):
        """Create the Matcher.

        vocab (Vocab): The vocabulary object, which must be shared with the
            documents the matcher will operate on.
        RETURNS (Matcher): The newly constructed object.
        """
        self._patterns = {}
        self._entities = {}
        self._acceptors = {}
        self._callbacks = {}
        self.vocab = vocab
        self.mem = Pool()
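For comparison, patterns in v2 are added after construction via Matcher.add rather than passed to __init__. A minimal sketch following the new matcher docs (the rule name 'HelloWorld' is arbitrary, and whether the alpha already supports exactly this call is an assumption):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
# v2-style: add(rule_name, optional on_match callback, one or more patterns)
matcher.add('HelloWorld', None, pattern)

doc = nlp(u'Hello, world!')
matches = matcher(doc)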
ines commented 7 years ago

@alfonsomhc Thanks for the detailed analysis – and sorry, so many stupid typos! Just fixing the peach emoji example.

About displaCy: If you're using displaCy from within a notebook, you should call displacy.render() – after all, you're already running a web server (the notebook server). I'm not actually sure what the expected behaviour of displacy.serve() in Jupyter would be... your error message mostly looks like you're already running something else on displaCy's default port 5000.

Either way, this should probably be more clear in the docs. And maybe displacy.serve should at least print a warning in Jupyter mode that tells the user that starting the webserver is not actually necessary.
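A minimal sketch of the notebook-friendly call (jupyter=True forces notebook rendering in case the automatic detection fails; the example sentence is arbitrary):

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Google bought a startup in London.')

# In a notebook, render() produces the markup inline; no web server is needed.
displacy.render(doc, style='ent', jupyter=True)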

About the vectors: This is currently a bit messy, sorry – the vectors that are supposed to be attached to the vocab aren't wired up correctly yet. Right now, there are only tensors (Doc.tensor), which power the document similarity. Still working on implementing the vectors again – they'll definitely be available in the final release.

Thanks again for your time, this was super valuable!

nikeqiang commented 7 years ago

@ines The displacy visualizer is working great!

In case it's not already updated, here's a minor fix to the sentence collection example in the docs, to account for the fact that render_ents(self, text, spans, title) takes character offsets (rather than token indexes):

original:

match_ents = [{'start': span.start-sent.start, 'end': span.end-sent.start,
                   'label': 'MATCH'}]

revision:

match_ents = [{'start': span.start_char-sent.start_char, 'end': span.end_char-sent.start_char,
                   'label': 'MATCH'}]
testphys commented 7 years ago

It seems to me that the IS_OOV flag is always True so far.

alfonsomhc commented 7 years ago

@ines , thanks for the answer. The visualizer example (https://alpha.spacy.io/docs/usage/lightning-tour#displacy) works correctly if we replace displacy.serve with displacy.render. In addition, we have to pass the parameter jupyter=True. This is similar to what @nikeqiang showed in the previous message. I would suggest some improvements: a) If displacy.serve fails because something is already running on the port, the error could mention that displacy.serve doesn't work with Jupyter. b) When I read the documentation (https://alpha.spacy.io/docs/usage/visualizers#jupyter) it is not clear that one has to pass the parameter jupyter=True. There's nothing about this in the code examples; you need to spot it in the image. Therefore I would suggest a clarification in that documentation section.

ines commented 7 years ago

@testphys Ah, this is related to the vectors not being wired up yet. See my comment above. As soon as this is done, IS_OOV should work as expected.

@nikeqiang Well spotted – will fix that typo! Just out of curiosity, does displaCy also work fine for you in Jupyter if you don't set the jupyter=True argument?

@alfonsomhc So without jupyter=True, render doesn't work for you either? (i.e. it produces raw markup instead of rendered markup?) That's interesting. spaCy tries to detect Jupyter automatically using this logic: https://github.com/explosion/spaCy/issues/1058#issuecomment-301460880. There's always some error potential here, depending on people's environment and setup. This is why there's an additional jupyter argument to force Jupyter rendering.

But I agree, this should be more prominent in the docs. Maybe even as a little "infobox" at the top of the "Using displaCy in Jupyter" section. Additionally, we could also consider having displacy.serve just output the markup and not start the server if Jupyter is detected... I normally don't like these solutions, as it goes against the expected behaviour of a method (like, a method called serve should actually always serve). But in this case, it might actually make things less confusing...

alfonsomhc commented 7 years ago

@ines , that's correct: without jupyter=True, render doesn't work (it returns raw markup). As I mentioned above, I'm running Jupyter 5.0.

egtann commented 7 years ago

Very excited to see a 15MB NN model in place of the big ones used previously! Thank you.

Mostly everything works great, however I'm seeing a surprising output with lemmas on 2.0. For example, given the sample sentence, "I need a hotel room" it outputs the following tokens:

Original, Lemma
I, I
need, ne
a hotel room, a

The lemma of "a hotel room" is "a". Is this intended? In 1.8 the lemma was a hotel room.

kootenpv commented 7 years ago

Evan, how are you merging those words? I thought a lemma is always only a single word...

egtann commented 7 years ago

Ah, the merge is very likely the issue (and probably on my end). Has the behavior changed for Span.merge in 2.0?

for np in doc.noun_chunks:
  np.merge(np.root.tag_, np.text, np.root.ent_type_)
for sent in doc.sents:
  for i, tkn in enumerate(sent):
    print(tkn.text)
    print(tkn.lemma_)
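For reference, the keyword-argument form of merge described in the v2 docs looks roughly like this (a sketch; whether the alpha already handles these attributes correctly is exactly what's in question here):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I need a hotel room')

# Materialise noun_chunks first so the doc isn't mutated while iterating over it.
for np in list(doc.noun_chunks):
    np.merge(tag=np.root.tag_, lemma=np.text, ent_type=np.root.ent_type_)

for token in doc:
    print(token.text, token.lemma_)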
vishnune-kore commented 7 years ago

Hi @honnibal, awesome work! I just want to know when the German model will be available, and the expected final release date for v2.0.

honnibal commented 7 years ago

@vishnunekkanti You could try training a German model yourself with the Universal Dependencies treebank -- the python -m spacy convert and python -m spacy train commands make this pretty easy.

I understand that a release date would be convenient for your planning. It'd be very convenient for our planning too -- so if you somehow find out, please be sure to tell us ;).

More seriously: we can decide what tasks we want to complete for the final release, but we can't decide (or accurately predict) how long those tasks will take. So if we set a fixed release date, the only way to meet it would be to cut tasks.
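A rough sketch of that workflow, assuming a Universal Dependencies corpus in CoNLL-U format (all paths are placeholders, and exact CLI options may differ in the alpha):

# Convert the UD .conllu files into spaCy's JSON training format
python -m spacy convert /ud/de-ud-train.conllu /data
python -m spacy convert /ud/de-ud-dev.conllu /data

# Train a German tagger and parser from the converted data
python -m spacy train de /models /data/de-ud-train.json /data/de-ud-dev.json --n-iter 10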

prashantgpt91 commented 7 years ago

@honnibal @ines @kootenpv @buhrmann I have freshly created a virtualenv in Python 2.7, followed the installation procedure above, and this is the error I got. I am on OSX with no GPU support. How come a CUDA error came up? Or is it something entirely different?

Spacy Env info

spacy-nightly==2.0.0a0 thinc==6.7.3

Info about spaCy

    Python version     2.7.10
    Platform           Darwin-15.6.0-x86_64-i386-64bit
    spaCy version      2.0.0a0
    Location           /Users/prashant/PycharmProjects/neralpha/neralpha/lib/python2.7/site-packages/spacy
    Models
Info about model en_core_web_sm

    lang               en
    pipeline           [u'tensorizer', u'tagger', u'parser', u'ner']
    name               core_web_sm
    license            CC BY-SA 3.0
    author             Explosion AI
    url                https://explosion.ai
    description        English multi-task CNN trained on OntoNotes 5. Assigns context-sensitive token vectors, POS tags, dependeny parse and named entities.
    source             /Users/prashant/PycharmProjects/neralpha/neralpha/lib/python2.7/site-packages/en_core_web_sm
    version            2.0.0a0
    spacy_version      >=2.0.0a0,<3.0.0
    email              contact@explosion.ai
    parent_package     spacy-nightly
Traceback (most recent call last):
  File "tryspacy.py", line 24, in <module>
    time_lemma()
  File "tryspacy.py", line 19, in time_lemma
    do_lemma(text)
  File "tryspacy.py", line 8, in do_lemma
    doc = nlp(text.decode('utf-8'))
  File "/Users/prashant/PycharmProjects/nlp_engine/ner/lib/python2.7/site-packages/spacy/language.py", line 248, in __call__
    doc = proc(doc)
  File "spacy/syntax/nn_parser.pyx", line 317, in spacy.syntax.nn_parser.Parser.__call__ (spacy/syntax/nn_parser.cpp:10513)
  File "spacy/syntax/nn_parser.pyx", line 362, in spacy.syntax.nn_parser.Parser.parse_batch (spacy/syntax/nn_parser.cpp:11361)
  File "/Users/prashant/PycharmProjects/nlp_engine/ner/lib/python2.7/site-packages/spacy/util.py", line 222, in get_cuda_stream
    return CudaStream() if CudaStream is not None else None
  File "/Users/prashant/PycharmProjects/nlp_engine/ner/lib/python2.7/site-packages/cupy/cuda/stream.py", line 110, in __init__
    self.ptr = runtime.streamCreate()
  File "cupy/cuda/runtime.pyx", line 295, in cupy.cuda.runtime.streamCreate (cupy/cuda/runtime.cpp:5507)
  File "cupy/cuda/runtime.pyx", line 298, in cupy.cuda.runtime.streamCreate (cupy/cuda/runtime.cpp:5455)
  File "cupy/cuda/runtime.pyx", line 130, in cupy.cuda.runtime.check_status (cupy/cuda/runtime.cpp:2241)
cupy.cuda.runtime.CUDARuntimeError: cudaErrorInsufficientDriver: CUDA driver version is insufficient for CUDA runtime version
Exception AttributeError: "'Stream' object has no attribute 'ptr'" in <bound method Stream.__del__ of <cupy.cuda.stream.Stream object at 0x1127abbd0>> ignored
tpetmanson commented 7 years ago

Hello. I've been checking out the new spaCy serialization and got it to a state where it sometimes works, but not always. For example, here the first two tests work, but the third one fails:

import os
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab
import en_core_web_sm

base_dir = os.path.dirname(en_core_web_sm.__file__)
meta = en_core_web_sm.get_model_meta(base_dir)
data_dir = '%s_%s-%s' % (meta['lang'], meta['name'], meta['version'])
data_path = os.path.join(base_dir, data_dir, 'vocab')
vocab = Vocab()
vocab = vocab.from_disk(data_path)

nlp = spacy.load('en_core_web_sm')

def test(text):
    doc = nlp(text)
    doc2 = Doc(vocab)
    doc2.from_bytes(doc.to_bytes())
    print (doc2)
    print ([t.lemma_ for t in doc2])

test('walking dead are coming')
test('random 8asdf iuahsdfiuhaesf iuhasdfiu h4')
test('Another text. With two. Or more. Sentences. Timo Petmanson')

outputs:

walking dead are coming
['walk', 'dead', 'be', 'come']
random 8asdf iuahsdfiuhaesf iuhasdfiu h4
['random', '8asdf', 'iuahsdfiuhaesf', 'iuhasdfiu', 'h4']
Another text. With two. Or more. Sentences. Timo Petmanson

and then crashes with

    print ([t.lemma_ for t in doc2])
  File "spacy/tokens/token.pyx", line 608, in spacy.tokens.token.Token.lemma_.__get__ (spacy/tokens/token.cpp:12151)
  File "spacy/strings.pyx", line 122, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:2542)
KeyError: 2664285347204041025

Is there an easy way around this?

slavaGanzin commented 7 years ago

spacy-nightly is 26 days old, and master has some critical (for me :) ) bugfixes. May I ask for an update?

P.S. I compiled it from source already, but it would be awesome if the nightly builds were fresher anyway.

mollerhoj commented 7 years ago

It would be really nice to have an example of the new JSON format with values (on this page: https://alpha.spacy.io/docs/api/annotation#json-input).

mraduldubey commented 7 years ago

Hello! I was wondering what happened to the Vocab.load_vectors and Vocab.load_vectors_from_bin_loc methods? I can't seem to find them in the spaCy 2.0 docs. They also had a bug where loading custom-trained word vectors from a file did not add the words to the vocabulary. I am trying to hook word vectors from a Word2Vec model trained over a StackOverflow data dump into spaCy.

rajeee commented 7 years ago

Trying to install on Windows (using anaconda prompt) throws the following error. Any pointers, please?

pip install spacy-nightly
... Creating library build\temp.win-amd64-3.6\Release\thinc/neural\gpu_ops.cp36-win_amd64.lib and object build\temp.win-amd64-3.6\Release\thinc/neural\gpu_ops.cp36-win_amd64.exp
gpu_ops.obj : error LNK2001: unresolved external symbol "void __cdecl gpu_max_pool(float *, int *, float const *, int const *, int, int, int)" (?gpu_max_pool@@YAXPEAMPEAHPEBMPEBHHHH@Z)
gpu_ops.obj : error LNK2001: unresolved external symbol "void __cdecl gpu_backprop_max_pool(float *, float const *, int const *, int const *, int, int, int)" (?gpu_backprop_max_pool@@YAXPEAMPEBMPEBH2HHH@Z)
gpu_ops.obj : error LNK2001: unresolved external symbol "void __cdecl gpu_mean_pool(float *, float const *, int const *, int, int, int)" (?gpu_mean_pool@@YAXPEAMPEBMPEBHHHH@Z)
gpu_ops.obj : error LNK2001: unresolved external symbol "void __cdecl gpu_backprop_mean_pool(float *, float const *, int const *, int, int, int)" (?gpu_backprop_mean_pool@@YAXPEAMPEBMPEBHHHH@Z)
gpu_ops.obj : error LNK2001: unresolved external symbol "void __cdecl gpu_hash_data(char *, char const *, unsigned __int64, unsigned __int64, unsigned __int64, unsigned int)" (?gpu_hash_data@@YAXPEADPEBD_K22I@Z)
build\lib.win-amd64-3.6\thinc\neural\gpu_ops.cp36-win_amd64.pyd : fatal error LNK1120: 5 unresolved externals
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe' failed with exit status 1120


Failed building wheel for thinc
Running setup.py clean for thinc
Failed to build spacy-nightly thinc
Installing collected packages: thinc, spacy-nightly
Found existing installation: thinc 6.5.2
Uninstalling thinc-6.5.2:
Successfully uninstalled thinc-6.5.2
Running setup.py install for thinc ... error
Complete output from command C:\Users\Rajendra\Anaconda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\Users\Rajendra\AppData\Local\Temp\pip-build-4yeqrukp\thinc\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\Rajendra\AppData\Local\Temp\pip-xh4_o8vg-record\install-record.txt --single-version-externally-managed --compile:
Warning: The nvcc binary could not be located in your $PATH. For GPU capability, either add it to your path, or set $CUDA_HOME
running install
running build
running build_py
running build_ext
building 'thinc.linalg' extension
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\Rajendra\Anaconda3\include -IC:\Users\Rajendra\AppData\Local\Temp\pip-build-4yeqrukp\thinc\include -IC:\Users\Rajendra\Anaconda3\include -IC:\Users\Rajendra\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /EHsc /Tpthinc/linalg.cpp /Fobuild\temp.win-amd64-3.6\Release\thinc/linalg.obj gcc nvcc
cl : Command line warning D9024 : unrecognized source file type 'gcc', object file assumed
cl : Command line warning D9027 : source file 'gcc' ignored
cl : Command line warning D9024 : unrecognized source file type 'nvcc', object file assumed
cl : Command line warning D9027 : source file 'nvcc' ignored
linalg.cpp
c1xx: fatal error C1083: Cannot open source file: 'thinc/linalg.cpp': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2

----------------------------------------

Rolling back uninstall of thinc
Command "C:\Users\Rajendra\Anaconda3\python.exe -u -c "import setuptools, tokenize;__file__='C:\Users\Rajendra\AppData\Local\Temp\pip-build-4yeqrukp\thinc\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\Rajendra\AppData\Local\Temp\pip-xh4_o8vg-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\Rajendra\AppData\Local\Temp\pip-build-4yeqrukp\thinc\

honnibal commented 7 years ago

@mraduldubey Word vectors are unfortunately the main "known broken" thing at the moment. Have a look at spacy/vectors.pyx . You might be able to fill it in?

mraduldubey commented 7 years ago

@honnibal I'll look into it.

karthikmurugadoss commented 7 years ago

Hi @honnibal, Amazing progress so far on v2! I'm looking forward to the release.

I'm unable to get the nlp.pipe method to run (as shown in code below):

import spacy
nlp = spacy.load('en_core_web_sm')  
# nlp = spacy.load('en')  # For spacy v1 --> Works perfectly
text_file = 'documents.txt'
with open(text_file, 'r') as f:
    texts = f.readlines()
for doc in nlp.pipe(texts, n_threads=16, batch_size=100):
    assert doc.is_parsed

This is the error I get on spacy v2:

/home/user/anaconda3/lib/python3.5/site-packages/spacy/_ml.py in forward(docs, drop)
    248         feats = []
    249         for doc in docs:
--> 250             feats.append(doc.to_array(cols))
    251         return feats, None
    252     model = layerize(forward)

AttributeError: 'str' object has no attribute 'to_array'

@alfonsomhc had highlighted the same error in a previous comment in this thread. Please let me know if this has been addressed or if there's something I'm missing.

honnibal commented 7 years ago

@karthikmurugadoss @alfonsomhc

This was a dumb problem -- I forgot to create the Doc objects inside the .pipe() method, so it's currently expecting Doc objects rather than strings. The bug is fixed on the develop branch, and a new version should be pushed to nightly soon. In the meantime, try this:

docs = nlp.pipe((nlp.make_doc(text) for text in texts))
honnibal commented 7 years ago

@slavaGanzin Agreed --- will have an update out soon over the weekend. I'd hoped to circle back to this sooner, so thanks for your patience :)

vishnune-kore commented 7 years ago

Hi @ines @honnibal, "GPU utilization in nvidia-settings showing 0 for the NER training example": I am trying to run spaCy v2 with a GPU (GeForce 840M). gpu_ops are built, and I executed the sample NER training code at https://alpha.spacy.io/docs/usage/training-ner#example. It runs perfectly, but the GPU utilization is 0. I checked the CUDA installation, the path, and whether gpu_ops were built. Is it not supposed to use the GPU, or could I be missing something?

honnibal commented 7 years ago

You'll need to pass device=0 to the nlp.begin_training method, to use GPU 0.
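A minimal sketch of that change in the training script (other begin_training arguments from the docs example are omitted here, so treat the exact call as an approximation):

import spacy

nlp = spacy.load('en_core_web_sm')
# device=0 selects the first GPU; the rest of the training loop stays the same.
optimizer = nlp.begin_training(device=0)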

vishnune-kore commented 7 years ago

Thanks @honnibal, that did it.