I honestly just wanted it to work :-)
I had spaCy 1.5 installed on my other machine, so I removed it and installed spacy-nightly v2.0.0a0. It imports fine, but then I tried to download a model with both:
python -m spacy download en
python -m spacy download en_core_web_sm
Both give:
Compatibility error
No compatible models found for v2.0.0a0 of spaCy.
It's most likely that I'm missing something?
EDIT: Indeed, on the release page it says en_core_web_sm-2.0.0-alpha. You also need to give the --direct flag:
python -m spacy download en_core_web_sm-2.0.0-alpha --direct
Perhaps it is possible to temporarily update the docs for it?
Otherwise: it works!
Sorry about that! We originally decided against adding the alpha models to the compatibility table and shortcuts just yet to avoid confusion – but maybe it actually ended up causing more confusion. Just added the models and shortcuts, so in about 5 minutes (which is roughly how long it takes GitHub to clear its cache for raw files), the following commands should work as well:
python -m spacy download en
python -m spacy download xx
Another update: I tried parsing 16k headlines. I can parse all of them* and access some common attributes of each of them, including vectors :)
I did notice that on an empty string (one of the headlines*), it now throws an exception; this was not the case in v1.8.2. Probably better to fix that :)
I wanted to do a benchmark against v1.8.2, but the machines are not comparable :( It did feel a lot slower though...
Thanks!
Try doc.similarity() if you have a use-case for it? I'm not sure how well this works yet. It's using the tensors learned for the parser, NER and tagger (but no external data). It seems to have some interesting context sensitivity, and in theory it might give useful results --- but it hasn't been optimised for that. So, I'm curious to hear how it does.
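Here's a minimal sketch of the kind of test I mean (the sentences are arbitrary and the score is just whatever the alpha model produces; treat the numbers as experimental):

import spacy

nlp = spacy.load('en_core_web_sm')
doc1 = nlp(u'The new phone has a great camera.')
doc2 = nlp(u'This camera takes beautiful pictures.')
# Similarity is a float; higher should mean "more similar" in context.
print(doc1.similarity(doc2))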
Hi, very excited about the new, better, smaller and potentially faster(?) spaCy 2.0. I hope to give it a try in the next few days. Just one question: according to the (new) docs, the embeddings seem to work just as they did before, i.e. external word vectors and their averages for spans and docs. But you also mention the use of tensors for similarity calculations. Is it correct that the vectors are essentially the same, but are not used as such in the similarity calculations anymore? Or are they somehow combined with the internal tensor representations of the documents? In any case, thanks for the great work, and I hope to be able to give some useful feedback soon about the Spanish model etc.
@buhrmann This is inherently a bit confusing, because there are two types of vector representations:
1) You can import word vectors, as before. The assumption is you'll want to leave these static, with perhaps a trainable projection layer to reduce dimension.
2) The parser, NER, tagger etc. learn a small embedding table and a depth-4 convolutional layer, to assign the document a tensor with a row for each token in context.
We're calling type 1 "vectors", and type 2 "tensors". I've designed the neural network models to use a very small embedding table, shared between the parser, tagger and NER. I've also avoided using pre-trained vectors as features. I didn't want the models to depend on, say, the GloVe vectors, because I want to make sure you can load in any arbitrary word vectors without messing up the pipeline.
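To make the distinction concrete, here's a rough sketch (the attribute names are the real ones discussed in this thread, but the shapes and the has_vector behaviour in this alpha are assumptions on my part):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apples and oranges are fruit.')

# Type 2: the context-sensitive tensor from the shared CNN --
# one row per token, so the shape is (number of tokens, tensor width).
print(doc.tensor.shape)

# Type 1: static word vectors attached to the vocab. In this alpha they
# aren't wired up yet, so has_vector may simply be False for now.
print(doc[0].has_vector)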
Thanks, that's clear now. I still had doubts about how the (type 1) vectors and (type 2) tensors are used in similarity calculations, since you mention above the tensors could have interesting properties in this context (something I'm keen to try). I've cleared this up looking at the code and it seems that the tensors for now are only used in similarity calculations when there are no word vectors available (which of course could easily be changed with user hooks).
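For anyone who wants to experiment with tensor-based similarity even when vectors are present, a hypothetical user hook along these lines should work (the user_hooks dict is the documented mechanism; the averaging and cosine below are just my own quick choice, not spaCy's built-in logic):

import numpy
import spacy

nlp = spacy.load('en_core_web_sm')

def tensor_similarity(doc, other):
    # Average the per-token tensor rows and compare with cosine similarity.
    a = doc.tensor.mean(axis=0)
    b = other.tensor.mean(axis=0)
    return numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))

doc1 = nlp(u'I like cats.')
doc2 = nlp(u'I like dogs.')
doc1.user_hooks['similarity'] = tensor_similarity
print(doc1.similarity(doc2))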
Hi,
I wanted to do a quick benchmark between spaCy v1.8.2 and v2.0.0. First of all, memory usage is amazingly lower in the new version! The old version's model took approximately 1GB of memory, while the new one takes about 200MB.
However, I noticed that the latest release is using all 8 cores of my machine (100% usage), but it is remarkably slow!
I made two separate virtualenvs to make sure the installation was clean. This is a small script I wrote to test its speed:
import time
import spacy

nlp = spacy.load('en')

def do_lemma(text):
    doc = nlp(text.decode('utf-8'))
    lemma = []
    for token in doc:
        lemma.append(token.lemma_)
    return ' '.join(lemma)

def time_lemma():
    text = 'mangoes bought were nice this time'  # just a stupid sentence
    start = time.time()
    for i in range(1000):
        do_lemma(text)
    end = time.time()
    print end - start

time_lemma()
And for the latest release, the same code with only the model load changed:
import time
import spacy

nlp = spacy.load('en_core_web_sm')

def do_lemma(text):
    doc = nlp(text.decode('utf-8'))
    lemma = []
    for token in doc:
        lemma.append(token.lemma_)
    return ' '.join(lemma)

def time_lemma():
    text = 'mangoes bought were nice this time'  # just a stupid sentence
    start = time.time()
    for i in range(1000):
        do_lemma(text)
    end = time.time()
    print end - start

time_lemma()
The first (v1.8.2) runs in 0.15 seconds, while the latest (v2.0.0) took 11.77 seconds to run!
Is there something I'm doing wrong in the way I'm using the new model?
Hm! That's a lot worse than my tests, but in my tests I used the .pipe() method, which lets the model minibatch. This helps to mask the Python overhead a bit. I still think the result you're seeing is much slower than I expect, though.
A thought: could you try setting export OPENBLAS_NUM_THREADS=1 and trying again? If your machine has lots of cores, it could be that the stupid thing tries to load up like 40 threads to do this tiny amount of work per document, and that kills the performance.
@honnibal Hi, setting export OPENBLAS_NUM_THREADS=1 certainly helped! It avoided that 100% usage, but it is still slower than the old guy. Now it takes about 4 seconds to run, way faster than before but still slow.
I just finished reading the documentation for v2.0 and it's way better than for v1.*.
But this export OPENBLAS_NUM_THREADS=1 is new to me. I thought BLAS was only used by numpy to train vectors.
Could this be documented?
@slavaGanzin The neural network model makes lots of calls to numpy.tensordot, which uses BLAS -- both for training and runtime. I'd like to have set this within the code --- even for my own usage I don't want to micromanage this stupid environment variable! The behaviour of "spin up 40 threads to compute this tiny matrix multiplication" is one that nobody could want. So, we should figure out how to stop it from happening.
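One stopgap that seems to work is setting the variable from Python before numpy is imported for the first time (a sketch, assuming nothing else in the process has imported numpy earlier):

import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'  # must be set before numpy initialises BLAS

import spacy  # importing spacy pulls in numpy, which now sees the thread limit
nlp = spacy.load('en_core_web_sm')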
@eldor4do What happens if you use .pipe() as well?
@honnibal I'll try with .pipe() once, but in my actual use case I won't be able to use pipe(); it would be more like repeated calls.
Hopefully the ability to hold more workers in memory compensates a bit.
Btw, the changes to the StringStore are also very useful for multi-processing. The annotations from each worker are now easy to reconcile, because they're stored as hash IDs -- so the annotation encoding no longer depends on the worker's state.
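A small illustration of what that means in practice (the sentence is arbitrary; the point is that the integer attributes are stable hashes you can resolve through any StringStore that has seen the string):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'London is big.')

token = doc[0]
print(token.orth)                     # a 64-bit hash ID, not a per-process index
print(nlp.vocab.strings[token.orth])  # resolves back to u'London'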
@honnibal Yes, that is a plus. Also, I tested the OPENBLAS value on 2 machines: on one it was able to reduce the threads; on the other, a 4-core machine, it failed to do so. Still all at 100% usage. Any idea what could be the problem?
@eldor4do That's annoying. I think it depends on what BLAS numpy is linked to. Is the second machine a mac? If so the relevant library will be Accelerate, not openblas. Maybe there's a numpy API for limiting the thread count?
Appreciate the feedback -- this is good alpha testing :)
@honnibal What are the expected differences for your test cases?
I get some errors when I run this example from the documentation (https://alpha.spacy.io/docs/usage/lightning-tour#examples-tokens-sentences):
doc = nlp(u"Peach emoji is where it has always been. Peach is the superior "
u"emoji. It's outranking eggplant π ")
assert doc[0].text == u'Peach'
assert doc[1].text == u'emoji'
assert doc[-1].text == u'🍑'
assert doc[17:19].text == u'outranking eggplant'
assert doc.noun_chunks[0].text == u'Peach emoji'
sentences = list(doc.sents)
assert len(sentences) == 3
assert sentences[0].text == u'Peach is the superior emoji.'
There are two problems:
1) The expression doc.noun_chunks[0].text raises the error TypeError: 'generator' object is not subscriptable
2) The expression sentences[0].text returns 'Peach emoji is where it has always been.' and therefore the last assertion fails
I'm using python 3.6 (and spacy 2.0 alpha)
I have also a problem with the example: https://alpha.spacy.io/docs/usage/lightning-tour#examples-pos-tags
This statement fails:
assert [apple.pos_, apple.pos] == [u'PROPN', 17049293600679659579]
because [apple.pos_, apple.pos] returns ['PROPN', 95]
The rest of the assertions are fine.
I have also a problem with the example https://alpha.spacy.io/docs/usage/lightning-tour#displacy
The line
displacy.serve(doc_ent, style='ent')
gives the error:
OSError: [Errno 98] Address already in use
I'm running it from Jupyter. I have read the documentation (https://alpha.spacy.io/docs/usage/visualizers#jupyter) and I understand that Jupyter mode should be detected automatically. I tried setting jupyter=True but I got the same error.
If it helps, I'm using Jupyter 5.0.
Hi,
I don't know if you consider the following a bug or not, but there is a difference between v2 and v1 when creating a matcher:
With v1
import spacy
nlp = spacy.load('en_core_web_sm')
matcher = spacy.matcher.Matcher(nlp.vocab)
With v2 and the same code, I got this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'spacy' has no attribute 'matcher'
Following the new 101 docs, I changed my code to:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = spacy.matcher.Matcher(nlp.vocab)
I think it's a bug because I should be able to access Matcher with just spacy imported. What do you think about it?
I have also a problem with the example https://alpha.spacy.io/docs/usage/lightning-tour#examples-word-vectors The first assert fails because it is false. The second assert line has an error:
File "<ipython-input-5-4d61871a144f>", line 9
assert apple.has_vector, banana.has_vector, pasta.has_vector, hippo.has_vector
^
SyntaxError: invalid syntax
In any case, if I run this:
apple.has_vector, banana.has_vector, pasta.has_vector, hippo.has_vector
I get
(False, False, False, False)
This makes me think that the model has no word vectors and therefore the similarities are wrong? I have installed the model that you describe on this page, that is, with this command:
python -m spacy download en_core_web_sm-2.0.0-alpha --direct
In the documentation you say: "The default English model installs 300-dimensional vectors trained on the Common Crawl corpus." So then I assume I should have word vectors?
Another problem in https://alpha.spacy.io/docs/usage/lightning-tour#multi-threaded When I run the example I get this error:
...
/home/user/anaconda3/lib/python3.6/site-packages/spacy/_ml.py in forward(docs, drop)
248 feats = []
249 for doc in docs:
--> 250 feats.append(doc.to_array(cols))
251 return feats, None
252 model = layerize(forward)
AttributeError: 'str' object has no attribute 'to_array'
Another one, but in this case it's a difference between the docs and the actual behaviour (so this is a difference between v1 and v2, but it may be a choice rather than a bug).
This code
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = spacy.matcher.Matcher(nlp.vocab)
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
matcher = spacy.matcher.Matcher(nlp.vocab, pattern)
gives me the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "spacy/matcher.pyx", line 184, in spacy.matcher.Matcher.__init__ (spacy/matcher.cpp:5534)
TypeError: __init__() takes exactly 1 positional argument (2 given)
In v1, and in the docs too, it is specified that Matcher.__init__ accepts two arguments, vocab and/or patterns:
def __init__(self, vocab, patterns={}):
    """
    Create the Matcher.

    Arguments:
        vocab (Vocab):
            The vocabulary object, which must be shared with the documents
            the matcher will operate on.
        patterns (dict): Patterns to add to the matcher.
    Returns:
        The newly constructed object.
    """
    self._patterns = {}
    self._entities = {}
    self._acceptors = {}
    self._callbacks = {}
    self.vocab = vocab
    self.mem = Pool()
    for entity_key, (etype, attrs, specs) in sorted(patterns.items()):
        self.add_entity(entity_key, attrs)
        for spec in specs:
            self.add_pattern(entity_key, spec, label=etype)
Whereas in v2 it only takes the vocab:

def __init__(self, vocab):
    """Create the Matcher.

    vocab (Vocab): The vocabulary object, which must be shared with the
        documents the matcher will operate on.
    RETURNS (Matcher): The newly constructed object.
    """
    self._patterns = {}
    self._entities = {}
    self._acceptors = {}
    self._callbacks = {}
    self.vocab = vocab
    self.mem = Pool()
@alfonsomhc Thanks for the detailed analysis – and sorry, so many stupid typos! Just fixing the peach emoji example.
About displaCy: if you're using displaCy from within a notebook, you should call displacy.render() – after all, you're already running a web server (the notebook server). I'm not actually sure what the expected behaviour of displacy.serve() in Jupyter would be... your error message mostly looks like you're already running something else on displaCy's default port 5000.
Either way, this should probably be more clear in the docs. And maybe displacy.serve should at least print a warning in Jupyter mode that tells the user that starting the webserver is not actually necessary.
About the vectors: this is currently a bit messy, sorry – the vectors that are supposed to be attached to the vocab aren't wired up correctly yet. Right now, there are only tensors (Doc.tensor), which power the document similarity. Still working on implementing the vectors again – they'll definitely be available in the final release.
Thanks again for your time, this was super valuable!
@ines The displacy visualizer is working great!
In case it's not already updated, here's a minor fix to the sentence collection example in the docs, to account for the fact that render_ents(self, text, spans, title) takes character offsets (rather than token indexes):
original:
match_ents = [{'start': span.start-sent.start, 'end': span.end-sent.start,
'label': 'MATCH'}]
revision:
match_ents = [{'start': span.start_char-sent.start_char, 'end': span.end_char-sent.start_char,
'label': 'MATCH'}]
It seems to me that the IS_OOV flag is always True so far.
@ines, thanks for the answer. The visualizer example (https://alpha.spacy.io/docs/usage/lightning-tour#displacy) works correctly if we replace displacy.serve with displacy.render. In addition, we have to pass the parameter jupyter=True. This is similar to what @nikeqiang showed in the previous message.
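For reference, this is the combination that works for me in the notebook (the sentence is just an arbitrary example):

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Google was founded in California.')
displacy.render(doc, style='ent', jupyter=True)  # renders inline instead of starting a server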
I would suggest some improvements:
a) If displacy.serve fails because the port is already in use, the error could mention that displacy.serve doesn't work with Jupyter.
b) When I read the documentation (https://alpha.spacy.io/docs/usage/visualizers#jupyter) it is not clear that one has to pass the parameter jupyter=True. There's nothing about this in the code; you need to spot it in the image... Therefore I would suggest a clarification in that documentation section.
@testphys Ah, this is related to the vectors not being wired up yet. See my comment above. As soon as this is done, IS_OOV should work as expected.
@nikeqiang Well spotted – will fix that typo! Just out of curiosity, does displaCy also work fine for you in Jupyter if you don't set the jupyter=True argument?
@alfonsomhc So without jupyter=True, render doesn't work for you either? (i.e. it produces raw markup instead of rendered markup?) That's interesting. spaCy tries to detect Jupyter automatically using this logic: https://github.com/explosion/spaCy/issues/1058#issuecomment-301460880 There's always some error potential here, depending on people's environment and setup. This is why there's an additional jupyter argument to force Jupyter rendering.
But I agree, this should be more prominent in the docs. Maybe even as a little "infobox" at the top of the "Using displaCy in Jupyter" section. Additionally, we could also consider having displacy.serve just output the markup and not start the server if Jupyter is detected... I normally don't like these solutions, as it goes against the expected behaviour of a method (like, a method called serve should actually always serve). But in this case, it might actually make things less confusing...
@ines, that's correct: without jupyter=True, render doesn't work (it returns raw markup). As I mentioned above, I'm running Jupyter 5.0.
Very excited to see a 15MB NN model in place of the big ones used previously! Thank you.
Mostly everything works great, however I'm seeing a surprising output with lemmas on 2.0. For example, given the sample sentence, "I need a hotel room" it outputs the following tokens:
Original, Lemma
I, I
need, ne
a hotel room, a
The lemma of "a hotel room" is "a". Is this intended? In 1.8 the lemma was a hotel room.
Evan, how are you merging those words? I thought a lemma is always only a single word...
Ah, the merge is very likely the issue (and probably on my end). Has the behavior changed for Span.merge in 2.0?
for np in doc.noun_chunks:
    np.merge(np.root.tag_, np.text, np.root.ent_type_)

for sent in doc.sents:
    for i, tkn in enumerate(sent):
        print(tkn.text)
        print(tkn.lemma_)
Hi @honnibal, awesome work! I just want to know when the German model will be available, and the expected final release week/day for v2.0.
@vishnunekkanti You could try training a German model yourself with the Universal Dependencies treebank -- the python -m spacy convert and python -m spacy train commands make this pretty easy.
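Roughly, the workflow looks like this (the paths are placeholders and the exact arguments may differ in the alpha, so check python -m spacy convert --help and python -m spacy train --help):

python -m spacy convert /path/to/de-ud-train.conllu /output/dir
python -m spacy convert /path/to/de-ud-dev.conllu /output/dir
python -m spacy train de /output/model /output/dir/de-ud-train.json /output/dir/de-ud-dev.json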
I understand that a release date would be convenient for your planning. It'd be very convenient for our planning too -- so if you somehow find out, please be sure to tell us ;).
More seriously: we can decide what tasks we want to complete for the final release, but we can't decide (or accurately predict) how long those tasks will take. So if we set a fixed release date, the only way to meet it would be to cut tasks.
@honnibal @ines @kootenpv @buhrmann
I have freshly created a virtualenv in Python 2.7, followed the installation procedure above, and this is the error I got. I am on OSX with no GPU support. How come a CUDA error came up? Or is it something entirely different?
Spacy Env info
spacy-nightly==2.0.0a0
thinc==6.7.3
Info about spaCy
Python version 2.7.10
Platform Darwin-15.6.0-x86_64-i386-64bit
spaCy version 2.0.0a0
Location /Users/prashant/PycharmProjects/neralpha/neralpha/lib/python2.7/site-packages/spacy
Models
Info about model en_core_web_sm
lang en
pipeline [u'tensorizer', u'tagger', u'parser', u'ner']
name core_web_sm
license CC BY-SA 3.0
author Explosion AI
url https://explosion.ai
description English multi-task CNN trained on OntoNotes 5. Assigns context-sensitive token vectors, POS tags, dependeny parse and named entities.
source /Users/prashant/PycharmProjects/neralpha/neralpha/lib/python2.7/site-packages/en_core_web_sm
version 2.0.0a0
spacy_version >=2.0.0a0,<3.0.0
email contact@explosion.ai
parent_package spacy-nightly
Traceback (most recent call last):
File "tryspacy.py", line 24, in <module>
time_lemma()
File "tryspacy.py", line 19, in time_lemma
do_lemma(text)
File "tryspacy.py", line 8, in do_lemma
doc = nlp(text.decode('utf-8'))
File "/Users/prashant/PycharmProjects/nlp_engine/ner/lib/python2.7/site-packages/spacy/language.py", line 248, in __call__
doc = proc(doc)
File "spacy/syntax/nn_parser.pyx", line 317, in spacy.syntax.nn_parser.Parser.__call__ (spacy/syntax/nn_parser.cpp:10513)
File "spacy/syntax/nn_parser.pyx", line 362, in spacy.syntax.nn_parser.Parser.parse_batch (spacy/syntax/nn_parser.cpp:11361)
File "/Users/prashant/PycharmProjects/nlp_engine/ner/lib/python2.7/site-packages/spacy/util.py", line 222, in get_cuda_stream
return CudaStream() if CudaStream is not None else None
File "/Users/prashant/PycharmProjects/nlp_engine/ner/lib/python2.7/site-packages/cupy/cuda/stream.py", line 110, in __init__
self.ptr = runtime.streamCreate()
File "cupy/cuda/runtime.pyx", line 295, in cupy.cuda.runtime.streamCreate (cupy/cuda/runtime.cpp:5507)
File "cupy/cuda/runtime.pyx", line 298, in cupy.cuda.runtime.streamCreate (cupy/cuda/runtime.cpp:5455)
File "cupy/cuda/runtime.pyx", line 130, in cupy.cuda.runtime.check_status (cupy/cuda/runtime.cpp:2241)
cupy.cuda.runtime.CUDARuntimeError: cudaErrorInsufficientDriver: CUDA driver version is insufficient for CUDA runtime version
Exception AttributeError: "'Stream' object has no attribute 'ptr'" in <bound method Stream.__del__ of <cupy.cuda.stream.Stream object at 0x1127abbd0>> ignored
Hello. I've been checking out the new spaCy serialization and got it to a state where it sometimes works, but not always. For example, here the first two tests work, but the third one fails:
import os
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab
import en_core_web_sm
base_dir = os.path.dirname(en_core_web_sm.__file__)
meta = en_core_web_sm.get_model_meta(base_dir)
data_dir = '%s_%s-%s' % (meta['lang'], meta['name'], meta['version'])
data_path = os.path.join(base_dir, data_dir, 'vocab')
vocab = Vocab()
vocab = vocab.from_disk(data_path)
nlp = spacy.load('en_core_web_sm')
def test(text):
    doc = nlp(text)
    doc2 = Doc(vocab)
    doc2.from_bytes(doc.to_bytes())
    print(doc2)
    print([t.lemma_ for t in doc2])
test('walking dead are coming')
test('random 8asdf iuahsdfiuhaesf iuhasdfiu h4')
test('Another text. With two. Or more. Sentences. Timo Petmanson')
outputs:
walking dead are coming
['walk', 'dead', 'be', 'come']
random 8asdf iuahsdfiuhaesf iuhasdfiu h4
['random', '8asdf', 'iuahsdfiuhaesf', 'iuhasdfiu', 'h4']
Another text. With two. Or more. Sentences. Timo Petmanson
and then crashes with
print ([t.lemma_ for t in doc2])
File "spacy/tokens/token.pyx", line 608, in spacy.tokens.token.Token.lemma_.__get__ (spacy/tokens/token.cpp:12151)
File "spacy/strings.pyx", line 122, in spacy.strings.StringStore.__getitem__ (spacy/strings.cpp:2542)
KeyError: 2664285347204041025
Is there an easy way around this?
spacy-nightly is 26 days old, and master has some critical (for me :)) bugfixes. May I ask for an update?
P.S. I compiled it from source already, but it would be awesome if the nightly builds were fresher anyway.
It would be really nice to have an example of the new JSON format with values. (On this page: https://alpha.spacy.io/docs/api/annotation#json-input)
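Something along these lines is what I'd expect, based on the format described so far (the values are made up and the "head" offsets are relative to the token, so treat this as an illustration rather than an official example):

[{
    "id": 0,
    "paragraphs": [{
        "raw": "I like London.",
        "sentences": [{
            "tokens": [
                {"id": 0, "orth": "I", "tag": "PRP", "head": 1, "dep": "nsubj", "ner": "O"},
                {"id": 1, "orth": "like", "tag": "VBP", "head": 0, "dep": "ROOT", "ner": "O"},
                {"id": 2, "orth": "London", "tag": "NNP", "head": -1, "dep": "dobj", "ner": "U-GPE"},
                {"id": 3, "orth": ".", "tag": ".", "head": -2, "dep": "punct", "ner": "O"}
            ]
        }]
    }]
}]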
Hello! I was wondering what happened to the Vocab.load_vectors and Vocab.load_vectors_from_bin_loc methods? I could not seem to find them in the spaCy 2.0 docs. They also had a bug where loading custom trained word vectors from a file did not add the words to the vocabulary. I am trying to hook word vectors from a Word2Vec model trained over a StackOverflow data dump into spaCy.
Trying to install on Windows (using anaconda prompt) throws the following error. Any pointers, please?
pip install spacy-nightly
... Creating library build\temp.win-amd64-3.6\Release\thinc/neural\gpu_ops.cp36-win_amd64.lib and object build\temp.win-amd64-3.6\Release\thinc/neural\gpu_ops.cp36-win_amd64.exp gpu_ops.obj : error LNK2001: unresolved external symbol "void cdecl gpu_max_pool(float ,int ,float const ,int const ,int,int,int)" (?gpu_max_pool@@YAXPEAMPEAHPEBMPEBHHHH@Z) gpu_ops.obj : error LNK2001: unresolved external symbol "void __cdecl gpu_backprop_max_pool(float ,float const ,int const ,int const ,int,int,int)" (?gpu_backprop_max_pool@@YAXPEAMPEBMPEBH2HHH@Z) gpu_ops.obj : error LNK2001: unresolved external symbol "void cdecl gpu_mean_pool(float ,float const ,int const ,int,int,int)" (?gpu_mean_pool@@YAXPEAMPEBMPEBHHHH@Z) gpu_ops.obj : error LNK2001: unresolved external symbol "void __cdecl gpu_backprop_mean_pool(float ,float const ,int const ,int,int,int)" (?gpu_backprop_mean_pool@@YAXPEAMPEBMPEBHHHH@Z) gpu_ops.obj : error LNK2001: unresolved external symbol "void cdecl gpu_hash_data(char ,char const ,unsigned int64,unsigned int64,unsigned int64,unsigned int)" (?gpu_hash_data@@YAXPEADPEBD_K22I@Z) build\lib.win-amd64-3.6\thinc\neural\gpu_ops.cp36-win_amd64.pyd : fatal error LNK1120: 5 unresolved externals error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe' failed with exit status 1120
Failed building wheel for thinc Running setup.py clean for thinc Failed to build spacy-nightly thinc Installing collected packages: thinc, spacy-nightly Found existing installation: thinc 6.5.2 Uninstalling thinc-6.5.2: Successfully uninstalled thinc-6.5.2 Running setup.py install for thinc ... error Complete output from command C:\Users\Rajendra\Anaconda3\python.exe -u -c "import setuptools, tokenize;file='C:\Users\Rajendra\AppData\Local\Temp\pip-build-4yeqrukp\thinc\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\Rajendra\AppData\Local\Temp\pip-xh4_o8vg-record\install-record.txt --single-version-externally-managed --compile: Warning: The nvcc binary could not be located in your $PATH. For GPU capability, either add it to your path, or set $CUDA_HOME running install running build running build_py running build_ext building 'thinc.linalg' extension C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\Rajendra\Anaconda3\include -IC:\Users\Rajendra\AppData\Local\Temp\pip-build-4yeqrukp\thinc\include -IC:\Users\Rajendra\Anaconda3\include -IC:\Users\Rajendra\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /EHsc /Tpthinc/linalg.cpp /Fobuild\temp.win-amd64-3.6\Release\thinc/linalg.obj gcc nvcc cl : Command line warning D9024 : unrecognized source file type 'gcc', object file assumed cl : Command line warning D9027 : source file 'gcc' ignored cl : Command line warning D9024 : unrecognized source file type 'nvcc', object file assumed cl : Command line warning D9027 : source file 'nvcc' ignored linalg.cpp c1xx: fatal error C1083: Cannot open source file: 'thinc/linalg.cpp': No such file or directory error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2
----------------------------------------
Rolling back uninstall of thinc Command "C:\Users\Rajendra\Anaconda3\python.exe -u -c "import setuptools, tokenize;file='C:\Users\Rajendra\AppData\Local\Temp\pip-build-4yeqrukp\thinc\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\Rajendra\AppData\Local\Temp\pip-xh4_o8vg-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\Rajendra\AppData\Local\Temp\pip-build-4yeqrukp\thinc\
@mraduldubey Word vectors are unfortunately the main "known broken" thing at the moment. Have a look at spacy/vectors.pyx . You might be able to fill it in?
@honnibal I'll look into it.
Hi @honnibal, Amazing progress so far on v2! I'm looking forward to the release.
I'm unable to get the nlp.pipe method to run (as shown in code below):
import spacy

nlp = spacy.load('en_core_web_sm')
# nlp = spacy.load('en')  # For spacy v1 --> Works perfectly

text_file = 'documents.txt'
with open(text_file, 'r') as f:
    texts = f.readlines()

for doc in nlp.pipe(texts, n_threads=16, batch_size=100):
    assert doc.is_parsed
This is the error I get on spacy v2:
/home/user/anaconda3/lib/python3.5/site-packages/spacy/_ml.py in forward(docs, drop)
248 feats = []
249 for doc in docs:
--> 250 feats.append(doc.to_array(cols))
251 return feats, None
252 model = layerize(forward)
AttributeError: 'str' object has no attribute 'to_array'
@alfonsomhc had highlighted the same error in a previous comment in this thread. Please let me know if this has been addressed or if there's something I'm missing.
@karthikmurugadoss @alfonsomhc
This was a dumb problem -- I forgot to make the docs in the .pipe() method, so it's expecting Doc objects. The bug is fixed on the develop branch, and a new version should be pushed to nightly soon. In the meantime, try this:
docs = nlp.pipe((nlp.make_doc(text) for text in texts))
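In other words, the loop from the snippet above becomes something like this until the fix lands in the nightly (a sketch, reusing the texts list from that snippet):

docs = nlp.pipe((nlp.make_doc(text) for text in texts), n_threads=16, batch_size=100)
for doc in docs:
    assert doc.is_parsed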
@slavaGanzin Agreed --- will have an update out soon over the weekend. I'd hoped to circle back to this sooner, so thanks for your patience :)
Hi @ines @honnibal, "GPU utilization in nvidia settings showing 0 for the NER training example": I am trying to run spaCy v2 with a GPU (GeForce 840M). gpu_ops are built, and I executed the sample NER training code at https://alpha.spacy.io/docs/usage/training-ner#example. It runs perfectly, but the GPU utilization is 0. I checked the CUDA installation, the path, and whether gpu_ops were built. Is it supposed to not use the GPU, or could I be missing something?
You'll need to pass device=0 to the nlp.begin_training method, to use GPU 0.
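So in the NER training example, the call would look roughly like this (a sketch: get_data is just a placeholder for whatever data callable the example already passes, and the exact signature may differ in the alpha):

optimizer = nlp.begin_training(get_data, device=0)  # device=0 selects the first GPU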
Thanks @honnibal , that did it.
We're very excited to finally publish the first alpha pre-release of spaCy v2.0. It's still an early release and (obviously) not intended for production use. You might come across a NotImplementedError – see the release notes for the implementation details that are still missing.
This thread is intended for general discussion, feedback and all questions related to v2.0. If you come across more complex bugs, feel free to open a separate issue.
Quickstart & overview
The most important new features include the new Matcher and language processing pipelines.
Installation
spaCy v2.0.0-alpha is available on pip as spacy-nightly. If you want to test the new version, we recommend setting up a clean environment first. To install the new model, you'll have to download it with its full name, using the --direct flag.
Alpha models for German, French and Spanish are coming soon!
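For convenience, the install and direct model download commands (the same ones used elsewhere in this thread):

pip install spacy-nightly
python -m spacy download en_core_web_sm-2.0.0-alpha --direct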
Now on to the fun part – stickers!
We just got our first delivery of spaCy stickers and want to share them with you! There's only one small favour we'd like to ask. The part we're currently behind on is the tests – this includes our test suite as well as in-depth testing of the new features and usage examples. So here's the idea:
Submit a PR with your test to the develop branch – if the test covers a bug and currently fails, mark it with @pytest.mark.xfail. For more info, see the test suite docs. Once your pull request is accepted, send us your address via email or private message on Gitter and we'll mail you stickers.
If you can't find anything, don't have time or can't be bothered, that's fine too. Posting your feedback on spaCy v2.0 here counts as well. To be honest, we really just want to mail out stickers.