Hi @aginpatrick, sorry about the late response; GitHub didn't send me an email for this issue. :/
The spacy_utils.preserve_case function relies on POS information to make its decision, so yes, the document does need to be POS-tagged. That said, default usage will apply POS tagging, i.e. doc = textacy.Doc(content, lang='en') gives a document whose tokens have POS tags.
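As a quick sanity check, something like this should show non-empty tags (a minimal sketch, assuming the textacy 0.3.x API used in this thread and that the wrapped spacy document is exposed as doc.spacy_doc):
import textacy
# Default usage runs the full spacy pipeline, so the tokens come back POS-tagged.
doc = textacy.Doc('This is a sentence.', lang='en')
# Assumption: in textacy 0.3.x the underlying spacy document is available as doc.spacy_doc.
print([(tok.orth_, tok.tag_) for tok in doc.spacy_doc])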
Could you show me how to reproduce this error? Also, what versions of textacy, spacy, and Python are you using?
Hi Burton, no problem about the late response! Following the example section on GitHub, I did:
In [1]: from textacy.keyterms import *
In [2]: import textacy
In [3]: cw = textacy.corpora.CapitolWords()
In [4]: docs = cw.records(speaker_name={'Hillary Clinton', 'Barack Obama'})
In [5]: content_stream, metadata_stream = textacy.fileio.split_record_fields(docs, 'text')
In [6]: corpus = textacy.Corpus('en', texts=content_stream, metadatas=metadata_stream)
In [7]: doc = corpus[-1]
The sgrank (or textrank) function raised the ValueError exception because doc is not POS-tagged. Thanks again, Patrick
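Concretely, the failing call was the one from the example usage:
textacy.keyterms.textrank(doc, n_keyterms=10)
# ValueError: token is not POS-tagged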
Hi Patrick, which versions of textacy and spacy are you using? If you use spacy directly, do you see the same problem? For example, what does the following return?
import spacy
nlp = spacy.load('en')
doc = nlp('This is a sentence. Here is another one.')
doc.is_tagged
When I ran through the lines you posted above and then called textacy.keyterms.textrank(doc), it correctly returned results.
textacy: 0.3.1 and spacy: 1.1.2. The result returned by the above commands is False (doc is not POS-tagged).
Okay, now we're getting somewhere: the problem isn't in textacy, it's in spacy. Those commands should return a POS-tagged document.
Have you downloaded spacy's nlp models via $ python -m spacy.en.download? What do the following lines produce:
print([tok.orth_ for tok in doc])
print([tok.tag_ for tok in doc])
In [26]: print([tok.orth_ for tok in doc])
['This', 'is', 'a', 'sentence', '.', 'Here', 'is', 'another', 'one', '.']
In [27]: print([tok.tag_ for tok in doc])
['', '', '', '', '', '', '', '', '', '']
Okay, just confirming. Have you downloaded spacy's nlp models?
I don't remember; I probably just pip installed spacy.
No data dir in my /usr/local/lib/python3.4/dist-packages/spacy/en directory. So no, I didn't download the nlp models. Sorry about that.
No problem! Glad we figured it out. This is a known issue, actually: https://github.com/explosion/spaCy/issues/578
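For anyone who hits this later: once the models are downloaded, the same check should flip to True. A minimal sketch (spacy 1.x, as used in this thread):
import spacy
# Assumes the model data has been downloaded, e.g. via python -m spacy.en.download.
nlp = spacy.load('en')
doc = nlp('This is a sentence. Here is another one.')
print(doc.is_tagged)              # should now be True
print([tok.tag_ for tok in doc])  # real tags such as 'DT', 'VBZ', ... instead of ''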
From reading that post, I understand that since spacy version 1.0 the language data is packaged into the code itself, so I can get basic usage without downloading the data, is that right?
AFAIK only the tokenizer is included within the code itself, while the POS tagger, parser, entity recognizer, etc. must be downloaded separately. That's why you got tokens via [tok.orth_ for tok in doc], but they did not have POS tags, as shown by [tok.tag_ for tok in doc].
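If you want your own scripts to fail fast, a rough guard like this (again a sketch against the spacy 1.x API) makes the missing-data case obvious before textacy ever sees the document:
import spacy
nlp = spacy.load('en')
doc = nlp('This is a sentence.')
# Without the downloaded model data only the tokenizer runs, so POS tags stay empty.
if not doc.is_tagged:
    raise RuntimeError('spacy doc has no POS tags; run python -m spacy.en.download first')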
Hi, I followed all the steps in the example section, but the textacy.keyterms.textrank(doc, n_keyterms=10) function returns the following error:
ValueError: token is not POS-tagged
The error is raised in the spacy_utils.preserve_case function. I suppose that you have to run another function first to POS-tag the document, am I right?