chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Document how to make textacy.Doc(text) use default model if no language detected #233

Closed Motorrat closed 5 years ago

Motorrat commented 5 years ago

Textacy fails to detect the language of some texts, which makes spaCy throw an exception for a missing model name. A fix does not seem to be documented yet.

Expected Behavior

Tell the user in the documentation about the possibility of creating a symlink for the 'un' language, for example

    ln -s /home/user/venv/full/lib/python3.6/site-packages/de_core_news_sm /home/user/venv/full/lib/python3.6/site-packages/spacy/data/un

or accept a default_lang parameter in the textacy.Doc(text) signature, e.g. textacy.Doc(text, default_lang='de').

Current Behavior

Currently it fails with the following stack trace:

    doc = textacy.Doc(text)
  File "/home/user/venv/full/lib/python3.6/site-packages/textacy/doc.py", line 114, in __init__
    self._init_from_text(content, metadata, lang)
  File "/home/user/venv/full/lib/python3.6/site-packages/textacy/doc.py", line 136, in _init_from_text
    spacy_lang = cache.load_spacy(langstr)
  File "/home/user/venv/full/lib/python3.6/site-packages/cachetools/__init__.py", line 46, in wrapper
    v = func(*args, **kwargs)
  File "/home/user/venv/full/lib/python3.6/site-packages/textacy/cache.py", line 99, in load_spacy
    return spacy.load(name, disable=disable)
  File "/home/user/venv/full/lib/python3.6/site-packages/spacy/__init__.py", line 21, in load
    return util.load_model(name, **overrides)
  File "/home/user/venv/full/lib/python3.6/site-packages/spacy/util.py", line 119, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'un'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Context

I'm working with a collection of documents in various languages. The program stops with an exception, whereas I would rather fall back to a default language for that document and carry on.

Your Environment

textacy.utils.print_markdown(textacy.utils.get_config())

bdewilde commented 5 years ago

Hey @Motorrat, thanks for reporting this. I'm honestly not sure what most users would want when automatic language detection fails (e.g. returns "un" for "unknown") and they don't know a priori what the language should be. It seems like

import textacy

try:
    doc = textacy.Doc(text)
except IOError:
    # language detection failed or no model is installed: do something else
    doc = None

would be a safer bet than applying a default language model to whatever text fails language detection. This also works for languages that are correctly detected but don't have models available, right?

Would it be sufficient to mention this somewhere in the docs, and point folks to the external spacy model download/linking documentation?
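For anyone who does want the fallback behaviour requested above, here is a minimal sketch built on the try/except suggestion. It assumes that textacy.Doc accepts a lang keyword (the traceback shows lang being passed through __init__) and that the fallback model, here 'de', is actually installed. The helper name doc_with_fallback and its parameters are hypothetical and not part of textacy; the constructor is passed in as an argument so that the sketch itself does not depend on a particular textacy version.

```python
def doc_with_fallback(text, make_doc, default_lang="de"):
    """Build a doc, retrying with an explicit language if model loading fails.

    make_doc is a callable such as textacy.Doc (an assumption: it must accept
    a `lang` keyword). default_lang is a hypothetical fallback choice and
    should name a spaCy model or shortcut link that is installed.
    """
    try:
        # First attempt: rely on automatic language detection.
        return make_doc(text)
    except IOError:
        # spaCy raises OSError (the same class as IOError on Python 3) when
        # it cannot find a model, e.g. for the detected language 'un'.
        return make_doc(text, lang=default_lang)
```

It would be called as doc = doc_with_fallback(text, textacy.Doc, default_lang='de'). Note that this silently applies the default model to any text whose language cannot be resolved, which is exactly the trade-off questioned above, so logging the affected documents is probably worthwhile.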

bdewilde commented 5 years ago

Just added details to the docs! Hope that helps.