Open yotammanor opened 8 years ago
Hi @yotammanor,
Glad to hear that Polyglot helps.
Polyglot relies on ICU for language detection, which is sometimes not very accurate (e.g. for short text). In these cases, and in general whenever the language is known, you can pass the language code to Text using the hint_language_code parameter.
For example:
>>> hebrew_text = u'זהו משפט בשפה העברית' #this is a sentence in Hebrew
>>> Text(hebrew_text, hint_language_code='he').language.code
'he'
Thanks for the usage tip!
That said, let me try to explain again what I meant by raising this issue: iw is an old code for Hebrew that was replaced with he (or so I've read). The language detection actually detects the language correctly! It's just that the code it outputs is incorrect (and that causes other problems, etc.)
Reference: http://xml.coverpages.org/iso639a.html
HEBREW HE SEMITIC [*Changed 1989 from original ISO 639:1988, IW] ..
The identifier for Hebrew was changed from "iw" to "he".
This problem affects my work greatly, since the "he" code is significantly better supported in polyglot (and elsewhere) than "iw".
I am not at all sure that he is really better supported, or even well supported, in polyglot. For example, it seems that a part-of-speech model is available neither for he nor for iw.
from polyglot.text import Text
from polyglot.detect import Detector
# blob = """We will meet at eight o'clock on Thursday morning."""
blob = u"""אנחנו ניפגש בשמונה בבוקר ביום חמישי בבוקר"""
text = Text(blob, hint_language_code='he')
print(text.language.code) # okay
text.pos_tags
ValueError: Package 'pos2.he' not found in index
And this package can't be downloaded:
polyglot download pos2.he
[polyglot_data] Error loading pos2.he: Package 'pos2.he' not found in
[polyglot_data] index
Error installing package. Retry? [n/y/e]
Since other packages download normally, I am fairly certain this is not just a connection or server availability problem.
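As a rough sketch of how to anticipate this failure before touching text.pos_tags: the error messages in this thread suggest package ids have the form task.lang (e.g. pos2.he), so a required task can be compared against the task list for the language. The list below is copied from the downloader.supported_tasks(lang="he") output reported elsewhere in this thread; the helper names package_id and missing_tasks are hypothetical and not part of polyglot.

```python
# Tasks reported in this thread for Hebrew by
# downloader.supported_tasks(lang="he"); copied here so the sketch
# runs without polyglot installed.
SUPPORTED_TASKS_HE = [
    "sgns2", "ner2", "counts2", "transliteration2",
    "embeddings2", "sentiment2", "tsne2", "morph2",
]


def package_id(task, lang):
    """Build an id in the 'task.lang' form seen in the error messages."""
    return "{}.{}".format(task, lang)


def missing_tasks(required, supported):
    """Return the required tasks that are absent from the supported list."""
    return [task for task in required if task not in supported]


# Hypothetical check: POS tagging needs 'pos2', entity extraction needs 'ner2'.
missing = missing_tasks(["pos2", "ner2"], SUPPORTED_TASKS_HE)
print([package_id(task, "he") for task in missing])  # ['pos2.he']
```

With a live install, the hard-coded list would presumably be replaced by the result of downloader.supported_tasks(lang="he").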
I found this old issue, and yotammanor is correct. The code iw is deprecated and will be taken out of use; it has been replaced with he. All outputs should be switched from iw to he, but both codes should still be accepted as input.
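A minimal sketch of that behaviour, assuming a simple lookup table is enough: accept either code as input and always emit the current one. The function name normalize_language_code and the table are hypothetical, not part of polyglot or pycld2. The two extra entries (in and ji) are the other ISO 639 codes replaced in the same 1989 revision and are included only for illustration.

```python
# Deprecated ISO 639 two-letter codes and their 1989 replacements.
DEPRECATED_LANGUAGE_CODES = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
    "ji": "yi",  # Yiddish
}


def normalize_language_code(code):
    """Map a possibly deprecated language code to its current form.

    Codes that are not in the table are returned unchanged (lowercased).
    """
    key = code.strip().lower()
    return DEPRECATED_LANGUAGE_CODES.get(key, key)


print(normalize_language_code("iw"))  # he
print(normalize_language_code("he"))  # he
```

Until the detector itself is fixed, a wrapper like this could be applied to the detected code before it is used to look up packages, though that would not help where polyglot builds the package id internally.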
Hi, when you pass 'he' as the hint parameter for Hebrew, you get a different result and won't get 'iw'. (Check out cld2/internal/compact_lang_det_hint_code.cc)
Anyway, I have created a PR to fix it and update the language code. https://github.com/aboSamoor/pycld2/pull/47
Hi. I'm still seeing this error, even with the hint workaround. Is there a quick fix I can use in the meantime? Thanks.
from polyglot.downloader import downloader
downloader.supported_tasks(lang="he")
['sgns2',
'ner2',
'counts2',
'transliteration2',
'embeddings2',
'sentiment2',
'tsne2',
'morph2']
from polyglot.text import Text
blob = """זו בדיקה של מחרוזת בעברית לזיהוי אוטומטי של יישויות."""
text = Text(blob, hint_language_code='he')
text.entities
for sent in text.sentences:
    print(sent, "\n")
    for entity in sent.entities:
        print(entity.tag, entity)
[...]
File ~/code/polyglot-playground/.venv/lib/python3.10/site-packages/polyglot/downloader.py:933, in Downloader.info(self, id)
931 if id in self._packages: return self._packages[id]
932 if id in self._collections: return self._collections[id]
--> 933 raise ValueError('Package %r not found in index' % id)
ValueError: Package 'embeddings2.iw' not found in index
In [200]: polyglot.__version__
Out[200]: '16.07.04'
hebrew_text = u'זהו משפט בשפה העברית' #this is a sentence in Hebrew
Text(hebrew_text).language.code
Out[184]: 'iw'
Thanks for all the hard work!