Hebrew Language is being detected with its old code "iw" instead of "he"

aboSamoor / polyglot

Multilingual text (NLP) processing toolkit

http://polyglot-nlp.com

Hebrew Language is being detected with its old code "iw" instead of "he" #64

Open yotammanor opened 8 years ago

yotammanor commented 8 years ago

In [200]: polyglot.__version__
Out[200]: '16.07.04'

hebrew_text = u'זהו משפט בשפה העברית' #this is a sentence in Hebrew
Text(hebrew_text).language.code
Out[184]: 'iw'

This problem affects work greatly, since "he" language is significantly better supported in polyglot (and elsewhere) than "iw".

Thanks for all the hard work!

alantian commented 8 years ago

Hi @yotammanor,

Glad to hear that Polyglot helps.

Polyglot relies on ICU for language detection, which is sometimes not very accurate (e.g. for short text). In these cases, and in general whenever the language is known, you can tell Text the language code using the hint_language_code parameter.

For example:

>>> hebrew_text = u'זהו משפט בשפה העברית' #this is a sentence in Hebrew
>>> Text(hebrew_text, hint_language_code='he').language.code
'he'

yotammanor commented 8 years ago

Thanks for the usage tip!

However, let me try to explain again what I meant by raising this issue: iw is an old code for Hebrew that was replaced with he (or so I've read). The language detection itself is correct! It's just that the code it outputs is outdated, and that causes other problems downstream.

Reference: http://xml.coverpages.org/iso639a.html

HEBREW HE SEMITIC [*Changed 1989 from original ISO 639:1988, IW] ..

The identifier for Hebrew was changed from "iw" to "he".

matanox commented 6 years ago

This problem affects work greatly, since "he" language is significantly better supported in polyglot (and elsewhere) than "iw".

I am not at all sure that he is really better supported, or even well supported, in polyglot. For example, it seems that a part-of-speech model is available for neither he nor iw.

from polyglot.text import Text
from polyglot.detect import Detector

# blob = """We will meet at eight o'clock on Thursday morning."""
blob = u"""אנחנו ניפגש בשמונה בבוקר ביום חמישי בבוקר"""

text = Text(blob, hint_language_code='he')

print(text.language.code) # okay

text.pos_tags

ValueError: Package 'pos2.he' not found in index

And this package can't be downloaded:

polyglot download pos2.he

[polyglot_data] Error loading pos2.he: Package 'pos2.he' not found in
[polyglot_data]     index
Error installing package. Retry? [n/y/e]

Since other packages download normally, I am quite certain this is not just a connection or server availability problem.

Liggliluff commented 5 years ago

I found this old issue, and yotammanor is correct. The code iw is deprecated and will be taken out of use, and has been replaced with he. All outputs should be switched from iw to he, but it should still accept both as input.
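The behaviour described above (accept both codes as input, always output the current one) could be sketched roughly as follows. This is only an illustration, not polyglot's or pycld2's actual code: the names `DEPRECATED_CODES` and `normalize_language_code` are made up for this example, and the table covers only the three ISO 639 codes that were withdrawn in 1989.

```python
# Hypothetical helper, not part of polyglot or pycld2.
# Deprecated ISO 639 two-letter codes and their current replacements.
DEPRECATED_CODES = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
    "ji": "yi",  # Yiddish
}


def normalize_language_code(code):
    """Return the current code for a possibly deprecated language code."""
    code = code.strip().lower()
    # Unknown or already-current codes pass through unchanged.
    return DEPRECATED_CODES.get(code, code)


print(normalize_language_code("iw"))
print(normalize_language_code("he"))
```

Both calls print `he`, so callers that still pass iw keep working while everything downstream only ever sees he.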

dudio92 commented 2 years ago

Hi @yotammanor,

Glad to hear that Polyglot helps.

Polyglot relies on ICU for language detection, which is sometimes not very accurate (e.g. for short text). In these cases, and in general whenever the language is known, you can tell Text the language code using the hint_language_code parameter.

For example:
>>> hebrew_text = u'זהו משפט בשפה העברית' #this is a sentence in Hebrew
>>> Text(hebrew_text, hint_language_code='he').language.code
'he'

Hi, when you pass 'he' as the hint parameter for Hebrew, you'll get a different result and won't get 'iw'. (Check out cld2/internal/compact_lang_det_hint_code.cc)

Anyway, I have created a PR to fix it and update the language code. https://github.com/aboSamoor/pycld2/pull/47

oatmealm commented 2 years ago

Hi. I'm still seeing this error, even with the hint workaround. Is there a quick fix I can use in the meantime? Thanks

from polyglot.downloader import downloader
from polyglot.text import Text
downloader.supported_tasks(lang="he")

['sgns2',
 'ner2',
 'counts2',
 'transliteration2',
 'embeddings2',
 'sentiment2',
 'tsne2',
 'morph2']

blob = """זו בדיקה של מחרוזת בעברית לזיהוי אוטומטי של יישויות."""
text = Text(blob, hint_language_code='he')

text.entities

for sent in text.sentences:
  print(sent, "\n")
  for entity in sent.entities:
    print(entity.tag, entity)

[...]

File ~/code/polyglot-playground/.venv/lib/python3.10/site-packages/polyglot/downloader.py:933, in Downloader.info(self, id)
    931 if id in self._packages: return self._packages[id]
    932 if id in self._collections: return self._collections[id]
--> 933 raise ValueError('Package %r not found in index' % id)

ValueError: Package 'embeddings2.iw' not found in index
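The traceback suggests that the package id is built from the detected code (iw) rather than from the hint, so the lookup asks for embeddings2.iw even though embeddings2 is listed for he. A minimal sketch of a lookup that tolerates the legacy code is shown below. It is an assumption-laden illustration, not polyglot's downloader: `LEGACY_ALIASES` and `resolve_package_id` are hypothetical names, and `index` is a plain set standing in for the real package index.

```python
# Hypothetical sketch; polyglot's Downloader does not expose this function.
LEGACY_ALIASES = {"iw": "he"}  # legacy code -> current code


def resolve_package_id(task, detected_code, index):
    """Return a package id that exists in `index`, preferring the current code."""
    current = LEGACY_ALIASES.get(detected_code, detected_code)
    # Try the current code first, then fall back to the code as detected.
    for code in (current, detected_code):
        candidate = "%s.%s" % (task, code)
        if candidate in index:
            return candidate
    missing = "%s.%s" % (task, detected_code)
    raise ValueError("Package %r not found in index" % missing)


# Stand-in index containing two of the tasks listed for "he" above.
index = {"embeddings2.he", "ner2.he"}
print(resolve_package_id("embeddings2", "iw", index))
```

With this stand-in index the call prints `embeddings2.he` instead of raising. A task that is missing for both codes, such as pos2, still raises the same ValueError.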