aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com
Other
2.3k stars, 337 forks

issue while running NER for Hindi language #172

Open kusumlata123 opened 5 years ago

kusumlata123 commented 5 years ago

I ran the following program:

    import os
    from polyglot.text import Text

    blob = """ बृहस्पतिवार को पणज़ी में शुरू हुए ३६वें अंतर्राष्ट्री फिल्म महोत्सव के रंग में भंग उस समय पड़ा जब वहां पर तैनात सुरक्षाकर्मि ने बॉलीवुड की अभिनेत्री बिपाशा बसु के साथ दुव्यर्वहार किया । """
    text = Text(blob, hint_language_code='hi')
    print(text.entities)

I got the following output:

    [I-PER(['सुरक्षाकर्मि']), I-PER(['बिपाशा', 'बसु'])]

In my view, पणज़ी (Panaji) should be tagged as a location (LOC) and बॉलीवुड (Bollywood) as an organization (ORG). These should also be recognized as named entities.

sajalcody commented 5 years ago

A Hindi word can have several spellings. Panaji can be written as पणजी, पनजी or पणज़ी. Polyglot is trained on only one of these spellings. If you replace पणज़ी with पणजी, polyglot will identify it as a location.
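A rough sketch of that substitution, assuming the only difference between the two spellings is the nukta (U+093C, the dot under ज़). The helper names `strip_nukta` and `entities_for` are hypothetical and not part of polyglot. Stripping every nukta is a blunt workaround: it also changes letters such as ड़ and फ़ in other words, so it may not suit every text.

```python
import unicodedata

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def strip_nukta(text: str) -> str:
    """Return text with every Devanagari nukta removed (hypothetical helper)."""
    # NFD splits precomposed letters such as U+095B into base letter + nukta,
    # so both encodings of the same spelling are handled the same way.
    decomposed = unicodedata.normalize("NFD", text)
    without_nukta = decomposed.replace(NUKTA, "")
    # Recompose anything else that NFD split apart (e.g. accented Latin letters).
    return unicodedata.normalize("NFC", without_nukta)


def entities_for(text: str):
    """Run polyglot NER on nukta-stripped Hindi text (hypothetical helper)."""
    # Imported lazily so strip_nukta can be used without polyglot installed.
    from polyglot.text import Text

    return Text(strip_nukta(text), hint_language_code="hi").entities
```

Calling `entities_for(blob)` on the sentence from the original report would then pass the पणजी spelling to polyglot. Whether any other entity in the sentence is affected has not been checked here.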

Bollywood is not the name of a company or organization, which may be why polyglot has not identified it as an ORG entity.

kusumlata123 commented 3 years ago

polyglot download embeddings2.hi ner2.hi download LANG:hi

    [polyglot_data] Error loading embeddings2.hi: <urlopen error [Errno
    [polyglot_data]     110] Connection timed out>
    Error installing package. Retry? [n/y/e]