aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com

question about domains #247

Open dataf3l opened 2 years ago

dataf3l commented 2 years ago

Hi guys, I love this library.

I have a question: sometimes I get domain names as text input, such as freizeit.com, toscanamare.com, or someexample.com. Notice that people don't separate the words in domain names, as in "frei zeit" or "toscana mare". When I use a tokenizer in order to detect the language of the domain, the tokenizer requires me to provide a language, e.g. en.

Is there a library that can, in a multi-language fashion, split a word made up of several words into subwords, taking a best guess at the language before splitting, so that this library can then do a good job of detecting the language from the text?

I googled "multi-language text split" but I'm not finding good results, so I thought maybe you guys have worked on this issue before.

Do you have any hints for me?

Bachstelze commented 2 years ago

You could try the SentencePiece model from multilingual language processing pipelines. However, such models work at the subword level, so you will get many possible ways to split each name.
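
For comparison, here is a minimal sketch of a simpler dictionary-based approach: try to segment the domain label against a word list for each candidate language and keep the language that covers the whole label with the fewest words. This is not part of polyglot or SentencePiece. The `VOCAB` word lists and the function names `segment` and `guess_language` are hypothetical and only for illustration; a real setup would need large per-language word lists and probably frequency-based scoring.

```python
# Toy vocabularies, purely illustrative (assumption: real word lists
# would be loaded from per-language dictionaries).
VOCAB = {
    "de": {"frei", "zeit", "haus", "meer"},
    "it": {"toscana", "mare", "casa", "sole"},
    "en": {"some", "example", "free", "time", "sea"},
}


def segment(text, words):
    """Split text into the fewest known words; return None if impossible."""
    n = len(text)
    best = [None] * (n + 1)  # best[i] = shortest word list covering text[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in words:
                candidate = best[j] + [text[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]


def guess_language(domain):
    """Return (language, words) for the first label of a domain name.

    Simplification: only the part before the first dot is considered.
    Returns (None, [label]) when no vocabulary covers the whole label.
    """
    label = domain.lower().split(".")[0]
    results = {}
    for lang, words in VOCAB.items():
        seg = segment(label, words)
        if seg is not None:
            results[lang] = seg
    if not results:
        return None, [label]
    lang = min(results, key=lambda l: len(results[l]))
    return lang, results[lang]


if __name__ == "__main__":
    for name in ["freizeit.com", "toscanamare.com", "someexample.com"]:
        print(name, guess_language(name))
```

The recovered words can then be joined with spaces and passed to a language detector, or the winning vocabulary can be used directly as the language guess. Labels that can be segmented in several languages would need a better tie-breaker than word count.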