explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

💫 Indic language tokenizers (Hindi, etc) #641

Closed honnibal closed 5 years ago

honnibal commented 7 years ago

I think there are now quite a lot of Indian users of spaCy. Let's get started on the tokenizers :).

Hindi is whitespace-delimited, right? The docs for adding tokenizers can be found here: https://spacy.io/docs/usage/customizing-tokenizer
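
For anyone getting started, tokenizing whitespace-delimited Hindi text would look roughly like this (a sketch only; it assumes a Hindi entry ends up registered under the "hi" language code):

    import spacy

    # Sketch: assumes a Hindi entry registered under the "hi" language code.
    nlp = spacy.blank("hi")
    doc = nlp("मैं कल दिल्ली जा रहा हूँ।")
    print([token.text for token in doc])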

souravsingh commented 7 years ago

@honnibal We can take inspiration from the repo here-https://github.com/anoopkunchukuttan/indic_nlp_library

kaustubhn commented 7 years ago

Hi, I have started working on an Indian language model for Marathi. Things I have:

  1. A corpus of news articles gathered from news sites, about 3 GB in size
  2. Basic understanding of Natural language processing and Machine Learning

Things required to implement a language in spaCy:

  1. Word Frequencies (a rough counting sketch is at the end of this comment)
  2. Brown Clusters
  3. Word Vectors
  4. Stop Words List

Things I have done so far:

  1. Brown Clusters

I am working towards getting the remaining things.

I wanted to know a few things.

  1. Am I going in the right direction as far as implementing a language in spaCy?
  2. Is the data enough to implement a language model?
  3. How can I reduce the time needed to generate Brown clusters? Generating 500 clusters on 3 GB of data took around 50 hours.

Other data that I have is for languages Hindi and Tamil.
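
For the word-frequency item above, here is a rough counting sketch over a raw, whitespace-delimited corpus file (the file name is just a placeholder):

    from collections import Counter

    # Rough sketch: count whitespace-delimited token frequencies in a raw
    # corpus file ("marathi_news.txt" is a placeholder path).
    freqs = Counter()
    with open("marathi_news.txt", encoding="utf-8") as f:
        for line in f:
            freqs.update(line.split())

    # Print the 20 most frequent tokens
    for word, count in freqs.most_common(20):
        print(word, count)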

SanketSKasar commented 7 years ago

@kaustubhn Can you please update with the progress? I am interested, especially for Marathi and Hindi but need some guidance to get started with it.

kaustubhn commented 7 years ago

@SanketSKasar Is there any way we can communicate over email? I would like to share the process and files. Let me know how to get in touch.

gauravgr commented 7 years ago

@kaustubhn I'm interested in contributing to this effort. Can we get in touch?

kaustubhn commented 7 years ago

@SanketSKasar @gauravgr You can remove your comments (email).

ines commented 6 years ago

I've just added a simple Language class for Hindi over on a feature branch of develop (v2.0.0): https://github.com/explosion/spaCy/tree/feature/hindi-tokenizer/spacy/lang/hi

It currently uses spaCy's basic tokenizer, adds stop words and a simple function setting a token's NORM attribute to the word stem, if available (adapted from here / here).
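
Trying it out on a sentence looks roughly like this (a sketch; it assumes a source build of the feature branch):

    # Sketch: trying the new Hindi class on an example sentence.
    from spacy.lang.hi import Hindi

    nlp = Hindi()
    doc = nlp("मैंने कल एक नई किताब खरीदी।")
    print([token.text for token in doc])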

Does this look reasonable? And does anyone have a few example sentences (text + expected tokens or text + expected stems) that we can try it on? I also hope that having the basics set up will make it easier for others to contribute more functionality in the future!

CC: @kaustubhn, @gauravgr, @SanketSKasar, @souravsingh

kaustubhn commented 6 years ago

@ines I went through the code for the Hindi language tokenizer and it looks pretty reasonable. I would also like to point out that in Hindi the '|' sign is sometimes used to mark the end of a sentence. I have seen it used on many news websites, so I think having the tokenizer treat '|' as a full stop (as in English) would be great.

Also, I had done some half-finished work. Here are the lemma rules you can add; they are derived from the Devanagari script and apply to two Indic languages (Hindi and Marathi):

    LEMMA_RULES = {
        "noun": [
            # Hindi digit representations
            ["०", "0"],
            ["१", "1"],
            ["२", "2"],
            ["३", "3"],
            ["४", "4"],
            ["५", "5"],
            ["६", "6"],
            ["७", "7"],
            ["८", "8"],
            ["९", "9"],
        ],
        "punct": [
            ["“", "\""],
            ["”", "\""],
            ["\u2018", "'"],
            ["\u2019", "'"]
        ]
    }

ines commented 6 years ago

@kaustubhn Thanks a lot, this is super helpful!

I've noticed the । character and added it to the base punctuation rules already (see e85e1d571b834d35922a816e1886cfc74cdf50d8) – however, this is a different unicode character from |, i.e. the actual pipe character. Do people use both, depending on keyboards etc.? If so, we should probably add the pipe character to the punctuation rules as well, at least for Hindi.

The digit representations could also be useful for adding a custom like_num attribute (overwriting the getters for the lexical attributes is now really easy in v2.0). Maybe even the is_digit attribute, or is that weird? Like, would you expect a token with the text "३" to return True for is_digit? (Or should is_digit behave like Python's isdigit()? I guess this is more of a philosophical question.)
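
For illustration, a custom like_num along these lines might look like the sketch below (my sketch only, not the code that was merged):

    # Sketch: treat both ASCII and Devanagari digits as number-like.
    _DEVANAGARI_DIGITS = "०१२३४५६७८९"

    def like_num(text):
        text = text.replace(",", "").replace(".", "")
        return bool(text) and all(
            ch in "0123456789" or ch in _DEVANAGARI_DIGITS for ch in text
        )

    # In v2.0 language data, a getter like this is typically registered in
    # lex_attrs.py, e.g. LEX_ATTRS = {LIKE_NUM: like_num} (sketch only).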

kaustubhn commented 6 years ago

@ines I went through a subset of Hindi newspapers to check whether the pipe is used. It is indeed not used; I may have misread it earlier. Some newspapers use the full stop (.) as a sentence terminator. I think we can safely ignore the '|' (pipe) character and not add it to the punctuation rules.

Q) Would you expect a token with the text "३" to return True for is_digit? I would say yes; I would expect it to return True for is_digit, since it is the digit equivalent of 3 in English. I'm not sure about the is_digit vs. isdigit() behaviour; I guess whatever feels intuitive should be the right approach.

ines commented 6 years ago

Update, see PR in #1425 🎉

This pull request adds a basic Hindi Language class to support tokenization with spaCy. It also includes a getter for the NORM attribute that adds the stem word if available (adapted from here). Since Hindi support has been requested a lot in the past, I hope this will make it easier for others to contribute and improve the language data. (I don't know Hindi, so I was only able to add the very basics – feedback and contributions are appreciated!)

Todos and further ideas

ajkl commented 6 years ago

Happy to help here. Do most of the hindi support discussions happen on this thread?

kaustubhn commented 6 years ago

@ajkl Yes, you can pitch in your suggestions and queries here; let's keep this thread active for discussions. I haven't come across any other issue with Indic-language-related discussions. Also, @ines would redirect any other related issues to this thread.

abhi18av commented 6 years ago

Hi guys, I'm new to spaCy and I was trying out the develop branch to improve the Hindi language support.

I cloned it, and pip install -r requirements.txt tells me all requirements are already satisfied.

But it throws the error ModuleNotFoundError: No module named 'spacy.symbols', which doesn't make sense since the module is obviously right there.

Could you help me out ?

ines commented 6 years ago

@abhi18av After installing the requirements, you also need to compile spaCy – i.e. build the Cython source into extension modules. Otherwise, spaCy won't be able to find the Cython modules. Just run the following from the spacy directory:

python setup.py build_ext --inplace
export PYTHONPATH=`pwd`

The last command makes sure your PYTHONPATH is set to the spaCy directory.

Thanks for helping out with Hindi btw!

abhi18av commented 6 years ago

@ines you're most welcome

VineethKanaparthi commented 6 years ago

Hi @kaustubhn, @abhi18av and @ines. I took up a project on NLP for Indic languages. Is there a stop words list and punctuation list I can use? I understand that lemmatization needs more contributions. What are you all working on, and what is the checklist for completing this Hindi language support feature?

VineethKanaparthi commented 6 years ago

I found a source for high-frequency word lists and corpora for Indic languages:

https://ltrc.iiit.ac.in/showfile.php?filename=ltrc/internal/nlp/corpus/index.html

Indic language datasets:

https://ltrc.iiit.ac.in/showfile.php?filename=downloads/kolhi/

muleyprasad commented 6 years ago

Thank you so much, Vineeth. In my experiments, I created my own for Marathi. If you want, I can share it with you.

ines commented 6 years ago

@VineethKanaparthi Here's the current state of Hindi support in spaCy:

https://github.com/explosion/spaCy/tree/master/spacy/lang/hi

We already have a stop list and a stemmer (see the NORM attribute in lex_attrs.py).
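
A quick way to poke at what's already there (a sketch; assumes a source install that includes the hi language data):

    # Sketch: inspect the stop flag and NORM-based stem for each token.
    from spacy.lang.hi import Hindi

    nlp = Hindi()
    doc = nlp("और लड़कों ने आम खाए")
    for token in doc:
        print(token.text, token.norm_, token.is_stop)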

VineethKanaparthi commented 6 years ago

@muleyprasad I'm working on Hindi, Malayalam and Tamil.

@ines Thank you. What are the next steps for this feature?

aashishg commented 6 years ago

Hey guys, any progress on this? Can someone please share their emails so we can maybe collaborate and complete this?

VineethKanaparthi commented 6 years ago

Hi @aashishg . Here's mine. vineeth0025@gmail.com.

aashishg commented 6 years ago

@VineethKanaparthi Sent you a mail. Please check.

ank-26 commented 6 years ago

@aashishg Can you add me too, ankita.arora2609@gmail.com

abhi18av commented 6 years ago

Me too ;)

abhinav@fourtek.com


romass12 commented 6 years ago

Is there a pretrained English-to-Hindi translation model for spaCy?

talusannni commented 5 years ago

Can we translate English data in spaCy into other languages and use it?

Shashi456 commented 5 years ago

@muleyprasad Could you send me the list of stop words and punctuation you have used for Marathi? My email is pawansasanka@gmail.com.

Shashi456 commented 5 years ago

@ines @honnibal Could you add basic Marathi tokenization as well? It's a language very close to Hindi, apart from a few extra words and stem suffixes. The stemmer could be ported from here and here; the latter was adapted from the same paper you mentioned for the Hindi support. The stem suffixes mentioned in the latter are listed below; although not a complete list, in tandem with the first source this should cover a huge part:

suffixes = {
    1: ["े", "ू", "ु", "ी", "ि", "ा", "ौ", "ै", "स", "ल", "त", "म", "अ", "त"],
    2: ["नो", "तो", "ने", "नी", "ही", "ते", "या", "ला", "ना", "ऊण", "शे", "शी", "चा", "ची", "चे", "ढा", "रु", "डे", "ती", "ान", "ीण", "डा", "डी", "गा", "ला", "ळा", "या", "वा", "ये", "वे", "ती"],
    3: ["शया", "हून"],
    4: ["ुरडा"],
}
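
Roughly, a stemmer over this table would strip the longest matching suffix; here is an untested sketch:

    # Untested sketch: strip the longest matching suffix from the table above,
    # otherwise return the word unchanged. A real stemmer would need to be more
    # careful about matra/grapheme boundaries.
    def stem(word, suffixes):
        for length in sorted(suffixes, reverse=True):
            for suffix in (s.strip() for s in suffixes[length]):
                if suffix and len(word) > len(suffix) and word.endswith(suffix):
                    return word[: -len(suffix)]
        return word

    print(stem("घराहून", suffixes))  # e.g. "घराहून" -> "घरा"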

A list of basic stop words is available here, while the numbers are available here

Should I put in a PR?

gokulnathsridhar commented 5 years ago

Hello everyone, thanks for the great work done on this already. Is someone working on a POS tagger / NER for Hindi in spaCy? I'm working on a project that requires both (or, if not both, at least a solid POS tagger). It would be great if it's tied right into spaCy, given that the tokenizer is already in place. Let me know! Thanks.

kusumlata123 commented 5 years ago

Hello, can spaCy support a POS tagger, NER and parser for Hindi? Thanks.

ines commented 5 years ago

Merging this with the master thread in #3056!

anoopkunchukuttan commented 5 years ago

Hi @kaustubhn, I just saw your comment regarding a Marathi news corpus you have crawled. Is it publicly available somewhere? Ditto for Hindi?

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.