Closed honnibal closed 5 years ago
@honnibal We can take inspiration from this repo: https://github.com/anoopkunchukuttan/indic_nlp_library
@kaustubhn Can you please update with the progress? I am interested, especially for Marathi and Hindi but need some guidance to get started with it.
@SanketSKasar Is there any way we can communicate over email? I would like to share the process and files. Let me know how to get in touch.
@kaustubhn I'm interested in contributing to this effort. Can we get in touch?
@SanketSKasar @gauravgr You can remove your comments (email).
I've just added a simple Language class for Hindi over on a feature branch of develop (v2.0.0): https://github.com/explosion/spaCy/tree/feature/hindi-tokenizer/spacy/lang/hi
It currently uses spaCy's basic tokenizer, adds stop words and a simple function setting a token's NORM attribute to the word stem, if available (adapted from here / here).
Does this look reasonable? And does anyone have a few example sentences (text + expected tokens or text + expected stems) that we can try it on? I also hope that having the basics set up will make it easier for others to contribute more functionality in the future!
CC: @kaustubhn, @gauravgr, @SanketSKasar, @souravsingh
@ines I went through the code for the Hindi Language tokenizer and it looks pretty reasonable. I would also like to point out that in Hindi the '|' sign is sometimes used to mark the end of a sentence. I have seen it on many news websites, so I think it would be great if the tokenizer treated '|' like the English full stop.
Also, I had done some half-finished work. Here are the lemma rules you can add; these are derived from the Devanagari script and are applicable to two Indic languages [Hindi, Marathi]:
LEMMA_RULES = {
    "noun": [
        ["०", "0"],
        ["१", "1"],
        ["२", "2"],
        ["३", "3"],
        ["४", "4"],
        ["५", "5"],
        ["६", "6"],
        ["७", "7"],
        ["८", "8"],
        ["९", "9"],
    ],
    "punct": [
        ["“", "\""],
        ["”", "\""],
        ["\u2018", "'"],
        ["\u2019", "'"]
    ]
}
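Rules like these are one-character substitutions, so one way to apply them is Python's built-in str.translate. This is only a sketch under that assumption: the names RULES and normalize are made up for the example and are not part of spaCy.

```python
# Hypothetical helper, not spaCy API: apply single-character rules like the
# ones above with str.translate.

# Devanagari digits occupy U+0966 .. U+096F, in order 0-9.
RULES = {0x0966 + i: str(i) for i in range(10)}

# Curly quotes map to their straight ASCII counterparts.
RULES.update({
    ord("\u201c"): '"',
    ord("\u201d"): '"',
    ord("\u2018"): "'",
    ord("\u2019"): "'",
})


def normalize(text):
    """Return text with Devanagari digits and curly quotes normalized."""
    return text.translate(RULES)


print(normalize("\u201c\u0968\u0966\u0967\u096d\u201d"))  # "2017"
```

Characters without a rule pass through unchanged, so the same function can be run over whole sentences.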
@kaustubhn Thanks a lot, this is super helpful!
I've noticed the । character and added it to the base punctuation rules already (see e85e1d571b834d35922a816e1886cfc74cdf50d8). However, this is a different unicode character from |, i.e. the actual pipe character. Do people use both, depending on keyboards etc? If so, we should probably add the pipe character to the punctuation rules as well, at least for Hindi.
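For anyone who wants to see the difference between the two characters, the standard library can print their code points and Unicode names. The helper name describe is just for this example.

```python
import unicodedata


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("\u0964"))  # U+0964 DEVANAGARI DANDA
print(describe("|"))       # U+007C VERTICAL LINE
```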
The digit representations could also be useful to add a custom like_num attribute (overwriting the getters for the lexical attributes is now really easy in v2.0). Maybe even the is_digit attribute, or is that weird? Like, would you expect a token with the text "३" to return True for is_digit? (Or should is_digit behave like Python's isdigit()? I guess this is more of a philosophical question.)
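On the isdigit() comparison: Python's own string methods already treat the Devanagari digits as digits, because they belong to the Unicode decimal-digit category. A quick check using only the standard library:

```python
import unicodedata

three = "\u0969"  # DEVANAGARI DIGIT THREE

print(three.isdigit())           # True
print(three.isdecimal())         # True
print(int(three))                # 3
print(unicodedata.digit(three))  # 3
```

So an is_digit that simply mirrors isdigit() would already return True here; whether that is the desired behaviour is still the open question above.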
@ines I went through a subset of Hindi newspapers to check whether the pipe is used. It is indeed not used; I may have misread it. Some newspapers use a full stop (.) to end sentences. I think we can safely ignore the '|' (pipe) character and not add it to the punctuation rules.
Q) Would you expect a token with the text "३" to return True for is_digit? I would say yes, I would expect it to return True for is_digit, since it is the digit equivalent to 3 in English. I'm not sure about the is_digit and isdigit() behaviour; I guess whatever feels intuitive should be the right approach.
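If the getters should accept only ASCII and Devanagari digits rather than every Unicode digit, a set lookup is one cheap way to check. The functions below are hypothetical and written as plain Python so they run without spaCy; how they would be registered as lexical attribute getters is not shown.

```python
# Hypothetical getters, not spaCy code.
HINDI_DIGITS = frozenset(chr(cp) for cp in range(0x0966, 0x0970))
ASCII_DIGITS = frozenset("0123456789")


def is_hindi_digit(text):
    """True if text is non-empty and contains only Devanagari digits."""
    return bool(text) and HINDI_DIGITS.issuperset(text)


def like_num(text):
    """True if text is made of ASCII or Devanagari digits.

    Commas and full stops are ignored so that "१,०००" or "3.5" count.
    """
    stripped = text.replace(",", "").replace(".", "")
    return bool(stripped) and (HINDI_DIGITS | ASCII_DIGITS).issuperset(stripped)


print(is_hindi_digit("\u0969\u096a"))  # True
print(like_num("\u0967,\u0966\u0966\u0966"))  # True
print(like_num("abc"))  # False
```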
Update: see the PR in #1425 🎉
This pull request adds a basic Hindi Language class to support tokenization with spaCy. It also includes a getter for the NORM attribute that adds the stem word if available (adapted from here). Since Hindi support has been requested a lot in the past, I hope this will make it easier for others to contribute and improve the language data. (I don't know Hindi, so I was only able to add the very basics; feedback and contributions are appreciated!)
Still to do: LIKE_NUM and IS_DIGIT getters that include the Hindi digit representations. They should probably also be included in the NORM. One challenge here is to find the best and most performant way to check whether a string consists of Hindi digits.
Happy to help here. Do most of the Hindi support discussions happen on this thread?
@ajkl Yes, you can pitch in your suggestions and queries here; let's keep this thread active for discussions. I haven't come across any other issue with Indic language related discussions. Also, @ines would redirect any other related issues to this thread.
Hi guys, I'm new to spaCy and I was trying out the develop branch to improve the Hindi language support.
I cloned it, and pip install -r requirements.txt tells me everything works fine and is pre-installed.
But it throws this error: ModuleNotFoundError: No module named 'spacy.symbols', which doesn't make sense since the module is obviously right there.
Could you help me out?
@abhi18av After installing the requirements, you also need to compile spaCy, i.e. build the Cython source into extension modules that Python can import. Otherwise, spaCy won't be able to find the Cython modules. Just run the following from the spacy directory:
python setup.py build_ext --inplace
export PYTHONPATH=`pwd`
The last command makes sure your PYTHONPATH is set to the spaCy directory.
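A quick way to confirm the build worked is to ask Python whether it can locate the module that was missing. This uses only the standard library; is_importable is a helper name made up for this example.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate module_name on the current path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. "spacy") is missing or fails
        # to import, for instance because it has not been compiled yet.
        return False


print(is_importable("spacy.symbols"))
```

If this still prints False after the build, the PYTHONPATH is the next thing to check.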
Thanks for helping out with Hindi btw!
@ines you're most welcome
Hi guys @kaustubhn @abhi18av and @ines, I took up a project on NLP for Indic languages. Is there a stop word list and a punctuation list I can use? I understand that lemmatization needs more contributions. What are you working on, and what is the checklist for completing this Hindi language support feature?
I found a source for high frequency wordlists and corpus for indic languages
https://ltrc.iiit.ac.in/showfile.php?filename=ltrc/internal/nlp/corpus/index.html
Indic languages datasets:
https://ltrc.iiit.ac.in/showfile.php?filename=downloads/kolhi/
Thank you so much Vineeth. In my experiments, I have created my own for Marathi. If you want, I can share with you.
@VineethKanaparthi Here's the current state of Hindi support in spaCy:
https://github.com/explosion/spaCy/tree/master/spacy/lang/hi
We already have a stop list and a stemmer (see the NORM attribute in lex_attrs.py).
@muleyprasad I'm working on Hindi, Malayalam and Tamil.
@ines Thank you. What are the next steps for this feature?
Hey guys, any progress on this? Can someone please share their emails so we can maybe collaborate and complete this?
Hi @aashishg . Here's mine. vineeth0025@gmail.com.
@VineethKanaparthi Sent you a mail. Please check.
@aashishg Can you add me too, ankita.arora2609@gmail.com
Me too ;)
abhinav@fourtek.com
Is there a pretrained English-to-Hindi translation model for spaCy?
Can we translate English data in spaCy into other languages and use it?
@muleyprasad Could you send me the list of stop words and punctuation you have used for Marathi? My email id is pawansasanka@gmail.com
@ines @honnibal Could you add basic Marathi tokenization as well? It's a language very close to Hindi except for a few extra words and stem suffixes. The stemmer could be ported from here and here; the latter was adapted from the same paper you mentioned for the Hindi support. The stem suffixes listed in the latter are below. It is not a complete list, but in tandem with the first source it should cover a large share of cases:
suffixes = {
    1: ["े", "ू", "ु", "ी", "ि", "ा", " ौ", " ै", "स", "ल", "त", "म", "अ", "त"],
    2: ["नो", "तो", "ने", "नी", "ही", "ते", "या", "ला", "ना", "ऊण", "शे", "शी", "गा", "गी", "गे", "ढा", "रु", "डे", "ती", "ान", " ीण", "डा", "डी", "ःा", "ला", "ळा", "या", "वा", "ये", "वे", "ती"],
    3: ["शया", "हून"],
    4: [" ुरडा"],
}
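To show how a table like this is typically used, here is a minimal longest-suffix-first stripper. It is a sketch, not the ported stemmer: SUFFIXES is a small hand-picked subset of the table above, and the rule that at least one character must remain is an assumption of this example.

```python
# Illustrative subset only; keys are suffix lengths in code points,
# following the layout of the table above.
SUFFIXES = {
    1: ["ा", "ी", "े"],
    2: ["ने", "ला", "ना"],
    3: ["हून"],
}


def stem(word):
    """Strip the longest matching suffix, keeping a non-empty stem."""
    for length in sorted(SUFFIXES, reverse=True):
        for suffix in SUFFIXES[length]:
            if len(word) > len(suffix) and word.endswith(suffix):
                return word[: -len(suffix)]
    return word


print(stem("मुंबईहून"))  # मुंबई
print(stem("रामाने"))    # रामा
print(stem("घर"))        # घर
```

Trying the longer suffixes first matters: otherwise a word ending in "ने" would only lose its final vowel sign.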
A list of basic stop words is available here, while the numbers are available here.
Should I put in a PR?
Hello everyone, thanks for the great work done on this already. Is someone working on a POS tagger / NER for Hindi in spaCy? I'm working on a project that requires both (if not both, at least a solid POS tagger). It would be great if it's tied right into spaCy, given that the tokenizer is already in place. Let me know! Thanks.
Hello, can spaCy support a POS tagger, NER and a parser for the Hindi language? Thanks.
Merging this with the master thread in #3056!
Hi, I have started working on an Indian language model for Marathi. Things I have:
- A corpus of news articles gathered from news sites, about 3 GB in size
- Basic understanding of natural language processing and machine learning
Things required to implement a language in spaCy:
- Word Frequencies
- Brown Clusters
- Word Vectors
- Stop Words List
Things I have done so far:
- Brown Clusters
I am working towards getting the remaining things.
I wanted to know a few things:
- Am I going in the right direction as far as implementing a language in spaCy?
- Is the data enough to implement a language model?
- How can I reduce the time needed to generate Brown clusters? Generating 500 clusters on the 3 GB of data took around 50 hours.
Other data that I have is for Hindi and Tamil.
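Of the items in the list above, word frequencies are the easiest to produce from a raw corpus. A minimal sketch, assuming plain whitespace tokenization and no punctuation handling; word_frequencies is a name made up for this example.

```python
from collections import Counter


def word_frequencies(lines):
    """Count whitespace-separated tokens across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


sample = ["हे घर आहे", "ते घर नाही"]
freqs = word_frequencies(sample)
print(freqs.most_common(1))  # [('घर', 2)]
```

Passing an open file object instead of a list streams the corpus line by line, so a 3 GB file does not have to fit in memory. The most frequent entries are also a reasonable starting point for a stop word list.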
Hi @kaustubhn, I just saw your comment regarding a Marathi news corpus you have crawled. Is it publicly available somewhere? Ditto for Hindi?
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I think there are now quite a lot of Indian users of spaCy. Let's get started on the tokenizers :).
Hindi is whitespace-delimited, right? The docs for adding tokenizers can be found here: https://spacy.io/docs/usage/customizing-tokenizer
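As a rough illustration of the whitespace question: splitting on spaces alone leaves the sentence-final danda (।) attached to the last word, so it needs to be split off as punctuation. The regex below is not spaCy's tokenizer, just a standard-library sketch; simple_tokenize is a made-up name.

```python
import re

# Either a run of characters that are neither whitespace nor a danda,
# or a single danda (U+0964).
TOKEN_RE = re.compile(r"[^\s\u0964]+|\u0964")


def simple_tokenize(text):
    """Split text on whitespace and separate the danda from words."""
    return TOKEN_RE.findall(text)


print("यह एक वाक्य है\u0964".split())
# ['यह', 'एक', 'वाक्य', 'है।']
print(simple_tokenize("यह एक वाक्य है\u0964"))
# ['यह', 'एक', 'वाक्य', 'है', '।']
```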