Open ghost opened 6 years ago
See discussion in #315 and #238
Yes, it looks like there is not a whole lot of support for the Nepali language.
Popular fonts like Preeti and Kantipur still render squares or blocks. With an available Devanagari font I got some output, but I cannot find a full solution for Nepali (see the partial solution below).
Have you tried Noto? And what's the problem you're currently facing?
Yes, it prints the letters, but it does not print the words correctly, so the word cloud does not make much sense.
For instance, the output is not what I would expect from this text:
"""हिन्दू धर्मगुरु आचार्य श्रीनिवास पक्राउ परेका छन्। आफूले आफैंलाई गोली हान्न लगाएको अभियोगमा मोरङ प्रहरीको टोलीले श्रीनिवासलाई सोमबार काठमाडौंबाट पक्राउ गरेको हो।"""
If you can't give me more details on what the problem is, I won't be able to help you. So you're saying it's not splitting up the string into words correctly? You can provide your own regular expression. See #272 for a discussion on Thai. Also, I highly recommend using Python 3 for this. If you figure it out, I'd be happy to add more language-specific documentation or examples.
Your answer to #272 seems reasonable to me, and in fact this is not caused by wordcloud. I know that for a fact. It's a font issue.
Well, the current problem doesn't seem to be a font issue but a tokenization issue (I think), but I can't tell because I can't read the language.
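One way to tell a tokenization problem from a font problem is to run the regular expression on its own, without rendering anything. The sketch below uses only the standard library. The `\w`-based pattern is an assumption about what a default word pattern looks like, not a quote from the package source. It shows that Python's `re` does not count Devanagari vowel signs and the virama as `\w`, so words break at those marks.

```python
import re
import unicodedata

# Start of the sample sentence quoted earlier in this thread.
text = "हिन्दू धर्मगुरु आचार्य श्रीनिवास पक्राउ परेका छन्।"

# Assumption: a \w-based word pattern, similar in spirit to a default tokenizer.
default_like = re.findall(r"\w[\w']+", text)

# Combining marks (Unicode categories Mn and Mc) are not alphanumeric,
# so Python's re does not match them with \w.
marks = sorted({ch for ch in text if unicodedata.category(ch) in ("Mn", "Mc")})

print(default_like)
print([f"U+{ord(ch):04X}" for ch in marks])
```

If the printed tokens are fragments rather than whole words, the font is not the cause of the scrambled cloud, although missing glyphs (boxes) would still be a separate font problem.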
Since this is my first time trying to do this in my own mother tongue, I feel embarrassed about this tokenization problem. What are the best resources for learning about tokenization for foreign languages? I am sure you have been made aware of this issue before.
I don't know. Maybe look at https://nlp.stanford.edu/IR-book/ Also: "foreign" is not really the right word to use here. That would imply as opposed to a native language. I think neither your nor my native language is English, so I wouldn't consider other languages foreign (for me English is a foreign language).
Yes, you are probably right. I should have said "Nepali" specifically. Yes, English is my second language. :-)
@amueller Is this issue solved for the Devanagari script? Even after adding fonts, the words are not displayed; I am still getting rectangular boxes.
@SilentFlame that probably means the font doesn't contain the symbols. Try rendering with PIL/pillow directly.
@amueller Thanks, I tried the suggestion above and it works; I just had to write my own regex for words.
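The regex itself was not posted, so the following is only a minimal sketch of what such a pattern could look like, not the one actually used. It matches runs of characters in the Devanagari Unicode block and leaves out the danda and double danda (U+0964, U+0965) so that sentence punctuation does not stick to words. The name `DEVANAGARI_WORD` is hypothetical.

```python
import re
from collections import Counter

# Hypothetical pattern: Devanagari block U+0900-U+097F without the
# danda (U+0964) and double danda (U+0965) punctuation marks.
DEVANAGARI_WORD = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")

# Sample text quoted earlier in this thread.
text = (
    "हिन्दू धर्मगुरु आचार्य श्रीनिवास पक्राउ परेका छन्। "
    "आफूले आफैंलाई गोली हान्न लगाएको अभियोगमा मोरङ प्रहरीको टोलीले "
    "श्रीनिवासलाई सोमबार काठमाडौंबाट पक्राउ गरेको हो।"
)

# Count whole words; vowel signs and the virama stay attached to their word.
frequencies = Counter(DEVANAGARI_WORD.findall(text))

for word, count in frequencies.most_common(3):
    print(word, count)
```

The same pattern string can be passed as the `regexp` argument of `WordCloud`, or the counts can be handed to `generate_from_frequencies`, together with a `font_path` pointing to a font that contains Devanagari glyphs.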
Hello! I am having trouble using Bangla text in a word cloud. It is not tokenizing correctly: it breaks up complex words and all the vowel signs. I tried CTLK to tokenize, but still no luck.
@Shorotshishir you might need a custom regexp. Tokenization is unfortunately beyond the scope of this package. Let me know if you find a solution. spaCy might help.
Use a custom regex and a Bengali font. You can use something from Omicron Lab. Here's my code, but I am still facing an incorrect glyph placement problem.
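from wordcloud import WordCloud
import matplotlib.pyplot as plt

# customFontPath: path to a Bengali-capable font file; text: the input string
rgx = r"[\u0980-\u09FF]+"  # match runs of characters in the Bengali Unicode block
wordcloud = WordCloud(font_path=customFontPath, regexp=rgx).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()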
@riyadhrazzaq vai, did you solve the glyph placement problem?
I did not.
Description
Trying to create a word cloud in a foreign language.
I have documented it here:
https://stackoverflow.com/questions/50080183/word-cloud-or-visualization-in-foreign-languages
string1="""आजको छापा English Logo गृहपृष्ठ राजनीति समाज विचार किनमेल कला खेलकुद घुमफिर ब्लग साहित्यपाटी ग्लोबल फोटो ग्यालरी कस्तो छ प्रधानमन्त्रीको स्वास्थ्य? सरकारले सिण्डिकेट हटाएपछि देशैभरका टिकट काउन्टर बन्द उपेन्द्र यादवले फेरि दिए स\u200cंविधान नस्वीकारेको धम्की पौडेलले देउवालाई भने– प्रधानमन्त्री नभए पनि गणेशमानलाई जनताले पूज्छन्, तपाईंलाई कस्ले पुज्छ? ३३ किलो सुन गायब प्रकरण : यस्तो छ गोरे – प्रहरी ‘कनेक्सन’ चीनलाई उपहार दिने गैँडा फेला परेन काठमाडौंमा भारतका ३ पूर्वराजदूत गाउँ चम्किए, सदरमुकाम खस्किए यी हुन् मोबाइल नबोक्ने ‘ठूला मान्छे’ विगतको पोल खुल्ने डरले भगाइयो गोरेलाई गुराँस टिप्नेलाई ‘जंगलमै कारबाही’ नेपाल भ्रमणमा आफ्नै कार ल्याउँदैछन् मोदीले सिंहदरबारभित्र कोठा खोज्दै प्रधानमन्त्री कार्यालय डाक्टरले ‘भ्वाइस रेस्ट’ गर्न भनेका गच्छदार ३ घन्टा ५ मिनेट बोले, शुक्रबार थप १ घन्टा बोल्ने अभियुक्तसँग नाम थर मिल्दा निर्दोषलार्इ जेल सांसदहरूले व्यापार-व्यावसाय गर्न नपाउने सरकारको नीति तथा कार्यक्रम तयार, ८ प्रतिशतको आर्थिक वृद्धिको लक्ष्य स्वतन्त्र हुन सम्बन्धविच्छेद गर्ने क्रम बढ्यो मोदीको भ्रमण तालिका बनाउनै हम्मे बोली फेरिएन प्रधानमन्त्रीको: दुई बर्षपछि पनि उस्तै भाषण पञ्चायतदेखि नै\xa0सुन र शक्तिको सम्बन्ध! यी हुन् सुन तस्करीका ७ घुम्ती एमाले–माओवादीले १० हजार युवालाई मार्क्सवाद पढाउने साउदीमा नेपाली युवालाई मृत्युदण्डको फैसला सरकारी निकायले १३ अर्ब नतिर्दा गुठी थला ‘पूर्वी नेपाल भूकम्प उच्च जोखिममा’ सामुदायिक स्कुलमा पनि निजीजस्तै शुल्क सिंचाई विभागमा दिनहुँ चल्छ जुवातास गृहपृष्ठ ब्लग साहित्यपाटी पाठक विचार दसैं सामग्री छापाबाट फिड """
Steps/Code to Reproduce
wordcloud = WordCloud(max_font_size=300, background_color='white', relative_scaling=1, width=1500, height=1000, colormap='plasma').generate(string1)  # generate_from_frequencies(linklist)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Produces the cloud as attached. https://stackoverflow.com/questions/50080183/word-cloud-or-visualization-in-foreign-languages
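The `string1` sample above also contains a zero-width non-joiner (`\u200c`) inside a word and a no-break space (`\xa0`) between two words, which is common in scraped web text. As a rough sketch, and assuming a block-based pattern is used for tokenizing, the character class can admit the zero-width joiners so such words stay whole, while the no-break space still acts as a separator. The name `TOKEN` is hypothetical.

```python
import re

# Hypothetical pattern: Devanagari block without danda/double danda,
# plus ZWNJ (U+200C) and ZWJ (U+200D), which can occur inside words.
TOKEN = re.compile(r"[\u0900-\u0963\u0966-\u097F\u200C\u200D]+")

# Two fragments taken from the string1 sample above.
sample = "दिए स\u200cंविधान नस्वीकारेको धम्की पञ्चायतदेखि नै\xa0सुन र शक्तिको"

tokens = TOKEN.findall(sample)
print(tokens)
```

Whether to keep or strip the zero-width characters before counting is a judgment call; keeping them preserves the original spelling, but the same word written with and without them would then be counted separately.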