amueller / word_cloud

A little word cloud generator in Python
https://amueller.github.io/word_cloud
MIT License
10.09k stars 2.31k forks

word cloud in foreign language. #367

Open ghost opened 6 years ago

ghost commented 6 years ago

Description

I am trying to create a word cloud in a foreign language.

I have documented the problem here:

https://stackoverflow.com/questions/50080183/word-cloud-or-visualization-in-foreign-languages

string1="""आजको छापा English Logo गृहपृष्ठ राजनीति समाज विचार किनमेल कला खेलकुद घुमफिर ब्लग साहित्यपाटी ग्लोबल फोटो ग्यालरी कस्तो छ प्रधानमन्त्रीको स्वास्थ्य? सरकारले सिण्डिकेट हटाएपछि देशैभरका टिकट काउन्टर बन्द उपेन्द्र यादवले फेरि दिए स\u200cंविधान नस्वीकारेको धम्की पौडेलले देउवालाई भने– प्रधानमन्त्री नभए पनि गणेशमानलाई जनताले पूज्छन्, तपाईंलाई कस्ले पुज्छ? ३३ किलो सुन गायब प्रकरण : यस्तो छ गोरे – प्रहरी ‘कनेक्सन’ चीनलाई उपहार दिने गैँडा फेला परेन काठमाडौंमा भारतका ३ पूर्वराजदूत गाउँ चम्किए, सदरमुकाम खस्किए यी हुन् मोबाइल नबोक्ने ‘ठूला मान्छे’ विगतको पोल खुल्ने डरले भगाइयो गोरेलाई गुराँस टिप्नेलाई ‘जंगलमै कारबाही’ नेपाल भ्रमणमा आफ्नै कार ल्याउँदैछन् मोदीले सिंहदरबारभित्र कोठा खोज्दै प्रधानमन्त्री कार्यालय डाक्टरले ‘भ्वाइस रेस्ट’ गर्न भनेका गच्छदार ३ घन्टा ५ मिनेट बोले, शुक्रबार थप १ घन्टा बोल्ने अभियुक्तसँग नाम थर मिल्दा निर्दोषलार्इ जेल सांसदहरूले व्यापार-व्यावसाय गर्न नपाउने सरकारको नीति तथा कार्यक्रम तयार, ८ प्रतिशतको आर्थिक वृद्धिको लक्ष्य स्वतन्त्र हुन सम्बन्धविच्छेद गर्ने क्रम बढ्यो मोदीको भ्रमण तालिका बनाउनै हम्मे बोली फेरिएन प्रधानमन्त्रीको: दुई बर्षपछि पनि उस्तै भाषण पञ्चायतदेखि नै\xa0सुन र शक्तिको सम्बन्ध! यी हुन् सुन तस्करीका ७ घुम्ती एमाले–माओवादीले १० हजार युवालाई मार्क्सवाद पढाउने साउदीमा नेपाली युवालाई मृत्युदण्डको फैसला सरकारी निकायले १३ अर्ब नतिर्दा गुठी थला ‘पूर्वी नेपाल भूकम्प उच्च जोखिममा’ सामुदायिक स्कुलमा पनि निजीजस्तै शुल्क सिंचाई विभागमा दिनहुँ चल्छ जुवातास गृहपृष्ठ ब्लग साहित्यपाटी पाठक विचार दसैं सामग्री छापाबाट फिड """

Steps/Code to Reproduce

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud = WordCloud(max_font_size=300, background_color='white', relative_scaling=1, width=1500, height=1000, colormap='plasma').generate(string1)  # alternative: generate_from_frequencies(linklist)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Produces the cloud as attached. https://stackoverflow.com/questions/50080183/word-cloud-or-visualization-in-foreign-languages

amueller commented 6 years ago

See discussion in #315 and #238

ghost commented 6 years ago

Yes, it looks like there isn't a whole lot of support for the Nepali language.

Using popular fonts like Preeti, Kantipur, etc. still generates squares or blocks. However, when I used an available Devanagari font, it rendered some of the text, but I cannot find a full solution for Nepali (see the partial solution below).

https://stackoverflow.com/questions/50080183/word-cloud-or-visualization-in-foreign-languages/50081321#50081321

amueller commented 6 years ago

Have you tried Noto? And what's the problem you're currently facing?

kneupaneRecordedBooks commented 6 years ago

Yes, it prints the letters, but it does not print the words correctly. The word cloud does not make a whole lot of sense.

For instance, this output (image) is not what I expect from this text:

"""हिन्दू धर्मगुरु आचार्य श्रीनिवास पक्राउ परेका छन्। आफूले आफैंलाई गोली हान्न लगाएको अभियोगमा मोरङ प्रहरीको टोलीले श्रीनिवासलाई सोमबार काठमाडौंबाट पक्राउ गरेको हो।"""

amueller commented 6 years ago

If you can't give me more details on what the problem is, I won't be able to help you. So you're saying it's not splitting up the string into words correctly? You can provide your own regular expression. See #272 for a discussion on Thai. Also, I highly recommend using Python 3 for this. If you figure it out, I'd be happy to add more language-specific documentation or examples.
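
A minimal sketch of the custom regular expression route suggested above, assuming the input is Devanagari (Nepali) text. Python's built-in `re` module treats `\w` as alphanumeric characters plus the underscore, and Devanagari vowel signs and the virama are combining marks rather than letters, so a `\w`-based pattern can break a word at every sign. The names `DEVANAGARI_WORD` and `tokenize` are hypothetical, and the character ranges are an assumption: the Devanagari block without the danda punctuation marks, plus the zero-width joiners that can occur inside a word.

```python
import re

# Assumption: Devanagari block U+0900-U+097F, skipping the danda and double
# danda (U+0964, U+0965), which are punctuation. ZWNJ and ZWJ (U+200C, U+200D)
# are allowed inside a word.
DEVANAGARI_WORD = r"[\u0900-\u0963\u0966-\u097F\u200C\u200D]+"


def tokenize(text, pattern=DEVANAGARI_WORD):
    """Return every run of characters matched by `pattern` (hypothetical helper)."""
    return re.findall(pattern, text)
```

Comparing `re.findall(r"\w+", sample)` with `tokenize(sample)` on a short Nepali sentence shows whether the default-style pattern is the cause of the broken words. The same pattern string can then be passed to the library as `WordCloud(regexp=DEVANAGARI_WORD, font_path=...)`, with `font_path` pointing at a font that covers Devanagari.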

kneupaneRecordedBooks commented 6 years ago

Your answer to #272 seems reasonable to me, and in fact the problem is not caused by wordcloud. I know that for a fact. It's a font issue.

amueller commented 6 years ago

Well, the current problem doesn't seem to be a font issue but a tokenization issue (I think), but I can't tell because I can't read the language.

kneupaneRecordedBooks commented 6 years ago

This is my first time trying to implement this in my own mother tongue, so I feel quite embarrassed about this tokenization problem. What are the best resources for learning about tokenization for foreign languages? I am sure you have come across this question before.

amueller commented 6 years ago

I don't know. Maybe look at https://nlp.stanford.edu/IR-book/ Also: "foreign" is not really the right word to use here. That would imply as opposed to a native language. I think neither your nor my native language is English, so I wouldn't consider other languages foreign (for me English is a foreign language).

kneupaneRecordedBooks commented 6 years ago

Yes, you are probably right. I should have said "Nepali" specifically. Yes, English is my second language. :-)

SilentFlame commented 6 years ago

@amueller Is this issue solved for the Devanagari script? Even after adding fonts, I can't get the words to display; I'm still getting rectangular boxes.

amueller commented 6 years ago

@SilentFlame that probably means the font doesn't contain the symbols. Try rendering with PIL/pillow directly.
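
A rough sketch of what rendering with Pillow directly could look like, to check whether a given font file actually contains the needed glyphs. The helper names `describe_characters` and `render_with_pillow` are hypothetical, and `font_path` is whatever font file is being tested (for example a Noto Devanagari file; the exact path is an assumption left to the reader).

```python
import unicodedata


def describe_characters(text):
    """Return (character, Unicode name) pairs for the distinct non-space
    characters in `text`, i.e. the code points the font has to cover."""
    chars = sorted({ch for ch in text if not ch.isspace()})
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in chars]


def render_with_pillow(text, font_path, out_path, size=64):
    """Draw `text` with the given font file and save the result as an image.

    Boxes in the saved image suggest the font lacks those glyphs.
    """
    # Imported lazily so the helper above works without Pillow installed.
    from PIL import Image, ImageDraw, ImageFont

    font = ImageFont.truetype(font_path, size)
    # Generous canvas: one font-size-wide cell per character, two lines high.
    image = Image.new("RGB", (size * max(len(text), 1), size * 2), "white")
    ImageDraw.Draw(image).text((10, 10), text, font=font, fill="black")
    image.save(out_path)
    return image
```

If the saved image shows boxes for a word while `describe_characters` lists ordinary Devanagari letters and signs, the font is the likely culprit rather than wordcloud.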

SilentFlame commented 6 years ago

@amueller thanks, I tried with the suggestion above and it works, just that I had to write my own regex for words.

Shorotshishir commented 5 years ago

Hello! I am having trouble using Bangla text in a word cloud. It is not tokenizing correctly. It is breaking up the complex words and all the vowel signs. I tried CTLK to tokenize, but still no luck.

amueller commented 5 years ago

@Shorotshishir you might need a custom regexp. Tokenization is unfortunately beyond the scope of this package. Let me know if you find a solution. spaCy might help.
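
Since tokenization is outside the package's scope, one option is to tokenize separately and hand the library word counts only. The sketch below assumes Bengali text and uses a plain Unicode-range regex as a stand-in tokenizer; `regex_tokenize`, `word_frequencies` and `cloud_from_frequencies` are hypothetical helper names, and any other tokenizer that returns a list of strings could be swapped in.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of characters from the Bengali block.
BENGALI_WORD = re.compile(r"[\u0980-\u09FF]+")


def regex_tokenize(text):
    """Stand-in tokenizer: split on anything outside the Bengali block."""
    return BENGALI_WORD.findall(text)


def word_frequencies(text, tokenize=regex_tokenize):
    """Count the tokens produced by `tokenize`, which can be any callable
    that takes a string and returns a list of strings."""
    return Counter(tokenize(text))


def cloud_from_frequencies(frequencies, font_path):
    """Build a word cloud from precomputed counts, skipping the built-in
    regexp-based tokenization."""
    # Imported lazily; requires the wordcloud package to be installed.
    from wordcloud import WordCloud

    return WordCloud(font_path=font_path).generate_from_frequencies(frequencies)
```

Inspecting the counts before building the cloud makes it easy to see whether words are being split at vowel signs.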

riyadhrazzaq commented 4 years ago

> Hello! I am having trouble using Bangla text in a word cloud. It is not tokenizing correctly. It is breaking up the complex words and all the vowel signs. I tried CTLK to tokenize, but still no luck.

Use a custom regex and a Bengali font. You can use something from Omicron Lab. Here's my code, but I am still facing an incorrect glyph placement problem.

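from wordcloud import WordCloud
import matplotlib.pyplot as plt

rgx = r"[\u0980-\u09FF]+"  # one or more characters from the Bengali Unicode block
# customFontPath: path to a Bengali-capable font file; text: the input string
wordcloud = WordCloud(font_path=customFontPath, regexp=rgx).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

fuad021 commented 4 years ago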

@riyadhrazzaq vai, did you solve the glyph placement problem?

riyadhrazzaq commented 4 years ago

> @riyadhrazzaq vai, did you solve the glyph placement problem?

I did not.
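
The thread does not establish the cause of the glyph placement problem. One thing that may be worth checking, as an assumption rather than a confirmed diagnosis, is whether the installed Pillow build includes the Raqm library, which Pillow uses for complex text layout in scripts such as Bengali and Devanagari. The helper name `raqm_available` is hypothetical; `PIL.features.check` is Pillow's own feature query.

```python
def raqm_available():
    """Return True or False for Raqm support in Pillow, or None if Pillow
    is not installed."""
    try:
        from PIL import features
    except ImportError:
        return None
    return bool(features.check("raqm"))
```

If this returns `False`, Pillow falls back to its basic layout, which places glyphs one after another without script-aware shaping.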