AutoViML / AutoViz

Automatically Visualize any dataset, any size with a single line of code. Created by Ram Seshadri. Collaborators Welcome. Permission Granted upon Request.
Apache License 2.0

Suggesting Updates for Wordcloud #57

Closed chekoduadarsh closed 2 years ago

chekoduadarsh commented 2 years ago

1. Updating Stopwords List

Currently, the stop words are defined as a hard-coded list, and that list is missing a few stop words such as "themselves".


def return_stop_words():
    # Base list plus extra words; duplicates are removed by the set() below.
    STOP_WORDS = ['it', 'this', 'that', 'to', 'its', 'am', 'is', 'are', 'was', 'were', 'a',
                  'an', 'the', 'and', 'or', 'of', 'at', 'by', 'for', 'with', 'about', 'between',
                  'into', 'above', 'below', 'from', 'up', 'down', 'in', 'out', 'on', 'over',
                  'under', 'again', 'further', 'then', 'once', 'all', 'any', 'both', 'each',
                  'few', 'more', 'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so',
                  'than', 'too', 'very', 's', 't', 'can', 'just', 'd', 'll', 'm', 'o', 're',
                  've', 'y', 'ain', 'ma']
    add_words = ['s', 'm', 'you', 'not', 'get', 'no', 'via', 'one', 'still', 'us', 'u', 'hey', 'hi', 'oh', 'jeez',
                 'the', 'a', 'in', 'to', 'of', 'i', 'and', 'is', 'for', 'on', 'it', 'got', 'aww', 'awww',
                 'not', 'my', 'that', 'by', 'with', 'are', 'at', 'this', 'from', 'be', 'have', 'was',
                 '', ' ', 'say', 's', 'u', 'ap', 'afp', '...', 'n', '\\']
    stop_words = list(set(STOP_WORDS + add_words))
    return sorted(stop_words)

Isn't it better to use NLTK stop words list??

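from nltk.corpus import stopwords

# Requires a one-time download: nltk.download("stopwords")
langs = ["english"]  # example value; use any languages NLTK provides
stop_words = set()
for lang in langs:
    stop_words.update(stopwords.words(lang))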

Copied from: https://gist.github.com/sebleier/554280
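The two approaches do not have to be exclusive. Below is a minimal sketch, assuming a hypothetical helper `merge_stop_words` (it is not part of AutoViz) that unions a hand-curated list with any external list, such as the one returned by `stopwords.words("english")`. Short literal lists stand in for both sources so the example runs without downloading any corpora.

```python
def merge_stop_words(custom_words, external_words):
    """Union two stop-word collections, lowercased, deduplicated, and sorted.

    Hypothetical helper for illustration only; not part of AutoViz.
    """
    merged = {w.lower() for w in list(custom_words) + list(external_words)}
    return sorted(merged)


# Stand-in for a hand-curated list (a small excerpt of the list above).
custom = ["it", "this", "that", "hey", "aww", "the"]
# Stand-in for an external list, e.g. the result of stopwords.words("english").
external = ["the", "themselves", "ourselves", "It"]

combined = merge_stop_words(custom, external)
print(combined)
```

Keeping the curated words and adding the external ones means domain-specific entries such as "hey" or "aww" survive, while gaps such as "themselves" are filled.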

2. Lemmatization before plotting

I think it is better if we lemmatize the data before we plot. Then words like "reads" and "reading" will be counted as the same word, which will give us a better word cloud.
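To illustrate the effect on word counts, here is a minimal sketch. The function `crude_normalize` is a hypothetical toy suffix stripper that stands in for a real lemmatizer (for example NLTK's `WordNetLemmatizer`) so that the example runs without extra downloads. It is not how AutoViz implements this.

```python
from collections import Counter


def crude_normalize(word):
    """Toy suffix stripper standing in for a real lemmatizer (illustration only)."""
    word = word.lower()
    for suffix in ("ing", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = ["reads", "reading", "read", "books", "book"]

raw_counts = Counter(tokens)                                   # each surface form counted separately
normalized_counts = Counter(crude_normalize(t) for t in tokens)  # inflected forms merged

print(raw_counts)
print(normalized_counts)
```

With a real lemmatizer the part of speech matters: NLTK's `WordNetLemmatizer.lemmatize` treats words as nouns by default, so "reading" is only reduced to "read" when `pos="v"` is passed.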

AutoViML commented 2 years ago

Hi @chekoduadarsh 👍 Thanks for your comments and inputs. They are very interesting.

"Isn't it better to use NLTK stop words list??"

Sorry, no. I found that it is mostly useless. That's why I created my own list by taking some words from it and adding others from my own experience. I will add a couple of the words you suggested, however.

"I think it is better if we lemmatize the data"

Yes. Let me work on it. Look for it in the next version. Just run pip install autoviz --upgrade in the next day or so. Thanks again, AutoViz

chekoduadarsh commented 2 years ago

@AutoViML, thank you.

Let me know if you need support from me on the second point. I will be happy to open a PR.

AutoViML commented 2 years ago

@chekoduadarsh 👍 Thank you for your offer. Maybe next time you can make a PR. I have already committed the change. Can you please test it? Just upgrade... Thanks, AutoViML

chekoduadarsh commented 2 years ago

OK, thank you. I will upgrade autoviz.