AutoViML / AutoViz

Automatically Visualize any dataset, any size with a single line of code. Created by Ram Seshadri. Collaborators Welcome. Permission Granted upon Request.
Apache License 2.0

Is there any way to generate a wordcloud in another language? #111

Closed jjlee99 closed 2 months ago

jjlee99 commented 3 months ago

The problem

I ran the Titanic dataset through AutoViz and confirmed that the WordCloud works properly. However, the dataset I actually want to use is not in English, and I suspect that is the problem. Am I right, or is there another issue? If it helps, I can share part of the dataset.
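For context on why non-English text can end in "got 0 words": word extraction that uses an ASCII-only pattern drops every CJK token, while a Unicode-aware pattern keeps them. A minimal stdlib sketch of that difference (the Korean sample text is hypothetical, and whether AutoViz's tokenizer is the actual culprit here is an assumption):

```python
import re
from collections import Counter

text = "도시가스 사용량 분석 도시가스 요금"  # hypothetical Korean sample

# An ASCII-only pattern finds no words at all in Korean text...
ascii_words = re.findall(r"[A-Za-z']+", text)
# ...while \w is Unicode-aware in Python 3 and keeps the Korean tokens.
unicode_words = re.findall(r"\w[\w']*", text)

print(len(ascii_words))                      # -> 0
print(Counter(unicode_words).most_common(1))  # -> [('도시가스', 2)]
```

Even when tokenization succeeds, rendering still needs a font that contains the glyphs; the `wordcloud` package's bundled default font has no CJK glyphs, so its `font_path` argument must point to a suitable font.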

The errors


> > Since nrows is smaller than dataset, loading random sample of 150000 rows into pandas...
> > Shape of your Data Set loaded: (150000, 6)
> > #######################################################################################
> > ######################## C L A S S I F Y I N G  V A R I A B L E S  ####################
> > #######################################################################################
> > Classifying variables in data set...
> >   Printing up to 30 columns (max) in each category:
> >     Numeric Columns : []
> >     Integer-Categorical Columns: []
> >     String-Categorical Columns: ['industry_citygas_usage_mtrd_meter_yearmonth', 'business_status_business_sector_name']
> >     Factor-Categorical Columns: []
> >     String-Boolean Columns: []
> >     Numeric-Boolean Columns: []
> >     Discrete String Columns: ['business_status_build_address', 'business_status_business_category_name']
> >     NLP text Columns: ['industry_citygas_usage_mtrd_business_registration_no']
> >     Date Time Columns: []
> >     ID Columns: []
> >     Columns that will not be considered in modeling: []
> >     5 Predictors classified...
> >         No variables removed since no ID or low-information variables found in data set
> > Since Number of Rows in data 150000 exceeds maximum, randomly sampling 150000 rows for EDA...
> > 
> > ################ Regression problem #####################
> >    Columns to delete:
> > '   []'
> >    Boolean variables %s 
> > '   []'
> >    Categorical variables %s 
> > ("   ['industry_citygas_usage_mtrd_meter_yearmonth', "
> >  "'business_status_business_sector_name']")
> >    Continuous variables %s 
> > '   []'
> >    Discrete string variables %s 
> > ("   ['industry_citygas_usage_mtrd_business_registration_no', "
> >  "'business_status_build_address', 'business_status_business_category_name']")
> >    Date and time variables %s 
> > '   []'
> >    ID variables %s 
> > '   []'
> >    Target variable %s 
> > '   industry_citygas_usage_mtrd_usage_quantity'
> > To fix these data quality issues in the dataset, import FixDQ from autoviz...
> > There are 5287 duplicate rows in your dataset
> >     Alert: Dropping duplicate rows can sometimes cause your column data types to change to object!
> >     All variables classified into correct types.
> 
> 
| Data Type | Missing Values% | Unique Values% | Minimum Value | Maximum Value | DQ Issue |
|---|---|---|---|---|---|
| object | 0.000000 | 0 |  |  | Possible high cardinality column with 138 unique values: Use hash encoding or text embedding to reduce dimension. |
| object | 0.000000 | 6 |  |  | No issue |
| object | 0.000000 | 0 |  |  | No issue |
| object | 0.000000 | 0 |  |  | 38 rare categories: Too many to list. Group them into a single category or drop the categories. |
| object | 0.000000 | 0 |  |  | Possible high cardinality column with 464 unique values: Use hash encoding or text embedding to reduce dimension. |
| int64 | 0.000000 | 15 | 0.000000 | 1942695.000000 | Target column |
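The DQ report above suggests hash encoding for the high-cardinality string columns. A minimal sketch of stable hash bucketing (the column values and bucket count are hypothetical, not taken from the dataset):

```python
import hashlib

import pandas as pd

def hash_bucket(value: str, n_buckets: int = 32) -> int:
    """Map a string to a stable bucket id; md5 keeps it deterministic across runs."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical stand-in for a high-cardinality string column.
df = pd.DataFrame({"category": ["음식점업", "제조업", "음식점업"]})
df["category_hashed"] = df["category"].map(hash_bucket)
```

Identical strings always land in the same bucket; occasional collisions between different strings are the price of the fixed dimensionality.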

No categorical or boolean vars in data set. Hence no pivot plots...
Could not draw catscatter plots... stat: path should be string, bytes, os.PathLike or integer, not NoneType
[nltk_data] Downloading collection 'popular'
[nltk_data]    |
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Downloading package shakespeare to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package shakespeare is already up-to-date!
[nltk_data]    | Downloading package stopwords to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package stopwords is already up-to-date!
[nltk_data]    | Downloading package treebank to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package treebank is already up-to-date!
[nltk_data]    | Downloading package twitter_samples to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package twitter_samples is already up-to-date!
[nltk_data]    | Downloading package omw to /home/jovyan/nltk_data...
[nltk_data]    |   Package omw is already up-to-date!
[nltk_data]    | Downloading package omw-1.4 to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package omw-1.4 is already up-to-date!
[nltk_data]    | Downloading package wordnet to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package wordnet is already up-to-date!
[nltk_data]    | Downloading package wordnet2021 to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package wordnet2021 is already up-to-date!
[nltk_data]    | Downloading package wordnet31 to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package wordnet31 is already up-to-date!
[nltk_data]    | Downloading package wordnet_ic to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package wordnet_ic is already up-to-date!
[nltk_data]    | Downloading package words to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package words is already up-to-date!
[nltk_data]    | Downloading package maxent_ne_chunker to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package maxent_ne_chunker is already up-to-date!
[nltk_data]    | Downloading package punkt to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package punkt is already up-to-date!
[nltk_data]    | Downloading package snowball_data to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package snowball_data is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    |
[nltk_data]  Done downloading collection popular
Could not draw wordcloud plot for industry_citygas_usage_mtrd_business_registration_no. We need at least 1 word to plot a word cloud, got 0.
Could not draw wordcloud plot for business_status_build_address. We need at least 1 word to plot a word cloud, got 0.
Could not draw wordcloud plot for business_status_business_category_name. We need at least 1 word to plot a word cloud, got 0.
All Plots are saved in ./result/dark/industry_citygas_usage_mtrd_usage_quantity

How to reproduce this error:

%matplotlib inline
from autoviz import AutoViz_Class

# instantiate AutoViz
AV = AutoViz_Class()

# my save path
save_plot_dir = "./result/dark"

# run AutoViz on the in-memory dataframe df3
dft = AV.AutoViz(
    filename="",
    sep=",",
    depVar="industry_citygas_usage_mtrd_usage_quantity",
    dfte=df3,
    header=0,
    verbose=2,
    lowess=True,
    chart_format="png",
    max_rows_analyzed=150000,
    max_cols_analyzed=30,
    save_plot_dir=save_plot_dir
)
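Before invoking AutoViz, a quick sanity check that the NLP columns actually yield tokens can narrow the problem down. A sketch of such a pre-check (`df3` here is a small hypothetical stand-in, not the real dataset):

```python
import pandas as pd

# Hypothetical stand-in for df3 with one Korean text column.
df3 = pd.DataFrame({
    "business_status_business_category_name": ["음식점업 일반", "제조업", "음식점업 일반"],
})

# Count whitespace-separated tokens in the text column; zero tokens here
# means the wordcloud step will also have nothing to draw.
tokens = (
    df3["business_status_business_category_name"]
    .astype(str)
    .str.split()
    .explode()
)
print(tokens.value_counts().head())
```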
AutoViML commented 3 months ago

Hi, thanks. I would like you to attach a zip file with a short version of the dataset so I can test it and tell you what the problem is. Also, I need to know whether you are using the latest version of AutoViz; if not, tell me the version number.
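Since the maintainer asks for the version number, one way to look it up (a stdlib sketch; assumes AutoViz was installed as the pip package `autoviz`):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(pkg: str) -> str:
    """Return the installed version of a pip package, or 'not installed'."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return "not installed"

print(installed_version("autoviz"))
```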