esthicodes / Awesome-Swiss-German

Multi-language text analysis in 26 cantonal Swiss German dialects, German, Italian, French, and Chinese (simplified). Apply natural language understanding (NLU) to applications with features including sentiment analysis, entity analysis, entity sentiment analysis, content classification, and syntax analysis.
MIT License

Topic Classification #48

Open esthicodes opened 2 years ago

esthicodes commented 2 years ago

Input: News Headline

Output: Classification of the news category. For example, text classification datasets are used to categorize natural language texts according to content: news articles by topic, or book reviews by positive or negative sentiment. Language detection, organizing customer feedback, and fraud detection also commonly rely on text classification.

Automation with machine learning models.

Category classification, for news, is a multi-label text classification problem. The goal is to assign one or more categories to a news article. A standard technique in multi-label text classification is to use a set of binary classifiers.
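A minimal sketch of that binary-classifier approach, using scikit-learn's `OneVsRestClassifier` (which fits one binary classifier per label). The toy headlines and labels below are invented for illustration and do not come from this dataset:

```python
# Multi-label tagging: one binary classifier per category.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

headlines = [
    "Central bank raises interest rates again",
    "Star striker signs record transfer deal",
    "New smartphone chip doubles battery life",
    "Parliament debates tech regulation bill",
]
labels = [["business"], ["sport"], ["tech"], ["politics", "tech"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)          # one binary column per category
X = TfidfVectorizer().fit_transform(headlines)

clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)
pred = clf.predict(X)                  # 0/1 matrix, one column per label
```

Each column of `pred` is the output of an independent binary classifier, so an article can receive zero, one, or several categories.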

esthicodes commented 2 years ago

AutoCrawler

Google, Naver multiprocess image crawler (High Quality & Speed & Customizable)

How to use

  1. Install Chrome

  2. pip install -r requirements.txt

  3. Write search keywords in keywords.txt

  4. Run "main.py"

  5. Files will be downloaded to 'download' directory.

Arguments

usage:

```
python3 main.py [--skip true] [--threads 4] [--google true] [--naver true] [--full false] [--face false] [--no_gui auto] [--limit 0]

--skip true        Skip a keyword if its download directory already exists. Useful when re-downloading.
--threads 4        Number of download threads.
--google true      Download from google.com (boolean)
--naver true       Download from naver.com (boolean)
--full false       Download full-resolution images instead of thumbnails (slow)
--face false       Face search mode
--no_gui auto      No-GUI (headless) mode. Speeds up full-resolution mode but is unstable in thumbnail mode.
                   Default: "auto" - false if full=false, true if full=true
                   (useful on Docker Linux systems)
--limit 0          Maximum number of images to download per site (0: unlimited)
--proxy-list ''    Comma-separated proxy list, e.g. "socks://127.0.0.1:1080,http://127.0.0.1:1081".
                   Each thread randomly chooses one from the list.
```

Full Resolution Mode

You can download full-resolution JPG, GIF, and PNG images by specifying --full true.

Data Imbalance Detection

Detects data imbalance based on the number of files per keyword directory.

When crawling ends, a message lists the directories that contain fewer than 50% of the average file count.

It is recommended to remove those directories and re-download them.
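The imbalance check described above could be sketched as follows; the `find_imbalanced` helper and the flat keyword-directory layout are assumptions for illustration, not the crawler's actual code:

```python
# Flag keyword directories holding fewer than `threshold` (default 50%)
# of the average file count across all keyword directories.
import os

def find_imbalanced(download_dir, threshold=0.5):
    counts = {
        d: len(os.listdir(os.path.join(download_dir, d)))
        for d in os.listdir(download_dir)
        if os.path.isdir(os.path.join(download_dir, d))
    }
    if not counts:
        return []
    avg = sum(counts.values()) / len(counts)
    return [d for d, n in counts.items() if n < avg * threshold]
```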

Remote crawling through SSH on your server

```shell
sudo apt-get install xvfb    # virtual display
sudo apt-get install screen  # lets you close the SSH terminal while running
screen -S s1
Xvfb :99 -ac & DISPLAY=:99 python3 main.py
```

Customize

You can make your own crawler by changing collect_links.py

Issues

As the Google site changes frequently, please open an issue if the crawler stops working.

esthicodes commented 2 years ago

Text Classification of News Articles

  1. [ ] Know the data. For the task of news classification with machine learning, I collected a dataset from Kaggle that contains news articles along with their headlines and categories.

Data Fields

- Article Id – unique id given to the article record
- Text – text of the headline and article body
- Category – category of the article (tech, business, sport, entertainment, politics)

3. Data Cleaning and Data Preprocessing

Data preprocessing is the process of transforming raw data into an understandable format. It is also an important step in data mining as we cannot work with raw data. The quality of the data should be checked before applying machine learning or data mining algorithms.
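The cleaning steps applied later (tag removal, special characters, lowercasing, stopword removal) can be sketched as below; the `clean_text` helper and the stub stopword list are illustrative stand-ins, not the notebook's actual code:

```python
# Minimal text-cleaning sketch: strip HTML tags, drop special
# characters, lowercase, and remove stopwords.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def clean_text(text):
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # drop special characters / digits
    words = text.lower().split()
    return " ".join(w for w in words if w not in STOPWORDS)
```

In the full pipeline, NLTK's `stopwords.words('english')` would replace the stub set, and stemming or lemmatization would follow.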

4. Import Libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import make_scorer, roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
```

5. Import Dataset
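A loading sketch: the notebook presumably calls `pd.read_csv` on the Kaggle file, whose exact filename is not given here, so this stand-in builds a tiny in-memory CSV with the same columns (the sample rows are invented):

```python
import io
import pandas as pd

# Stand-in for pd.read_csv("<kaggle csv file>") with the dataset's columns.
csv_data = io.StringIO(
    "ArticleId,Text,Category\n"
    "1,shares climb on earnings news,business\n"
    "2,team wins league title,sport\n"
)
dataset = pd.read_csv(csv_data)
```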

6. Shape of Dataset

```python
dataset.shape
```

7. Check Information of Columns of Dataset

```python
dataset.info()
```

8. Count Values of Categories

```python
dataset['Category'].value_counts()
```

9. Convert Category Names into a Numerical Index

```python
# Associate category names with a numerical index and save it in a new column CategoryId
target_category = dataset['Category'].unique()
print(target_category)
```


```python
# convert categories
dataset['CategoryId'] = dataset['Category'].factorize()[0]
dataset.head()
```


10. Show Category Name w.r.t. Category ID

Here you can see each news category's name with respect to its unique category ID.

```python
# Create a new dataframe "category" containing only the unique categories, sorted by CategoryId
category = dataset[['Category', 'CategoryId']].drop_duplicates().sort_values('CategoryId')
category
```


Exploratory Data Analysis (EDA)

In data mining, Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. EDA is used for seeing what the data can tell us before the modeling task. It is not easy to look at a column of numbers or a whole spreadsheet and determine important characteristics of the data. It may be tedious, boring, and/or overwhelming to derive insights by looking at plain numbers. Exploratory data analysis techniques have been devised as an aid in this situation.

Visualizing Data

The graph below shows the news article count per category in our dataset.

```python
dataset.groupby('Category').CategoryId.value_counts().plot(
    kind="bar", color=["pink", "orange", "red", "yellow", "blue"])
plt.xlabel("Category of data")
plt.title("Number of articles per category")
plt.show()
```


```python
fig = plt.figure(figsize=(5, 5))
colors = ["skyblue"]
business = dataset[dataset['CategoryId'] == 0]
tech = dataset[dataset['CategoryId'] == 1]
politics = dataset[dataset['CategoryId'] == 2]
sport = dataset[dataset['CategoryId'] == 3]
entertainment = dataset[dataset['CategoryId'] == 4]
count = [business['CategoryId'].count(), tech['CategoryId'].count(),
         politics['CategoryId'].count(), sport['CategoryId'].count(),
         entertainment['CategoryId'].count()]
pie = plt.pie(count,
              labels=['business', 'tech', 'politics', 'sport', 'entertainment'],
              autopct="%1.1f%%", shadow=True, colors=colors,
              startangle=45, explode=(0.05, 0.05, 0.05, 0.05, 0.05))
```


11. Visualizing Category-Related Words

Here we use the word cloud module to show the category-related words.

Word Cloud is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance. Significant textual data points can be highlighted using a word cloud. Word clouds are widely used for analyzing data from social network websites.

```python
from wordcloud import WordCloud

stop = set(stopwords.words('english'))

business = dataset[dataset['CategoryId'] == 0]['Text']
tech = dataset[dataset['CategoryId'] == 1]['Text']
politics = dataset[dataset['CategoryId'] == 2]['Text']
sport = dataset[dataset['CategoryId'] == 3]['Text']
entertainment = dataset[dataset['CategoryId'] == 4]['Text']

def wordcloud_draw(dataset, color='white'):
    words = ' '.join(dataset)
    cleaned_word = ' '.join([word for word in words.split()
                             if (word != 'news' and word != 'text')])
    wordcloud = WordCloud(stopwords=stop,
                          background_color=color,
                          width=2500, height=2500).generate(cleaned_word)
    plt.figure(1, figsize=(10, 7))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

print("business related words:")
wordcloud_draw(business, 'white')
print("tech related words:")
wordcloud_draw(tech, 'white')
print("politics related words:")
wordcloud_draw(politics, 'white')
print("sport related words:")
wordcloud_draw(sport, 'white')
print("entertainment related words:")
wordcloud_draw(entertainment, 'white')
```

Remaining steps:

- Show the Text column of the dataset
- Show the Category column of the dataset
- Remove all tags
- Remove special characters
- Convert everything to lower case
- Remove all stopwords
- Lemmatize the words
- Inspect the dataset after cleaning the text
- Declare the dependent and independent variables
- Create and fit a Bag of Words model
- Train/test split the dataset
- Create an empty list (to collect model scores)
- Create, fit, and predict with all ML models:
  - Logistic Regression
  - Multinomial Naive Bayes
  - Support Vector Machine
  - Decision Tree
  - KNN
  - Gaussian Naive Bayes
- Pick the model with the best accuracy score
- Fit & predict with the best ML model
- Predict a news article
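The workflow above can be sketched end to end on invented toy data (two of the listed models shown; the texts, labels, and scores here are illustrative only, not from the Kaggle set):

```python
# Vectorize, split, fit two candidate models, keep the more accurate
# one, refit it on all data, and classify a headline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

texts = [
    "shares fall as profits shrink", "bank lifts interest rates",
    "markets rally on earnings news", "striker scores winning goal",
    "team clinches league title", "coach praises midfield play",
]
labels = ["business", "business", "business", "sport", "sport", "sport"]

vec = CountVectorizer()
X = vec.fit_transform(texts)                      # Bag of Words features
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42)

models = [LogisticRegression(max_iter=1000), MultinomialNB()]
best = max(models, key=lambda m: accuracy_score(
    y_test, m.fit(X_train, y_train).predict(X_test)))

best.fit(X, labels)                               # refit best model on all data
pred = best.predict(vec.transform(["striker scores winning goal"]))
```

The same shape scales to the five-category dataset: swap in the cleaned `Text` column, the `CategoryId` targets, and the full model list.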