Google, Naver multiprocess image crawler (High Quality & Speed & Customizable)
Install Chrome
pip install -r requirements.txt
Write search keywords in keywords.txt
Run "main.py"
Files will be downloaded to the 'download' directory.
usage:
python3 main.py [--skip true] [--threads 4] [--google true] [--naver true] [--full false] [--face false] [--no_gui auto] [--limit 0]
--skip true        Skip a keyword if its download directory already exists. Useful when resuming an interrupted run.
--threads 4        Number of download threads.
--google true      Download from google.com (boolean).
--naver true       Download from naver.com (boolean).
--full false       Download full-resolution images instead of thumbnails (slow).
--face false       Face search mode.
--no_gui auto      No-GUI (headless) mode. Speeds up full-resolution mode, but is unstable in thumbnail mode.
                   Default "auto" means false if full=false, true if full=true.
                   (Useful on headless Linux systems such as Docker.)
--limit 0          Maximum number of images to download per site (0: unlimited).
--proxy-list ''    Comma-separated proxy list, e.g. "socks://127.0.0.1:1080,http://127.0.0.1:1081".
                   Each thread randomly chooses one proxy from the list.
You can download full-resolution JPG, GIF, and PNG images by specifying --full true
The crawler detects data imbalance based on file counts. When crawling ends, a message lists the directories that contain fewer than 50% of the average number of files. It is recommended to remove those directories and re-download them.
sudo apt-get install xvfb <- virtual display
sudo apt-get install screen <- lets you keep the job running after closing the SSH terminal
screen -S s1
Xvfb :99 -ac & DISPLAY=:99 python3 main.py
You can make your own crawler by changing collect_links.py
Since Google's site changes frequently, please open an issue if the crawler stops working.
Data Fields
Article Id – Unique ID given to each record
Article – Text of the headline and article body
Category – Category of the article (tech, business, sport, entertainment, politics)
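As a minimal, hypothetical stand-in for loading the dataset (the real data would come from a CSV file; the column names `Text` and `Category` follow the code used later in this notebook):

```python
import pandas as pd

# Miniature, made-up version of the news dataset, for illustration only;
# the real one would be loaded with pd.read_csv().
dataset = pd.DataFrame({
    "ArticleId": [1, 2, 3],
    "Text": [
        "shares rise as quarterly profits beat forecasts",
        "new smartphone chip doubles battery life",
        "striker scores twice in cup final win",
    ],
    "Category": ["business", "tech", "sport"],
})

print(dataset.shape)
print(dataset["Category"].value_counts())
```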
Data preprocessing is the process of transforming raw data into an understandable format. It is also an important step in data mining as we cannot work with raw data. The quality of the data should be checked before applying machine learning or data mining algorithms.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import make_scorer, roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
```
```python
dataset.shape
dataset.info()
```

Columns of Dataset

```python
dataset['Category'].value_counts()
```
```python
# Associate category names with a numerical index and save it in a new column CategoryId
target_category = dataset['Category'].unique()
print(target_category)
```

Convert categories:

```python
dataset['CategoryId'] = dataset['Category'].factorize()[0]
dataset.head()
```
Here you can see each news category's name together with its unique category ID.
```python
# Create a new dataframe "category" with only the unique categories,
# sorted by CategoryId
category = dataset[['Category', 'CategoryId']].drop_duplicates().sort_values('CategoryId')
category
```
In data mining, Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. EDA is used to see what the data can tell us before modeling. It is hard to determine the important characteristics of a dataset by staring at a column of numbers or a whole spreadsheet, and deriving insights from plain numbers can be tedious and overwhelming; exploratory data analysis techniques were devised as an aid in exactly this situation.
The graph below shows the number of news articles per category in our dataset.
```python
dataset.groupby('Category').CategoryId.value_counts().plot(
    kind="bar", color=["pink", "orange", "red", "yellow", "blue"])
plt.xlabel("Category of data")
plt.title("Visualize number of articles per category")
plt.show()
```
```python
fig = plt.figure(figsize=(5, 5))
colors = ["skyblue"]
business = dataset[dataset['CategoryId'] == 0]
tech = dataset[dataset['CategoryId'] == 1]
politics = dataset[dataset['CategoryId'] == 2]
sport = dataset[dataset['CategoryId'] == 3]
entertainment = dataset[dataset['CategoryId'] == 4]
count = [business['CategoryId'].count(), tech['CategoryId'].count(),
         politics['CategoryId'].count(), sport['CategoryId'].count(),
         entertainment['CategoryId'].count()]
pie = plt.pie(count,
              labels=['business', 'tech', 'politics', 'sport', 'entertainment'],
              autopct="%1.1f%%", shadow=True, colors=colors,
              startangle=45, explode=(0.05, 0.05, 0.05, 0.05, 0.05))
```
Here we use the word cloud module to show the category-related words.
Word Cloud is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance. Significant textual data points can be highlighted using a word cloud. Word clouds are widely used for analyzing data from social network websites.
```python
from wordcloud import WordCloud

stop = set(stopwords.words('english'))

business = dataset[dataset['CategoryId'] == 0]['Text']
tech = dataset[dataset['CategoryId'] == 1]['Text']
politics = dataset[dataset['CategoryId'] == 2]['Text']
sport = dataset[dataset['CategoryId'] == 3]['Text']
entertainment = dataset[dataset['CategoryId'] == 4]['Text']

def wordcloud_draw(dataset, color='white'):
    words = ' '.join(dataset)
    cleaned_word = ' '.join([word for word in words.split()
                             if (word != 'news' and word != 'text')])
    wordcloud = WordCloud(stopwords=stop,
                          background_color=color,
                          width=2500, height=2500).generate(cleaned_word)
    plt.figure(1, figsize=(10, 7))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

print("business related words:")
wordcloud_draw(business, 'white')
print("tech related words:")
wordcloud_draw(tech, 'white')
print("politics related words:")
wordcloud_draw(politics, 'white')
print("sport related words:")
wordcloud_draw(sport, 'white')
print("entertainment related words:")
wordcloud_draw(entertainment, 'white')
```
The modeling workflow:

- Show the Text column of the dataset
- Show the Category column of the dataset
- Remove all tags
- Remove special characters
- Convert everything to lower case
- Remove all stopwords
- Lemmatize the words
- After cleaning the text, declare the dependent and independent variables
- Create and fit a bag-of-words model
- Train/test split the dataset
- Create an empty list of results
- Create, fit, and predict with each ML model:
  - Logistic Regression
  - Multinomial Naive Bayes
  - Support Vector Machine
  - Decision Tree
  - KNN
  - Gaussian Naive Bayes
- Pick the best model by accuracy score
- Fit and predict with the best ML model
- Predict a news article
- Input: news headline
Output: Classification of the news category

Text classification datasets are used to categorize natural language texts according to their content: for example, classifying news articles by topic, or book reviews as positive or negative. Language detection, organizing customer feedback, and fraud detection also commonly rely on text classification, automating these tasks with machine learning models.
Category classification, for news, is a multi-label text classification problem. The goal is to assign one or more categories to a news article. A standard technique in multi-label text classification is to use a set of binary classifiers.
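A minimal sketch of this one-binary-classifier-per-category setup, using scikit-learn's `OneVsRestClassifier` over TF-IDF features. The texts and labels below are made up for illustration, not taken from the dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "shares fall as profits miss forecasts",
    "chip maker unveils faster processor",
    "minister defends new tax policy",
    "striker signs contract with champions",
    "film wins award at festival",
]
labels = ["business", "tech", "politics", "sport", "entertainment"]

# One binary logistic-regression classifier is trained per category;
# prediction picks the category whose classifier scores highest.
model = make_pipeline(TfidfVectorizer(),
                      OneVsRestClassifier(LogisticRegression()))
model.fit(texts, labels)

print(model.predict(["profits rise at the bank"]))
```

With genuinely multi-label data, `fit` would instead receive a binary indicator matrix (one column per category), and each binary classifier could then fire independently.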