
Text Sentiment Analysis In Hadoop & Spark

The source code developed and used for my thesis of the same title, written under the guidance of my supervisor, Professor Vasilis Mamalis, at the Department of Informatics and Computer Engineering of the University of West Attica.

You can read the text and the presentation of the thesis (both in Greek) and cite them with the URIs below, from the university's institutional repository:

License: Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA)

Main Objectives

Developed Applications

All the classification models in the applications below use 75% of the input for training and the remaining 25% for testing.
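For illustration, the split could look like the following minimal Scala sketch for the Spark-based applications, assuming the labelled tweets are already loaded into an RDD (the names here are hypothetical and not taken from the thesis code):

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical helper: divide a dataset of labelled tweets into the
// 75% training / 25% testing portions used by all classification models.
def splitDataset(data: RDD[LabeledPoint]): (RDD[LabeledPoint], RDD[LabeledPoint]) = {
  val Array(training, testing) = data.randomSplit(Array(0.75, 0.25), seed = 42L)
  (training, testing)
}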

Using the Hadoop framework

For the Hadoop-based applications, 10 datasets that differ in size (100k up to 1m tweets) have been created and provided here under the /input directory, named train# and test# (# being a number from 1 to 10).

Using the Spark framework

For the Spark-based applications, 10 datasets that differ in size (100k up to 1m tweets) have been created and provided here under the /input directory, appropriately named spark_input_# (# being a number from 1 to 10).

Execution Guide

At the time the source code was developed and the thesis was compiled, the most recent stable releases were:

Hadoop-based Application Execution Guide

Simple Version of Naive Bayes
javac -classpath "$(yarn classpath)" -d NB_classes NB.java
jar -cvf NB.jar -C NB_classes/ .
hadoop jar NB.jar NB train# test# training_split testing_split
Modified Version of Naive Bayes
javac -classpath "$(yarn classpath)" -d Modified_NB_classes Modified_NB.java
jar -cvf Modified_NB.jar -C Modified_NB_classes/ .
hadoop jar Modified_NB.jar Modified_NB train# test# training_split testing_split

Where train# and test# are the desired datasets to be used from /input, and training_split and testing_split are the input split sizes (defined in bytes) that determine how many chunks, and therefore how many mappers, the training and testing data are divided into.
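For context, the following Scala sketch shows one common way such byte values are applied: capping the maximum input split size so that Hadoop creates roughly (file size / split size) splits, each handled by its own mapper. This is only an illustration; the thesis code in NB.java may wire the arguments differently.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

// Hypothetical illustration of the training_split argument: a 10485760-byte
// (10 MB) cap on a ~100 MB dataset yields about 10 input splits, and each
// split is processed by its own mapper.
object SplitSizeSketch {
  def main(args: Array[String]): Unit = {
    val Array(trainPath, trainingSplit) = args // e.g. "input/train1" "10485760"
    val job = Job.getInstance(new Configuration(), "Naive Bayes training")
    FileInputFormat.addInputPath(job, new Path(trainPath))
    FileInputFormat.setMaxInputSplitSize(job, trainingSplit.toLong)
  }
}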


Spark-based Application Execution Guide

Simple Version of Naive Bayes
sbt package
spark-submit --master yarn --deploy-mode client ./target/scala-2.12/nb_2.12-0.1.jar #
Modified Version of Naive Bayes
sbt package
spark-submit --master yarn --deploy-mode client ./target/scala-2.12/modified_nb_2.12-0.1.jar #
Simple Version of SVM
sbt package
spark-submit --master yarn --deploy-mode client ./target/scala-2.12/svm_2.12-0.1.jar #
Modified Version of SVM
sbt package
spark-submit --master yarn --deploy-mode client ./target/scala-2.12/modified_svm_2.12-0.1.jar #

Where # is the number indicating the desired dataset to be used from /input.
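For context, the sketch below outlines what a Spark MLlib SVM job such as svm_2.12-0.1.jar might do internally: read a spark_input_# dataset, hash the tweet text into term-frequency vectors, split it 75/25, train a linear SVM, and report accuracy. The input line format, feature extraction, and names are assumptions made for illustration, not the thesis implementation.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

object SVMSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext() // configuration is supplied by spark-submit

    // Assumed line format: "<label>\t<tweet text>", with 0 = negative, 1 = positive.
    val hashingTF = new HashingTF(10000)
    val data = sc.textFile("input/spark_input_" + args(0)).map { line =>
      val Array(label, text) = line.split("\t", 2)
      LabeledPoint(label.toDouble, hashingTF.transform(text.split(" ").toSeq))
    }

    // 75% of the input for training, the remaining 25% for testing.
    val Array(training, testing) = data.randomSplit(Array(0.75, 0.25), seed = 42L)

    // Train a linear SVM for 100 iterations of SGD.
    val model = SVMWithSGD.train(training.cache(), 100)

    // Accuracy on the held-out 25%.
    val accuracy = testing
      .map(p => (model.predict(p.features), p.label))
      .filter { case (prediction, label) => prediction == label }
      .count().toDouble / testing.count()

    println(s"Test accuracy: $accuracy")
    sc.stop()
  }
}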