AlexGonRo / Instance-Selection-Algorithms-Spark

GNU General Public License v3.0

Parallel instance selection algorithms using Spark framework

Description

This repository offers a variety of parallel data mining techniques built on the Apache Spark™ framework and its RDD structures.
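As a rough illustration of the RDD-based workflow (the application name and dataset layout here are assumptions for the example, not part of this library's API), a Spark job reading a dataset into an RDD of instances might look like:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    // Local master for illustration; in a cluster this would be a spark:// URL.
    val conf = new SparkConf().setAppName("is-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Each line of the input file becomes one instance; here we assume the
    // last comma-separated field is the class label.
    val instances = sc.textFile("Data/dataset1")
      .map(_.split(','))
      .map(fields => (fields.init.map(_.toDouble), fields.last))

    println(s"Loaded ${instances.count()} instances")
    sc.stop()
  }
}
```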

This library makes it easy to set up a data mining pipeline and run it in a parallel environment. We recommend using the console user interface when launching a task. However, to make our work easier to use, we also offer a basic GUI covering the core functionality.

![Full interface](full_interface.png)

Although the major concern at the time of its creation was the implementation of instance selection algorithms, the structure of the library allows for the implementation and use of any other data mining task.

Right now, the most notable contents of this repository are the instance selection algorithms Locality Sensitive Hashing Instance Selection (LSHIS) and Democratic Instance Selection (DemoIS), and the kNN classifier.
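To illustrate the idea behind the kNN classifier shipped with the library (this is a minimal sequential 1-NN sketch for clarity, not the library's actual implementation or API), a nearest-neighbour prediction can be written as:

```scala
object NearestNeighbour {
  // Squared Euclidean distance between two feature vectors.
  private def dist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Predict the label of `query` as the label of its nearest training instance.
  def classify(train: Seq[(Array[Double], String)], query: Array[Double]): String =
    train.minBy { case (features, _) => dist(features, query) }._2

  def main(args: Array[String]): Unit = {
    val train = Seq(
      (Array(0.0, 0.0), "a"),
      (Array(1.0, 1.0), "b"))
    println(classify(train, Array(0.9, 0.8))) // prints "b"
  }
}
```

Instance selection algorithms such as LSHIS and DemoIS reduce the training set before this kind of classification, which is what makes them attractive for large datasets.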

The first version of this work was presented in February 2016 as a Computer Science bachelor's thesis at the University of Burgos. The old repository can still be found here: https://bitbucket.org/agr00095/tfg-alg.-seleccion-instancias-spark

Before running

Please check that your system fulfills all of the following requirements. We do not guarantee that this library will work under a different configuration.

Releases

You can find the newest version of the library (already built) in our Releases section.

Two different files are provided:

Build it yourself

If you need an older version of this program or want to build it with a different configuration, you can do so.

Open a command window and navigate to the root directory. Modify the POM (pom.xml) file if needed and execute the following line:

$ mvn clean package

Execution

Right now, the program only allows for the execution of the following types of pipelines:

The skeleton of the execution command looks as follows:

$SPARK_HOME/bin/spark-submit --master "URL" ["OTHER_SPARK_ARGS"] \
--class "launcher.ExperimentLauncher" "PATH_JAR" ISClassExec \
-r "PATH_DATASET" ["OTHER_LOADING_ARGS"] -f "PATH_INSTANCE_SELECTION_ALG" \
"ALGORITHM_ARGS" -c "PATH_CLASSIFIER" \
"CLASSIFIER_ARGS" [-cv "CV_ARGS"]

In the command above:

Example

The following code shows an example command:

$SPARK_HOME/bin/spark-submit --master spark://alejandro:7077 \
--class "launcher.ExperimentLauncher" "./ISAlgorithms.jar" ISClassExec \
-r ./Data/dataset1 \
-f instanceSelection.demoIS.DemoIS \
-c classification.seq.knn.KNN \
-cv 10

This command will run on the master node spark://alejandro:7077. The dataset, dataset1, will be the input to a Democratic Instance Selection algorithm and a kNN classifier, both with default parameters. A 10-fold cross-validation will be performed.
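The 10-fold cross-validation requested by `-cv 10` partitions the dataset into ten disjoint folds, using each fold once as the test set and the rest as training data. A minimal sketch of that splitting logic (illustrative only; the library's internal implementation may differ):

```scala
object CrossValidation {
  // Split indices 0 until n into k disjoint folds, round-robin style,
  // returning (trainIndices, testIndices) pairs for each fold.
  def folds(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    val assignment = (0 until n).map(i => i % k)
    (0 until k).map { fold =>
      val test  = (0 until n).filter(assignment(_) == fold)
      val train = (0 until n).filterNot(assignment(_) == fold)
      (train, test)
    }
  }

  def main(args: Array[String]): Unit = {
    // With 10 instances and 10 folds, each instance is the test set exactly once.
    folds(10, 10).foreach { case (train, test) =>
      assert(test.size == 1 && train.size == 9)
    }
    println("ok")
  }
}
```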

Additional notes

Cite

When citing this implementation, please refer to:

Articles