ayrna / orca

Ordinal Regression and Classification Algorithms
http://www.uco.es/grupos/ayrna/orreview
GNU General Public License v3.0

ORCA

ORCA (Ordinal Regression and Classification Algorithms) is a MATLAB framework that implements and integrates a wide range of ordinal regression methods and performance metrics from the paper "Ordinal regression methods: survey and experimental study", published in IEEE Transactions on Knowledge and Data Engineering. ORCA also speeds up the experimental comparison of classifiers through automatic fold execution, experiment parallelisation and performance reports. A basic definition of ordinal regression can be found on Wikipedia.

As a generic experimental framework, its two main objectives are:

  1. To run experiments easily to facilitate the comparison between algorithms and datasets.
  2. To provide an easy way of including new algorithms into the framework by simply defining the training and test methods and the hyperparameters of the algorithms.

To serve these purposes, ORCA is mainly used through configuration files that describe experiments, but the methods can also be used directly through a common API.

Cite ORCA

If you use ORCA and/or associated datasets, please cite the following works:

J. Sánchez-Monedero, P. A. Gutiérrez and M. Pérez-Ortiz, 
"ORCA: A Matlab/Octave Toolbox for Ordinal Regression", 
Journal of Machine Learning Research. Vol. 20. Issue 125. 2019. http://jmlr.org/papers/v20/18-349.html

P.A. Gutiérrez, M. Pérez-Ortiz, J. Sánchez-Monedero, F. Fernandez-Navarro and C. Hervás-Martínez.
"Ordinal regression methods: survey and experimental study",
IEEE Transactions on Knowledge and Data Engineering, Vol. 28, January, 2016, pp. 127-146. http://dx.doi.org/10.1109/TKDE.2015.2457911

BibTeX entries:

@article{JMLR:v20:18-349,
  author  = {Javier S{{\'a}}nchez-Monedero and Pedro A. Guti{{\'e}}rrez and Mar{{\'i}}a P{{\'e}}rez-Ortiz},
  title   = {ORCA: A Matlab/Octave Toolbox for Ordinal Regression},
  journal = {Journal of Machine Learning Research},
  year    = {2019},
  volume  = {20},
  number  = {125},
  pages   = {1-5},
  url     = {http://jmlr.org/papers/v20/18-349.html}
}

@Article{Gutierrez2015,
  Title                    = {Ordinal regression methods: survey and experimental study},
  Author                   = {P.A. Guti\'errez and M. P\'erez-Ortiz and J. S\'anchez-Monedero and  F. Fernandez-Navarro and C. Herv\'as-Mart\'inez},
  Journal                  = {IEEE Transactions on Knowledge and Data Engineering},
  Year                     = {2016},
  Url                      = {http://dx.doi.org/10.1109/TKDE.2015.2457911},
  Volume                   = {28},
  Number                   = {1},
  pages                    = {127-146},
}

For more information about the paper and the ordinal datasets used please visit the associated website: http://www.uco.es/grupos/ayrna/orreview

For more information about our research group, please visit the website of the Learning and Artificial Neural Networks (AYRNA) group at the University of Córdoba (Spain).

Installation, tutorials and documentation

The documentation can be found in the doc folder and includes:

Methods included

The Algorithms folder contains the MATLAB classes for the included algorithms and, where applicable, the original code. The config-files folder contains configuration files for running all the algorithms. To use these files, you need the datasets from the review paper cited above. To add your own method, see Adding a new method to ORCA.

The running time of the algorithms was analysed in "Ordinal regression methods: survey and experimental study" (2016). That analysis indicates that ELMOP, SVORLin and POM are the best options when computational cost is a priority. Training time is generally highest for the neural network methods (NNPOM and NNOP) and GPOR. For GPOR this cost may be acceptable, given that it obtains very good performance on balanced ordinal datasets, whereas the neural network-based methods are generally outperformed by the ordinal SVM variants.

Concerning scalability, the experimental setup in the review also included some relatively large datasets, so practitioners can check how long it took to train each model with the ORCA framework. In general, linear models such as POM and SVORLin perform very well when plenty of data is available while keeping running time reasonably low (e.g. around 10 seconds for cross-validating, training and testing on a dataset of almost 22,000 patterns). Although very high-dimensional datasets were not considered in the analysis, SVMs are well known to handle high-dimensional data, and since they are among the best performing methods in ordinal regression, they may be a good choice in such a scenario.

Ordinal regression algorithms

Partial order methods

Nominal methods

Performance metrics

The measures folder contains the MATLAB classes for the metrics used for evaluating the classifiers. More details about the metrics included in ORCA can be found in [14,15].
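
To illustrate why ordinal problems call for dedicated metrics, here is a minimal Python sketch of two measures commonly used in ordinal classification: the correct classification rate and the mean absolute error between class ranks. It is independent of ORCA's MATLAB classes. The function names and the encoding of classes as integers 1..Q are assumptions made for this example, not ORCA's API.

```python
def correct_classification_rate(y_true, y_pred):
    """Fraction of patterns whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average distance, in number of ranks, between true and predicted class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels on a five-class ordinal scale (1 = lowest, 5 = highest).
y_true = [1, 2, 3, 4, 5]
near_miss = [2, 2, 3, 4, 5]  # one error, off by a single rank
far_miss = [5, 2, 3, 4, 5]   # one error, off by four ranks

for name, y_pred in (("near miss", near_miss), ("far miss", far_miss)):
    print(name,
          correct_classification_rate(y_true, y_pred),
          mean_absolute_error(y_true, y_pred))
```

Both predictions make exactly one mistake, so the correct classification rate cannot tell them apart. The mean absolute error is four times larger for the second one because the mistake lies four ranks away from the true class.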

Utilities, classes and scripts

Datasets

The example-data folder includes partitions of several small ordinal datasets for code testing purposes. We have also collected 44 publicly available ordinal datasets from various sources. These can be downloaded from: datasets-OR-review. The link also contains data partitions as used in different papers in the literature to ease experimental comparison. The characteristics of these datasets are the following:

| Dataset | #Pat. | #Attr. | #Classes | Class distribution |
|---|---|---|---|---|
| pyrim5 (P5) | 74 | 27 | 5 | ~15 per class |
| machine5 (M5) | 209 | 7 | 5 | ~42 per class |
| housing5 (H5) | 506 | 14 | 5 | ~101 per class |
| stock5 (S5) | 700 | 9 | 5 | 140 per class |
| abalone5 (A5) | 4177 | 11 | 5 | ~836 per class |
| bank5 (B5) | 8192 | 8 | 5 | ~1639 per class |
| bank5' (BB5) | 8192 | 32 | 5 | ~1639 per class |
| computer5 (C5) | 8192 | 12 | 5 | ~1639 per class |
| computer5' (CC5) | 8192 | 21 | 5 | ~1639 per class |
| cal.housing5 (CH5) | 20640 | 8 | 5 | 4128 per class |
| census5 (CE5) | 22784 | 8 | 5 | ~4557 per class |
| census5' (CEE5) | 22784 | 16 | 5 | ~4557 per class |
| pyrim10 (P10) | 74 | 27 | 10 | ~8 per class |
| machine10 (M10) | 209 | 7 | 10 | ~21 per class |
| housing10 (H10) | 506 | 14 | 10 | ~51 per class |
| stock10 (S10) | 700 | 9 | 10 | 70 per class |
| abalone10 (A10) | 4177 | 11 | 10 | ~418 per class |
| bank10 (B10) | 8192 | 8 | 10 | ~820 per class |
| bank10' (BB10) | 8192 | 32 | 10 | ~820 per class |
| computer10 (C10) | 8192 | 12 | 10 | ~820 per class |
| computer10' (CC10) | 8192 | 21 | 10 | ~820 per class |
| cal.housing (CH10) | 20640 | 8 | 10 | 2064 per class |
| census10 (CE10) | 22784 | 8 | 10 | ~2279 per class |
| census10' (CEE10) | 22784 | 16 | 10 | ~2279 per class |

| Dataset | #Pat. | #Attr. | #Classes | Class distribution |
|---|---|---|---|---|
| contact-lenses (CL) | 24 | 6 | 3 | (15,5,4) |
| pasture (PA) | 36 | 25 | 3 | (12,12,12) |
| squash-stored (SS) | 52 | 51 | 3 | (23,21,8) |
| squash-unstored (SU) | 52 | 52 | 3 | (24,24,4) |
| tae (TA) | 151 | 54 | 3 | (49,50,52) |
| newthyroid (NT) | 215 | 5 | 3 | (30,150,35) |
| balance-scale (BS) | 625 | 4 | 3 | (288,49,288) |
| SWD (SW) | 1000 | 10 | 4 | (32,352,399,217) |
| car (CA) | 1728 | 21 | 4 | (1210,384,69,65) |
| bondrate (BO) | 57 | 37 | 5 | (6,33,12,5,1) |
| toy (TO) | 300 | 2 | 5 | (35,87,79,68,31) |
| eucalyptus (EU) | 736 | 91 | 5 | (180,107,130,214,105) |
| LEV (LE) | 1000 | 4 | 5 | (93,280,403,197,27) |
| automobile (AU) | 205 | 71 | 6 | (3,22,67,54,32,27) |
| winequality-red (WR) | 1599 | 11 | 6 | (10,53,681,638,199,18) |
| ESL (ES) | 488 | 4 | 9 | (2,12,38,100,116,135,62,19,4) |
| ERA (ER) | 1000 | 4 | 9 | (92,142,181,172,158,118,88,31,18) |
| marketing | 8993 | 74 | 9 | (1745,775,667,813,722,1110,969,1308,884) |
| thyroid | 7200 | 21 | 3 | (6666,166,368) |
| winequality-white | 4898 | 11 | 7 | (20,163,1457,2198,880,175,5) |
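
The near-uniform class sizes of the five- and ten-class datasets listed first (for example, ~15 per class for pyrim5 with 74 patterns) are what an equal-frequency binning of a continuous target would produce. The Python sketch below shows that kind of discretisation. It is an assumption made for illustration: the text above does not state the exact procedure used to build these datasets, and the function name and synthetic targets are hypothetical.

```python
from collections import Counter


def equal_frequency_labels(targets, n_classes):
    """Map continuous targets to ordinal labels 1..n_classes of (almost) equal size."""
    n = len(targets)
    if n_classes < 1 or n < n_classes:
        raise ValueError("need at least one pattern per class")
    # Rank the patterns by target value, then cut the ranking into n_classes bins.
    order = sorted(range(n), key=lambda i: targets[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_classes // n + 1
    return labels


# Synthetic stand-in for a regression target with 74 patterns (as in pyrim5).
targets = [float((i * 31) % 74) for i in range(74)]
labels = equal_frequency_labels(targets, n_classes=5)

sizes = Counter(labels)
print([sizes[k] for k in range(1, 6)])  # [15, 15, 15, 15, 14]
```

Because the labels follow the rank of the target, a larger target value never receives a smaller class label, so the ordering that ordinal regression relies on is preserved.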

Experiments parallelization with HTCondor

The condor folder contains the necessary files and steps for using HTCondor with our framework.

External software

ORCA makes use of the following external software implementations. For some of them, a MATLAB interface has been developed through the use of MEX files.

Other contributors

Apart from the authors of the paper and the authors of the implementations referenced in the "External software" section, the following people also contributed to the ORCA framework:

References