andremun / pyInstanceSpace

A Python version of the InstanceSpace analysis toolkit

Instance Space Analysis: A toolkit for the assessment of algorithmic power


Instance Space Analysis is a methodology for the assessment of the strengths and weaknesses of an algorithm, and an approach to objectively compare algorithmic power without bias introduced by restricted choice of test instances. At its core is the modelling of the relationship between structural properties of an instance and the performance of a group of algorithms. Instance Space Analysis allows the construction of footprints for each algorithm, defined as regions in the instance space where we statistically infer good performance. Other insights that can be gathered from Instance Space Analysis include:

The unique advantage of visualizing algorithm performance in the instance space, rather than as a small set of summary statistics averaged across a selected collection of instances, is the nuanced analysis it enables: explaining strengths and weaknesses, and examining interesting variations in performance that may be hidden by tables of summary statistics.

This repository provides a set of Python tools to carry out a complete Instance Space Analysis in an automated pipeline. We expect it to become the computational engine that powers the Melbourne Algorithm Test Instance Library with Data Analytics (MATILDA) web tools for online analysis. Further information on the Instance Space Analysis methodology can be found here.

If you follow the Instance Space Analysis methodology, please cite as follows:

K. Smith-Miles and M.A. Muñoz. Instance Space Analysis for Algorithm Testing: Methodology and Software Tools. ACM Computing Surveys, 55(12):255, 1-31, 2023. DOI:10.1145/3572895.

Also, if you specifically use this code, please cite as follows:

TBD

DISCLAIMER: This repository contains research code. On occasion, new features will be added or changes made that may result in crashes. Although we have made every effort to reduce bugs, this code comes with NO GUARANTEES. If you find issues, let us know ASAP through the contact methods described at the end of this document.

Installation Instructions

To be expanded

run pip install ./matilda-0.1.0.tar.gz

An example of running can be found in integration_demo.py

An example of a plugin can be found in example_plugin.py

Documentation Instructions

run pdoc matilda

See the pdoc documentation for instructions on exporting static HTML files for hosting on GitHub Pages.

Development Environment Setup Guide

REQUIREMENTS:

Step 1: Install poetry

Linux, Mac, WSL

curl -sSL https://install.python-poetry.org | python3 -

Windows

(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -

Step 2: Setup virtual environment

poetry shell

Step 3: Install python dependencies into virtual environment

poetry install

Working with the code

We will update this explanation in the next few months. Here is a copy of the description of the several inputs, which are passed as a JSON file.

The metadata file

The metadata.csv file should contain a table where each row corresponds to a problem instance, and each column must strictly follow the naming convention mentioned below:

Moreover, empty cells, NaN or null values are allowed but not recommended. We expect you to handle missing values in your data before processing. You may use this file as reference.
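A quick way to sanity-check a metadata file before running the pipeline is with pandas. The column prefixes used in this sketch (`Instances`, `feature_`, `algo_`) follow the convention of the original MATLAB toolkit and are assumptions here; substitute your own column names.

```python
# Illustrative pre-flight check of a metadata.csv: verify the assumed
# column-name prefixes and confirm there are no missing values.
import io
import pandas as pd

# Inline sample standing in for metadata.csv.
csv_text = """Instances,feature_size,feature_density,algo_greedy,algo_exact
inst_1,10,0.5,1.2,0.9
inst_2,20,0.8,3.4,2.1
"""

df = pd.read_csv(io.StringIO(csv_text))

features = [c for c in df.columns if c.startswith("feature_")]
algorithms = [c for c in df.columns if c.startswith("algo_")]

# The toolkit expects missing values to be handled beforehand.
assert not df.isna().any().any(), "handle missing values before processing"
print(features)    # ['feature_size', 'feature_density']
print(algorithms)  # ['algo_greedy', 'algo_exact']
```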

Options

The script example.m constructs a structure that contains all the settings used by the code. Broadly, there are settings for the analysis itself, settings for the pre-processing of the data, and output settings. The former are divided into general, dimensionality reduction, bound estimation, algorithm selection, and footprint construction settings. For pre-processing, the toolkit has routines for bounding outliers, scaling the data, and selecting features.

General settings

Dimensionality reduction settings

The toolkit uses PILOT as a dimensionality reduction method, with BFGS as numerical solver. Technical details about it can be found here.
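To give a feel for the approach, here is a minimal sketch of a PILOT-style linear projection to 2-D fitted with BFGS via scipy. It is an illustration of the idea only, not the toolkit's PILOT: the real method also incorporates the algorithm performance data into the objective.

```python
# Fit a 2-D linear projection Z = X A^T by minimizing the feature
# reconstruction error ||X - Z B^T||^2 with the BFGS solver.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))           # 50 instances, 5 features
n, d = X.shape

def loss(theta):
    A = theta[: 2 * d].reshape(2, d)       # projection: features -> 2-D
    B = theta[2 * d :].reshape(d, 2)       # back-projection: 2-D -> features
    Z = X @ A.T                            # instance coordinates in 2-D
    return np.sum((X - Z @ B.T) ** 2)      # reconstruction error

theta0 = rng.standard_normal(4 * d) * 0.1
res = minimize(loss, theta0, method="BFGS")

A = res.x[: 2 * d].reshape(2, d)
Z = X @ A.T                                # the 2-D "instance space"
print(Z.shape)                             # (50, 2)
```

Because the objective is linear in structure, heavy-tailed features distort the fit, which is why the data-transformation advice further below matters.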

Empirical bound estimation settings

The toolkit uses CLOISTER, a correlation-based algorithm, to detect the empirical bounds of the instance space.
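The core idea can be sketched as follows: take the box defined by each feature's observed range, then use pairwise correlations to discard box corners that the data makes implausible. This is a simplified illustration under assumed thresholds, not the toolkit's actual CLOISTER algorithm.

```python
# Keep only the corners of the feature-bound box whose sign pattern is
# consistent with strong positive pairwise correlations (simplified).
import itertools
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.standard_normal(200)
# Two strongly correlated features.
X = np.column_stack([x1, 0.9 * x1 + 0.1 * rng.standard_normal(200)])

lo, hi = X.min(axis=0), X.max(axis=0)
corr = np.corrcoef(X, rowvar=False)

corners = []
for signs in itertools.product([0, 1], repeat=X.shape[1]):
    corner = np.where(signs, hi, lo)
    # A (lo, hi) or (hi, lo) corner contradicts a strong positive correlation.
    ok = all(
        corr[i, j] < 0.7 or signs[i] == signs[j]
        for i in range(X.shape[1]) for j in range(i + 1, X.shape[1])
    )
    if ok:
        corners.append(corner)

print(len(corners))  # 2: only the (lo, lo) and (hi, hi) corners survive
```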

Algorithm selection settings

The toolkit uses SVMs with radial basis kernels as algorithm selection models, through MATLAB's Statistics and Machine Learning Toolbox or LIBSVM.
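In Python, the same kind of model can be fitted with scikit-learn, shown here purely as an illustration of the selection model described above; it is not necessarily what this port uses internally.

```python
# An RBF-kernel SVM predicting, per instance, whether an algorithm
# performs well, from the instance's 2-D instance-space coordinates.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
Z = rng.standard_normal((100, 2))            # 2-D instance coordinates
y = (Z[:, 0] + Z[:, 1] > 0).astype(int)      # 1 = "algorithm performs well"

model = SVC(kernel="rbf", C=1.0, gamma="scale")
model.fit(Z, y)
print(model.score(Z, y))                     # training accuracy, close to 1.0
```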

Footprint construction settings

The toolkit uses TRACE, an algorithm based on MATLAB's polyshapes, to define the regions in the space where we statistically infer good algorithm performance. The polyshapes are then pruned to remove those sections for which the evidence, as defined by a minimum purity value, is poor or non-existent.
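The footprint-and-purity idea can be sketched in Python with scipy's geometry tools (a library such as shapely would be the closer analogue of MATLAB's polyshapes). This is a simplified illustration, not the toolkit's TRACE: it builds a single convex hull rather than the pruned polyshapes TRACE produces.

```python
# Build a hull over the instances where an algorithm did well, then
# measure its purity: the fraction of enclosed instances that are good.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(3)
Z = rng.uniform(-1, 1, size=(200, 2))     # 2-D instance coordinates
good = Z[:, 0] > 0                        # pretend the algorithm is good here

tri = Delaunay(Z[good])                   # triangulation of the good points;
inside = tri.find_simplex(Z) >= 0         # union of simplices = convex hull
purity = good[inside].mean()              # fraction of enclosed points that are good

min_purity = 0.55                         # illustrative threshold
footprint_kept = purity >= min_purity
print(footprint_kept)                     # the enclosed region is overwhelmingly good
```

In TRACE, regions failing the minimum purity test would be pruned away rather than the whole footprint being kept or discarded.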

Automatic data bounding and scaling

The toolkit implements simple routines to bound outliers and scale the data. These routines are by no means perfect, and users should pre-process their data independently if preferred. However, the automatic bounding and scaling routines should give some idea of the kind of results that may be achieved. In general, we recommend transforming the data to be close to normally distributed, due to the linear nature of PILOT's optimal projection algorithm.
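A simple bounding-and-scaling scheme of this kind might look as follows. The specific rule (clamp each feature to the median plus or minus five interquartile ranges, then z-score) is an assumption for illustration; the toolkit's own routines may differ in detail.

```python
# Clamp outliers per feature, then standardize each feature to
# zero mean and unit variance.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3))
X[0, 0] = 50.0                                  # inject an outlier

med = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1

# Bound each feature to median +/- 5 * IQR.
X_bounded = np.clip(X, med - 5 * iqr, med + 5 * iqr)

# Z-score each feature.
X_scaled = (X_bounded - X_bounded.mean(axis=0)) / X_bounded.std(axis=0)
print(X_bounded[0, 0] < 50.0)                   # the outlier was clamped
```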

Automatic feature selection

The toolkit implements SIFTED, a routine that selects features based on their cross-correlation and their correlation to performance. Ideally, we want the smallest number of orthogonal and predictive features. This routine is by no means perfect, and users should pre-process their data independently if preferred. In general, we recommend using no more than 10 features as input to PILOT's optimal projection algorithm, due to the numerical nature of its solution and issues in identifying meaningful linear trends.
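A greedy sketch of the underlying idea: from each highly cross-correlated pair of features, drop the one less correlated with performance. The 0.95 threshold and the greedy pairwise rule are assumptions for illustration; this is not the toolkit's actual SIFTED algorithm.

```python
# Drop one feature from each near-duplicate pair, keeping whichever
# correlates better with the performance measure y.
import numpy as np

rng = np.random.default_rng(5)
f1 = rng.standard_normal(300)
X = np.column_stack([
    f1,
    f1 + 0.05 * rng.standard_normal(300),   # near-duplicate of feature 0
    rng.standard_normal(300),               # independent feature
])
y = 2.0 * f1 + rng.standard_normal(300)     # performance measure

# |correlation| of each feature with performance, and between features.
corr_xy = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
corr_xx = np.abs(np.corrcoef(X, rowvar=False))

keep = list(range(X.shape[1]))
for i in range(X.shape[1]):
    for j in range(i + 1, X.shape[1]):
        if corr_xx[i, j] > 0.95 and i in keep and j in keep:
            keep.remove(i if corr_xy[i] < corr_xy[j] else j)

print(len(keep), 2 in keep)  # 2 True: one near-duplicate dropped, feature 2 kept
```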

Output settings

These settings result in more information being stored in files or presented in the console output.

Contact

If you have any suggestions or ideas (e.g. for new features), or if you encounter any problems while running the code, please use the issue tracker or contact us through the MATILDA Queries and Feedback page.

Acknowledgements

Partial funding for the development of this code was provided by the Australian Research Council through the Industrial Transformation Training Centre grant IC200100009.

This code was developed as part of the subject SWEN90017-18, by students Junheng Chen, Yusuf Berdan Guzel, Kushagra Khare, Dong Hyeog Jang, Kian Dsouza, Nathan Harvey, Tao Yu, Xin Xiang, Jiaying Yi, and Cheng Ze Lam. The team was mentored by Ben Golding, and the subject was coordinated by Mansooreh Zahedi.