Waikato / moa

MOA is an open source framework for Big Data stream mining. It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection and recommender systems) and tools for evaluation.
http://moa.cms.waikato.ac.nz/
GNU General Public License v3.0

confStream: Automated Algorithm Configuration for Stream Clustering #201

Closed MatthiasCarnein closed 4 years ago

MatthiasCarnein commented 4 years ago

Hi, this pull request adds our confStream algorithm to MOA. confStream automatically finds the best stream clustering algorithm and its best parameter configuration for a given data stream in real time. As a result, the user no longer has to choose parameter settings by hand; the algorithm selects them automatically and adapts them over time. In our experiments, this improved the clustering quality considerably.

image

For this, the algorithm trains an ensemble of different algorithms and configurations in parallel. Periodically, the most promising configurations create offspring, which can improve the overall quality over time. A brief overview of this is shown below and more information is available in the corresponding papers:

image
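
The ensemble step described above can be illustrated with a minimal Java sketch. It assumes that every configuration carries a single performance score where higher is better, that parents are drawn with probability proportional to that score, and that an offspring replaces the worst ensemble member only if it scores higher. These rules and all names (`Candidate`, `EnsembleStepSketch`, `selectParent`, `insert`) are assumptions made for illustration; the selection and replacement logic in confStream itself may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: class and method names are illustrative and are
// not taken from the confStream implementation.
final class Candidate {
    final String description;  // e.g. algorithm name plus parameter values
    final double performance;  // higher is better, e.g. a silhouette score

    Candidate(String description, double performance) {
        this.description = description;
        this.performance = performance;
    }
}

final class EnsembleStepSketch {

    // Roulette-wheel selection: better candidates are more likely to
    // become parents. Negative scores are treated as weight zero.
    static Candidate selectParent(List<Candidate> ensemble, Random rng) {
        double total = 0.0;
        for (Candidate c : ensemble) {
            total += Math.max(c.performance, 0.0);
        }
        if (total <= 0.0) {
            // No candidate has positive weight: fall back to a uniform pick.
            return ensemble.get(rng.nextInt(ensemble.size()));
        }
        double r = rng.nextDouble() * total;
        for (Candidate c : ensemble) {
            r -= Math.max(c.performance, 0.0);
            if (r < 0.0) {
                return c;
            }
        }
        return ensemble.get(ensemble.size() - 1); // guard against rounding
    }

    // Adds the offspring while the ensemble has room (maxSize >= 1);
    // otherwise replaces the worst member only if the offspring scores
    // higher. Returns true if the ensemble changed.
    static boolean insert(List<Candidate> ensemble, Candidate offspring, int maxSize) {
        if (ensemble.size() < maxSize) {
            ensemble.add(offspring);
            return true;
        }
        int worst = 0;
        for (int i = 1; i < ensemble.size(); i++) {
            if (ensemble.get(i).performance < ensemble.get(worst).performance) {
                worst = i;
            }
        }
        if (offspring.performance > ensemble.get(worst).performance) {
            ensemble.set(worst, offspring);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        List<Candidate> ensemble = new ArrayList<>();
        ensemble.add(new Candidate("denstream.WithDBSCAN e=0.02", 0.40));
        ensemble.add(new Candidate("clustree.ClusTree H=8", 0.10));

        Candidate parent = selectParent(ensemble, rng);
        // In a real system the offspring's score would come from evaluating
        // it on recent stream data; here it is a made-up number.
        Candidate offspring = new Candidate(parent.description + " (offspring)", 0.45);
        boolean changed = insert(ensemble, offspring, 2);
        System.out.println("parent: " + parent.description + ", ensemble changed: " + changed);
    }
}
```

Replacing only the worst member keeps the best configurations found so far in the ensemble, which is one simple way to let quality improve over time without losing good candidates.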

The initial idea has been presented at the Workshop on Automated Data Science at ECML PKDD '19. More detailed results will be presented at this year's LION 14 conference.

An overview of the algorithm and the published papers is also available here: https://www.carnein.com/confstream

The general idea of confStream is inspired by the BLAST algorithm:

confStream can be selected like any other clustering algorithm in MOA. However, it relies on a configuration file that lists all the algorithms and parameters to optimise. All stream clustering algorithms in MOA, as well as all their parameters, can be selected.

An example configuration file is available in "moa/moa/src/main/java/moa/clusterers/meta/" and also shown below. Most importantly, the configuration file specifies a list of algorithms and their parameters to tune. Here, DenStream and ClusTree as well as all their relevant parameters are tuned.

{
    "windowSize": 1000,
    "ensembleSize": 20,
    "newConfigurations": 10,
    "keepCurrentModel": "true",
    "reinitialiseWithClusters": "true",
    "preventAlgorithmDeath": "true",
    "evaluateMacro": "false",
    "keepGlobalIncumbent": "true",
    "keepAlgorithmIncumbents": "true",
    "keepInitialConfigurations": "true",
    "useTestEnsemble": "true",
    "lambda": 0.05,
    "resetProbability": 0.01,
    "numberOfCores": 1,
    "performanceMeasure": "SilhouetteCoefficient",
    "performanceMeasureMaximisation": "true",
    "algorithms": [
        {
            "algorithm": "denstream.WithDBSCAN",
            "parameters": [
                {"parameter": "e", "type":"numeric", "value":0.02, "range":[0,1]},
                {"parameter": "b", "type":"numeric", "value":0.2, "range":[0,1]},
                {"parameter": "m", "type":"integer", "value":1, "range":[0,10000]},
                {"parameter": "o", "type":"integer", "value":2, "range":[2,20]},
                {"parameter": "l", "type":"numeric", "value":0.25, "range":[0,1]}
            ]
        },
        {
            "algorithm": "clustree.ClusTree",
            "parameters": [
                {"parameter": "H", "type":"integer", "value":8, "range":[1,20]},
                {"parameter": "B", "type":"boolean", "value":"false"}
            ]
        }
    ]
}
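
Each entry under "parameters" gives a type, a starting value and, for numeric and integer parameters, a range. As a rough illustration of how a new candidate value could be drawn from such an entry, the Java sketch below perturbs the current value and clamps the result to the declared range. The Gaussian step, the `spread` and `flipProbability` arguments and all method names are assumptions made for this example, not confStream's actual sampling procedure.

```java
import java.util.Random;

// Hypothetical sketch: not the sampling code used by confStream.
final class ParameterSketch {

    // "numeric": take a Gaussian step scaled to the range width, then
    // clamp the result to [lo, hi].
    static double sampleNumeric(double value, double lo, double hi,
                                double spread, Random rng) {
        double sampled = value + rng.nextGaussian() * spread * (hi - lo);
        return Math.min(hi, Math.max(lo, sampled));
    }

    // "integer": same idea, rounded to the nearest whole number.
    static int sampleInteger(int value, int lo, int hi,
                             double spread, Random rng) {
        double sampled = value + rng.nextGaussian() * spread * (hi - lo);
        long rounded = Math.round(sampled);
        return (int) Math.min(hi, Math.max(lo, rounded));
    }

    // "boolean": flip the current value with a small probability.
    static boolean sampleBoolean(boolean value, double flipProbability, Random rng) {
        return rng.nextDouble() < flipProbability ? !value : value;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        // Values and ranges copied from the example configuration above.
        double e = sampleNumeric(0.02, 0.0, 1.0, 0.1, rng);  // DenStream "e"
        int o = sampleInteger(2, 2, 20, 0.1, rng);           // DenStream "o"
        int h = sampleInteger(8, 1, 20, 0.1, rng);           // ClusTree "H"
        boolean b = sampleBoolean(false, 0.1, rng);          // ClusTree "B"
        System.out.println("e=" + e + " o=" + o + " H=" + h + " B=" + b);
    }
}
```

Clamping guarantees that every sampled value stays inside the "range" declared in the configuration file, so an offspring can never be started with a setting outside the declared bounds.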