Horsmann closed this issue 6 years ago
@zesch there is now a builder class to hide the dimensions a bit. Any thoughts on this ExperimentBuilder? See for instance dkpro-tc/dkpro-tc-examples/src/main/java/org/dkpro/tc/examples/MinimalWorkingExample.java
At least the code looks a bit cleaner without the many dimensions :)
It looks like we only went halfway. It doesn't really serve the user group who wants control over the dimensions, and it also doesn't really serve the beginners who just want to set the relevant stuff.
The minimal example (in my view) should only force the user to set input data, features and classifier (anything else?) and not expose any lab related stuff.
> serve the beginners who just want to set the relevant stuff
There is no way of knowing if an experiment is classification or regression. This has to come from the outside. Once you know that, you can also add a default-configured classifier that works for a classification setup. You can provide several facades that set these two values, but then the user must be able to pick the right facade, which also requires an understanding of feature mode and learning mode as implemented in TC.
> It doesn't really serve the user group who wants control over the dimensions
This group would manually define the dimensions anyway, no? Furthermore, the builder expects that the minimal necessary stuff is set, which the advanced user would have to set, too. Anything else can be passed via the additional dimensions method. Which control is missing?
I think that the current facade exposes way too much unnecessary detail.
Regarding classification or regression, this is something the user should know and decide. Also whether you want train/test or CV.
I was thinking more of something very easy like:
TCExp exp = new TCExp(CollReadDesc data, ExpMode expM, MlMode mlMode, MachineLearningLibrary mllib, FeatureSet features)
TCExpResult result = exp.run();
A little bit more optional control could be provided via additional setters, but I was imagining everything pure POJO.
@zesch I didn't look at the actual builder code, but a builder should be much better and self-explanatory than a constructor with many arguments.
new TCExperiment.Builder()
.experimentMode(expM)
.mlMode(mlMode)
.backend(mllib)
.reader(reader)
.features(featureSet)
.build()
.run();
I do not know what a FeatureSet is, but it could possibly be folded in like
new TCExperiment.Builder()
.experimentMode(expM)
.mlMode(mlMode)
.backend(mllib)
.reader(reader)
.addFeature(feature1)
.addFeature(feature2)
.addFeature(feature3)
.build()
.run();
@reckart I like the clear structure of your proposal.
In this style, it is difficult to distinguish mandatory from optional settings. What is best practice? To have meaningful error messages and let the users figure it out?
It probably makes sense to go both ways. @reckart's version would be something offered in dkpro-tc-ml and is directed at the intermediate/advanced user.
A FeatureSet is a group of features; grouping them allows experimenting with feature variations. You can add N feature sets with mostly the same features and see which one works best.
The second version would probably move to its own project, dkpro-tc-simple (other suggestions for the name?), which would then be the single-constructor version. This one would also automatically load a default configuration for a classifier and set reasonable defaults where needed. For the simple version it probably makes sense to assume that there will be only one feature set, where you can call an add() method for single features as suggested by @reckart.
> What is best practice? To have meaningful error messages and let the users figure it out?
Meaningful error messages and sensible defaults. Or one can put the mandatory fields into the builder constructors. Here are a few examples of different styles: https://stackoverflow.com/questions/7302891/the-builder-pattern-and-a-large-number-of-mandatory-parameters
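One of the styles from that thread could be sketched as follows: mandatory values go into the builder's constructor, optional ones get fluent methods, and build() fails fast with a meaningful message. This is a self-contained illustration; names like ExperimentSketch and addFeature are hypothetical, not the actual DKPro TC API.

```java
import java.util.ArrayList;
import java.util.List;

public class ExperimentSketch {

    private final String name;
    private final List<String> features;

    private ExperimentSketch(Builder b) {
        this.name = b.name;
        this.features = b.features;
    }

    public String getName() { return name; }
    public List<String> getFeatures() { return features; }

    public static class Builder {
        private final String name;          // mandatory: enforced by the constructor
        private final List<String> features = new ArrayList<>();

        public Builder(String name) {
            if (name == null || name.isEmpty()) {
                throw new IllegalArgumentException("An experiment name is mandatory");
            }
            this.name = name;
        }

        public Builder addFeature(String feature) {
            features.add(feature);
            return this;
        }

        public ExperimentSketch build() {
            // meaningful error instead of a late NullPointerException somewhere in the run
            if (features.isEmpty()) {
                throw new IllegalStateException(
                        "At least one feature must be configured via addFeature(...)");
            }
            return new ExperimentSketch(this);
        }
    }
}
```

This keeps the fluent style for everything optional while making it impossible to even construct a builder without the mandatory name.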
@Horsmann I'd try not having different builders for different levels of users, just different examples using the same builder targeted at different experience levels. E.g. features from the first example could become a varargs method which would create one feature set and add all the features to it. In addition, you could have featureSets, which would be a varargs method for FeatureSets. Actually, I didn't like addFeature too much because it mixes the "setter" builder style with an "incremental construction" builder style. It is not necessary that all methods offered by a builder make sense in every combination; only the outcome should be well-defined. So e.g. the documentation should state that if multiple features or featureSets calls exist on the builder, only the last one wins -- or define an aggregation strategy, but that would mix setter with incremental style again.
new TCExperiment.Builder()
.experimentMode(expM)
.mlMode(mlMode)
.backend(mllib)
.reader(reader)
.features(feature1, feature2, feature3)
.build()
.run();
new TCExperiment.Builder()
.experimentMode(expM)
.mlMode(mlMode)
.backend(mllib)
.reader(reader)
.featureSets(featureSet1, featureSet2, featureSet3)
.build()
.run();
I added a setter-style builder. The following configures a train-test experiment with two MLBackends using two features (a single feature set):
ExperimentBuilderV2 builder = new ExperimentBuilderV2();
builder.setExperiment(ExperimentType.TRAIN_TEST, "trainTestExperiment", 2)
.setReader(readerTrain, true)
.setReader(readerTest, false)
.setMachineLearningBackend(new MLBackend(new WekaAdapter(), SMO.class.getName()),
new MLBackend(new LibsvmAdapter(), "-s", "1", "-c", "1000"))
.setExperimentPreprocessing(createEngineDescription(BreakIteratorSegmenter.class))
.setFeatureMode(FeatureMode.DOCUMENT)
.setLearningMode(LearningMode.SINGLE_LABEL)
.setFeatures(TcFeatureFactory.create(TokenRatioPerDocument.class),
TcFeatureFactory.create(WordNGram.class,
WordNGram.PARAM_NGRAM_USE_TOP_K, 20,
WordNGram.PARAM_NGRAM_MIN_N, 1,
WordNGram.PARAM_NGRAM_MAX_N, 3)
)
.run();
@reckart Is this what you had in mind?
@zesch
The minimized default configuration is tricky. What a useful default is depends on the learning mode: regression or classification (I would ignore pair entirely here). It would still need access to all ML classifiers, which forces creating a new project, otherwise I get a circular dependency between the projects. I also have to cast-check whether an adapter is Weka, Lib*, etc., otherwise I can't set any default values. This minimized configuration would then use the above builder in the backend but probably offer essentially only a constructor that does a lot of guesswork and uses defaults. As soon as one wants to modify more details, they have to use the normal builder.
@zesch The simplified version would look like this now:
```
TcTrainTestExperiment tte = new TcTrainTestExperiment(readerTrain, readerTest,
LearningMode.SINGLE_LABEL, FeatureMode.DOCUMENT, new WekaAdapter(), tcFeatureSet,
AnalysisEngineFactory.createEngineDescription(BreakIteratorSegmenter.class));
tte.run();
TcCrossValidationExperiment cve = new TcCrossValidationExperiment(2, readerTrain,
LearningMode.SINGLE_LABEL, FeatureMode.DOCUMENT, new WekaAdapter(), tcFeatureSet,
AnalysisEngineFactory.createEngineDescription(BreakIteratorSegmenter.class));
cve.run();
```
You are calling setReader twice - that looks strange. Maybe addDataset? But then you use a varargs method for the ML backends, so setDatasets(reader1, reader2, ...)?
For builders, I prefer a style without "set" - it is a builder, not a bean.
I changed the names.
The CV call looks like this now (using null for the parameter that is not needed).
ExperimentBuilderV2 builder = new ExperimentBuilderV2();
builder.experiment(ExperimentType.CROSS_VALIDATION, "crossValidationExperiment", numberFolds)
.dataReaders(trainReader, null)
.featureSets(featureSet)
.learningMode(lm)
.featureMode(fm)
.machineLearningBackend(getDefault(adapter, lm))
.experimentPreprocessing(preprocessing).run();
This still looks a bit cleaner than varargs, since it's only 1 or 2 readers anyway; varargs would probably add to the confusion?
IMHO the builder should end with build() and the result should be a configuration/task that is then passed to Lab.run(). Would that work?
getDefault looks a bit magic. Also the null argument in dataReaders() makes one question what is going on here.
How about a setter for the number of folds and a default of 10 if it is not called?
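That suggestion could be sketched as follows: the builder holds a fold count that falls back to 10 whenever the setter is never called. The names (CvConfig, numFolds) are illustrative, not the actual DKPro TC API.

```java
public class CvConfig {

    private final int numFolds;

    private CvConfig(int numFolds) { this.numFolds = numFolds; }

    public int getNumFolds() { return numFolds; }

    public static class Builder {
        private int numFolds = 10; // default used when numFolds(...) is never called

        public Builder numFolds(int numFolds) {
            // validate here so a bad value fails at configuration time
            if (numFolds < 2) {
                throw new IllegalArgumentException("Cross-validation needs at least 2 folds");
            }
            this.numFolds = numFolds;
            return this;
        }

        public CvConfig build() { return new CvConfig(numFolds); }
    }
}
```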
Ok, how is this: dataReaderTrain() and dataReaderTest() are separated now, i.e. CV does not need the test reader call.
Newbie code:
TcTrainTestExperiment tte = new TcTrainTestExperiment(readerTrain, readerTest,
LearningMode.SINGLE_LABEL, FeatureMode.DOCUMENT, new WekaAdapter(), tcFeatureSet,
AnalysisEngineFactory.createEngineDescription(BreakIteratorSegmenter.class));
tte.run();
TcCrossValidationExperiment cve = new TcCrossValidationExperiment(2, readerTrain,
LearningMode.SINGLE_LABEL, FeatureMode.DOCUMENT, new WekaAdapter(), tcFeatureSet,
AnalysisEngineFactory.createEngineDescription(BreakIteratorSegmenter.class));
cve.run();
Anyone else:
ExperimentBuilderV2 builder = new ExperimentBuilderV2();
ShallowLearningExperiment_ImplBase experiment = builder.experiment(ExperimentType.TRAIN_TEST, "dummyExperiment")
.dataReaderTrain(readerTrain)
.dataReaderTest(readerTest)
.experimentPreprocessing(AnalysisEngineFactory.createEngineDescription(BreakIteratorSegmenter.class))
.features(TcFeatureFactory.create(TokenRatioPerDocument.class),
TcFeatureFactory.create(WordNGram.class,
WordNGram.PARAM_NGRAM_USE_TOP_K, 20,
WordNGram.PARAM_NGRAM_MIN_N, 1,
WordNGram.PARAM_NGRAM_MAX_N, 3)
)
.learningMode(LearningMode.SINGLE_LABEL)
.featureMode(FeatureMode.DOCUMENT)
.machineLearningBackend(new MLBackend(new LiblinearAdapter()))
.build();
Lab.getInstance().run(experiment);
A builder class that wraps the wiring of the parameter space into a less complex-looking class would ease using DKPro TC for beginners, i.e. hide all the Dimensions with their HashMaps in a builder.
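To illustrate what such a builder hides, here is a minimal self-contained sketch in which the fluent calls are translated internally into the named map entries that a parameter space is wired from, so the user never assembles the HashMaps by hand. All names are hypothetical, not the actual DKPro TC/Lab API.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class ParameterSpaceBuilder {

    // the "dimensions" the user never has to see
    private final Map<String, Object> dimensions = new LinkedHashMap<>();

    public ParameterSpaceBuilder learningMode(String mode) {
        dimensions.put("learningMode", mode);
        return this;
    }

    public ParameterSpaceBuilder featureMode(String mode) {
        dimensions.put("featureMode", mode);
        return this;
    }

    public ParameterSpaceBuilder readers(Object trainReader, Object testReader) {
        // bundle the readers into the map structure the parameter space expects
        Map<String, Object> readerBundle = new HashMap<>();
        readerBundle.put("readerTrain", trainReader);
        readerBundle.put("readerTest", testReader);
        dimensions.put("readers", readerBundle);
        return this;
    }

    // hands back the wired map, i.e. what would become the parameter space
    public Map<String, Object> build() {
        if (!dimensions.containsKey("readers")) {
            throw new IllegalStateException("readers(...) must be configured");
        }
        return dimensions;
    }
}
```

The point is only that the error-prone map wiring lives in one place; the fluent surface stays flat and discoverable.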