Horsmann closed this issue 6 years ago
@zesch there is now a builder class to hide the dimensions a bit. Any thoughts on this ExperimentBuilder? See for instance dkpro-tc/dkpro-tc-examples/src/main/java/org/dkpro/tc/examples/MinimalWorkingExample.java
At least the code looks a bit cleaner without the many dimensions :)
It looks like we only went halfway. It doesn't really serve the user group who wants control over the dimensions, and it also doesn't really serve the beginners who just want to set the relevant stuff.
The minimal example (in my view) should only force the user to set input data, features and classifier (anything else?) and not expose any lab related stuff.
> serve the beginners who just want to set the relevant stuff
There is no way of knowing if an experiment is classification or regression. This has to come from the outside. Once you know that, you can also add a default-configured classifier that works for a classification setup. You can provide several facades that set these two values, but then the user must be able to pick the right facade, which also requires an understanding of feature mode and learning mode as implemented in TC.
> It doesn't really serve the user group who wants control over the dimensions
This group would manually define the dimensions anyway, no? Furthermore, the builder expects that the minimal necessary stuff is set, which the advanced user would have to set, too. Anything else can be passed via the additional dimensions method. Which control is missing?
I think that the current facade exposes way too much unnecessary detail.
Regarding classification or regression, this is something the user should know and decide. Also whether you want train/test or CV.
I was thinking more of something very easy like:
TCExp exp = new TCExp(CollReadDesc data, ExpMode expM, MlMode mlMode, MachineLearningLibrary mllib, FeatureSet features)
TCExpResult result = exp.run();
A little bit more optional control could be provided via additional setters, but I was imagining everything pure POJO.
@zesch I didn't look at the actual builder code, but a builder should be much better and self-explanatory than a constructor with many arguments.
new TCExperiment.Builder()
.experimentMode(expM)
.mlMode(mlMode)
.backend(mllib)
.reader(reader)
.features(featureSet)
.build()
.run();
I do not know what a FeatureSet is, but it could possibly be folded in like
new TCExperiment.Builder()
.experimentMode(expM)
.mlMode(mlMode)
.backend(mllib)
.reader(reader)
.addFeature(feature1)
.addFeature(feature2)
.addFeature(feature3)
.build()
.run();
@reckart I like the clear structure of your proposal.
In this style, it is difficult to distinguish mandatory from optional settings. What is best practice? To have meaningful error messages and let the users figure it out?
It probably makes sense to go both ways. @reckart's version would be something offered in dkpro-tc-ml and is directed at the intermediate/advanced user.
A FeatureSet is a group of features; grouping them allows experimenting with feature variations. You can add N feature sets with mostly the same features and see which one works best.
The second version would probably move to its own project, dkpro-tc-simple (other suggestions for the name?), which would then be the single-constructor version. This one would also automatically load a default configuration for a classifier and set reasonable defaults where needed. For the simple version it probably makes sense to assume that there will be only one feature set, where you can call an add() method for single features as suggested by @reckart.
> What is best practice? To have meaningful error messages and let the users figure it out?
Meaningful error messages and sensible defaults. Or one can put the mandatory fields into the builder constructors. Here are a few examples of different styles: https://stackoverflow.com/questions/7302891/the-builder-pattern-and-a-large-number-of-mandatory-parameters
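One of the styles from that thread could be sketched as follows: mandatory values go into the builder's constructor, optional ones get fluent methods, and build() fails fast with a meaningful message. This is a self-contained illustration; names like ExperimentSketch and addFeature are hypothetical, not the actual DKPro TC API.

```java
import java.util.ArrayList;
import java.util.List;

public class ExperimentSketch {

    private final String name;
    private final List<String> features;

    private ExperimentSketch(Builder b) {
        this.name = b.name;
        this.features = b.features;
    }

    public String getName() { return name; }
    public List<String> getFeatures() { return features; }

    public static class Builder {
        private final String name;          // mandatory: enforced by the constructor
        private final List<String> features = new ArrayList<>();

        public Builder(String name) {
            if (name == null || name.isEmpty()) {
                throw new IllegalArgumentException("An experiment name is mandatory");
            }
            this.name = name;
        }

        public Builder addFeature(String feature) {
            features.add(feature);
            return this;
        }

        public ExperimentSketch build() {
            // meaningful error instead of a late NullPointerException somewhere in the run
            if (features.isEmpty()) {
                throw new IllegalStateException(
                        "At least one feature must be configured via addFeature(...)");
            }
            return new ExperimentSketch(this);
        }
    }
}
```

This keeps the fluent style for everything optional while making it impossible to even construct a builder without the mandatory name.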
@Horsmann I'd try not having different builders for different levels of users, just different examples using the same builder targeted at different experience levels. E.g. features from the first example could become a varargs method which would create one feature set and add all the features to it. In addition, you could have featureSets, which would be a varargs method for FeatureSets. Actually, I didn't like addFeature too much because it mixes the "setter" builder style with an "incremental construction" builder style. It is not necessary that all methods offered by a builder make sense in every combination; only the outcome should be well-defined. So e.g. the documentation should state that if multiple features or featureSets calls exist on the builder, only the last one wins -- or define an aggregation strategy, but that would mix setter with incremental style again.
new TCExperiment.Builder()
.experimentMode(expM)
.mlMode(mlMode)
.backend(mllib)
.reader(reader)
.features(feature1, feature2, feature3)
.build()
.run();
new TCExperiment.Builder()
.experimentMode(expM)
.mlMode(mlMode)
.backend(mllib)
.reader(reader)
.featureSets(featureSet1, featureSet2, featureSet3)
.build()
.run();
I added a setter-style builder. The following configures a train-test experiment with two MLBackends using two features (a single feature set):
ExperimentBuilderV2 builder = new ExperimentBuilderV2();
builder.setExperiment(ExperimentType.TRAIN_TEST, "trainTestExperiment", 2)
.setReader(readerTrain, true)
.setReader(readerTest, false)
.setMachineLearningBackend(new MLBackend(new WekaAdapter(), SMO.class.getName()),
new MLBackend(new LibsvmAdapter(), "-s", "1", "-c", "1000"))
.setExperimentPreprocessing(createEngineDescription(BreakIteratorSegmenter.class))
.setFeatureMode(FeatureMode.DOCUMENT)
.setLearningMode(LearningMode.SINGLE_LABEL)
.setFeatures(TcFeatureFactory.create(TokenRatioPerDocument.class),
TcFeatureFactory.create(WordNGram.class,
WordNGram.PARAM_NGRAM_USE_TOP_K, 20,
WordNGram.PARAM_NGRAM_MIN_N, 1,
WordNGram.PARAM_NGRAM_MAX_N, 3)
)
.run();
@reckart Is this what you had in mind?
@zesch
The minimized default configuration is tricky. What a useful default is depends on the learning mode: regression or classification (I would ignore pair entirely here). It would still need access to all ML classifiers, which forces creating a new project, otherwise I get a circular dependency between the projects. I also have to cast-check whether an adapter is Weka, Lib*, etc., otherwise I can't set any default values. This minimized configuration would then use the above builder in the backend but probably offer essentially only a constructor that does a lot of guesswork and uses defaults. As soon as one wants to modify more details, they have to use the normal builder.
@zesch The simplified version would look like this now:
```
TcTrainTestExperiment tte = new TcTrainTestExperiment(readerTrain, readerTest,
LearningMode.SINGLE_LABEL, FeatureMode.DOCUMENT, new WekaAdapter(), tcFeatureSet,
AnalysisEngineFactory.createEngineDescription(BreakIteratorSegmenter.class));
tte.run();
TcCrossValidationExperiment cve = new TcCrossValidationExperiment(2, readerTrain,
LearningMode.SINGLE_LABEL, FeatureMode.DOCUMENT, new WekaAdapter(), tcFeatureSet,
AnalysisEngineFactory.createEngineDescription(BreakIteratorSegmenter.class));
cve.run();
```
You are calling setReader twice - that looks strange. Maybe addDataset? But then you use a varargs method for the ML backends, so setDatasets(reader1, reader2, ...)?
For builders, I prefer a style without "set" - it is a builder, not a bean.
I changed the names.
The CV call looks like this now (using null for the parameter that is not needed).
ExperimentBuilderV2 builder = new ExperimentBuilderV2();
builder.experiment(ExperimentType.CROSS_VALIDATION, "crossValidationExperiment", numberFolds)
.dataReaders(trainReader, null)
.featureSets(featureSet)
.learningMode(lm)
.featureMode(fm)
.machineLearningBackend(getDefault(adapter, lm))
.experimentPreprocessing(preprocessing).run();
This still looks a bit cleaner than varargs, since it's only 1 or 2 readers anyway; varargs would probably add to the confusion?
IMHO the builder should end with build() and the result should be a configuration/task that is then passed to Lab.run(). Would that work?
getDefault looks a bit magic. Also the null argument in dataReaders() makes one question what is going on here.
How about a setter for the number of folds and a default of 10 if it is not called?
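That suggestion could be sketched as follows: the builder holds a fold count that falls back to 10 whenever the setter is never called. The names (CvConfig, numFolds) are illustrative, not the actual DKPro TC API.

```java
public class CvConfig {

    private final int numFolds;

    private CvConfig(int numFolds) { this.numFolds = numFolds; }

    public int getNumFolds() { return numFolds; }

    public static class Builder {
        private int numFolds = 10; // default used when numFolds(...) is never called

        public Builder numFolds(int numFolds) {
            // validate here so a bad value fails at configuration time
            if (numFolds < 2) {
                throw new IllegalArgumentException("Cross-validation needs at least 2 folds");
            }
            this.numFolds = numFolds;
            return this;
        }

        public CvConfig build() { return new CvConfig(numFolds); }
    }
}
```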
Ok, how is this: dataReaderTrain() and dataReaderTest() are separated now, i.e. CV does not need the test reader call.
Newbie code:
TcTrainTestExperiment tte = new TcTrainTestExperiment(readerTrain, readerTest,
LearningMode.SINGLE_LABEL, FeatureMode.DOCUMENT, new WekaAdapter(), tcFeatureSet,
AnalysisEngineFactory.createEngineDescription(BreakIteratorSegmenter.class));
tte.run();
TcCrossValidationExperiment cve = new TcCrossValidationExperiment(2, readerTrain,
LearningMode.SINGLE_LABEL, FeatureMode.DOCUMENT, new WekaAdapter(), tcFeatureSet,
AnalysisEngineFactory.createEngineDescription(BreakIteratorSegmenter.class));
cve.run();
Anyone else:
ExperimentBuilderV2 builder = new ExperimentBuilderV2();
ShallowLearningExperiment_ImplBase experiment = builder.experiment(ExperimentType.TRAIN_TEST, "dummyExperiment")
.dataReaderTrain(readerTrain)
.dataReaderTest(readerTest)
.experimentPreprocessing(AnalysisEngineFactory.createEngineDescription(BreakIteratorSegmenter.class))
.features(TcFeatureFactory.create(TokenRatioPerDocument.class),
TcFeatureFactory.create(WordNGram.class,
WordNGram.PARAM_NGRAM_USE_TOP_K, 20,
WordNGram.PARAM_NGRAM_MIN_N, 1,
WordNGram.PARAM_NGRAM_MAX_N, 3)
)
.learningMode(LearningMode.SINGLE_LABEL)
.featureMode(FeatureMode.DOCUMENT)
.machineLearningBackend(new MLBackend(new LiblinearAdapter()))
.build();
Lab.getInstance().run(experiment);
A builder class that wraps the wiring of the parameter space into a less complex-looking class would ease using DKPro TC for beginners, i.e. hide all the Dimensions with their HashMaps in a builder.
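To illustrate what such a builder hides, here is a minimal self-contained sketch in which the fluent calls are translated internally into the named map entries that a parameter space is wired from, so the user never assembles the HashMaps by hand. All names are hypothetical, not the actual DKPro TC/Lab API.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class ParameterSpaceBuilder {

    // the "dimensions" the user never has to see
    private final Map<String, Object> dimensions = new LinkedHashMap<>();

    public ParameterSpaceBuilder learningMode(String mode) {
        dimensions.put("learningMode", mode);
        return this;
    }

    public ParameterSpaceBuilder featureMode(String mode) {
        dimensions.put("featureMode", mode);
        return this;
    }

    public ParameterSpaceBuilder readers(Object trainReader, Object testReader) {
        // bundle the readers into the map structure the parameter space expects
        Map<String, Object> readerBundle = new HashMap<>();
        readerBundle.put("readerTrain", trainReader);
        readerBundle.put("readerTest", testReader);
        dimensions.put("readers", readerBundle);
        return this;
    }

    // hands back the wired map, i.e. what would become the parameter space
    public Map<String, Object> build() {
        if (!dimensions.containsKey("readers")) {
            throw new IllegalStateException("readers(...) must be configured");
        }
        return dimensions;
    }
}
```

The point is only that the error-prone map wiring lives in one place; the fluent surface stays flat and discoverable.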