ValidityCheck should test if classificator and features are compatible

GoogleCodeExporter commented 9 years ago

The validation of the experiment setup should verify that the features and 
the classifier work together. 

For instance, it is possible to use String features with NaiveBayes. The 
exception that points out this error comes quite late, during classifier 
creation. This is an invalid experimental setup, and the error should be 
thrown much earlier, during the validation step.

Caused by: weka.core.UnsupportedAttributeTypeException: 
weka.classifiers.bayes.NaiveBayes: Cannot handle string attributes!
    at weka.core.Capabilities.test(Capabilities.java:1001)
    at weka.core.Capabilities.test(Capabilities.java:887)
    at weka.core.Capabilities.test(Capabilities.java:1108)
    at weka.core.Capabilities.test(Capabilities.java:1045)
    at weka.core.Capabilities.testWithFail(Capabilities.java:1356)
    at weka.classifiers.bayes.NaiveBayes.buildClassifier(NaiveBayes.java:231)
    at de.tudarmstadt.ukp.dkpro.tc.weka.task.TestTask.execute(TestTask.java:176)
    at de.tudarmstadt.ukp.dkpro.lab.engine.impl.ExecutableTaskEngine.run(ExecutableTaskEngine.java:55)
    ... 9 more

Original issue reported on code.google.com by Tobias.H...@gmail.com on 15 Aug 2014 at 1:44

GoogleCodeExporter commented 9 years ago

I think such validation goes beyond what ValidityCheckTask is intended to do. 
This task verifies whether the input data, the features and the type of 
learning (single- vs. multi-label) produce a valid task setup.
It should not verify whether each individual feature produces an outcome that 
is *compatible* with the specific learning algorithm used, since that is also 
framework-dependent and we don't want dependencies from TC core to the machine 
learning frameworks. 
Maybe we should disable string features altogether. Please re-open if you have 
better ideas.

Original comment by daxenber...@gmail.com on 15 Aug 2014 at 4:21

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

I agree with Johannes.

Just some additional comments:
- We should maybe not disable String features altogether, as the instanceId is 
also one :)
- It would be nice to fail fast on such setups, but as this is a problem that 
occurs deep in the ML framework, I also don't see how we could avoid the late failure.

Original comment by torsten....@gmail.com on 15 Aug 2014 at 6:44
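
For illustration only, here is a minimal sketch of how a fail-fast check could work without making TC core depend on an ML framework: each framework adapter declares which feature types its learner accepts, and the validation step compares that declaration against the configured features. Every name below (`FeatureType`, `LearnerCapabilities`, `FixedCapabilities`, `CompatibilityCheck`) is hypothetical and does not exist in DKPro TC or Weka. The `ignored` parameter is an assumption added to cover the instanceId case mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical types for illustration; not part of DKPro TC or Weka.
enum FeatureType { NUMERIC, NOMINAL, STRING }

/** Implemented by each ML-framework adapter, so the core never imports the framework. */
interface LearnerCapabilities {
    String learnerName();
    Set<FeatureType> supportedFeatureTypes();
}

/** Simple adapter-side declaration with a fixed set of supported feature types. */
final class FixedCapabilities implements LearnerCapabilities {
    private final String name;
    private final Set<FeatureType> supported;

    FixedCapabilities(String name, Set<FeatureType> supported) {
        this.name = name;
        this.supported = supported;
    }

    @Override public String learnerName() { return name; }
    @Override public Set<FeatureType> supportedFeatureTypes() { return supported; }
}

final class CompatibilityCheck {
    private CompatibilityCheck() {}

    /**
     * Returns the names of features whose type the learner does not declare
     * support for, skipping any feature listed in {@code ignored}.
     */
    static List<String> incompatibleFeatures(Map<String, FeatureType> features,
                                             LearnerCapabilities learner,
                                             Set<String> ignored) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, FeatureType> e : features.entrySet()) {
            if (ignored.contains(e.getKey())) {
                continue; // e.g. an ID feature that never reaches the learner
            }
            if (!learner.supportedFeatureTypes().contains(e.getValue())) {
                problems.add(e.getKey());
            }
        }
        return problems;
    }

    /** Fails during validation instead of during classifier creation. */
    static void validate(Map<String, FeatureType> features,
                         LearnerCapabilities learner,
                         Set<String> ignored) {
        List<String> problems = incompatibleFeatures(features, learner, ignored);
        if (!problems.isEmpty()) {
            throw new IllegalStateException(
                learner.learnerName() + " cannot handle features: " + problems);
        }
    }
}
```

With a learner that declares only NUMERIC and NOMINAL support, a STRING feature would be rejected during validation, while a STRING feature named in `ignored` would pass. Whether the adapters can supply such a declaration reliably for every framework is an open question, and that is the concern raised in the comments above.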

ashokpant / dkpro-tc

ValidityCheck should test if classificator and features are compatible #176