Waikato / meka

Multi-label classifiers and evaluation procedures using the Weka machine learning framework.
http://waikato.github.io/meka/
GNU General Public License v3.0

Prediction speed scales with training data size rather than output size #79

Open davidfstein opened 1 year ago

davidfstein commented 1 year ago

I am running some experiments with the Mulan wrapper. Particularly, I added the COCOA method from that repository and am running the following for training:

java -cp "~/bin/meka-release-1.9.8-SNAPSHOT/lib/*" meka.classifiers.multilabel.MULAN -S COCOA -verbosity 8 -split-percentage 100 -t "train.arff" -d "clf.dmp" -W weka.classifiers.trees.J48

and for inference:

java -cp "~/bin/meka-release-1.9.8-SNAPSHOT/lib/*" meka.classifiers.multilabel.MULAN -S COCOA -verbosity 8 -t "train.arff" -T "test.arff" -l "clf.dmp" -W weka.classifiers.trees.J48

Notably, training time increases moderately but reasonably as "train.arff" grows. However, with a fixed "test.arff" size, inference time scales exponentially with "train.arff" size. It seems almost as if training is not actually occurring during the first command but rather during the second. My Java is very rusty, so perhaps that is indeed what is happening. Is this the expected behavior?

fracpete commented 1 year ago

I just submitted a fix (https://github.com/Waikato/meka/commit/0608eeffd56cbb109902719515b632559e21a6c7) that allows you to evaluate a previously trained model on a test set. This wasn't possible before: the model was always retrained with the training data.

With the latest snapshot, you would use something like this:

java -cp "~/bin/meka-release-1.9.8-SNAPSHOT/lib/*" meka.classifiers.multilabel.MULAN -S COCOA -verbosity 8 -threshold 1 -T "test.arff" -l "clf.dmp"

davidfstein commented 1 year ago

Thanks for the quick fix!

I rebuilt from master, but I'm running into this error now:

java.lang.ArrayIndexOutOfBoundsException: Index 1341 out of bounds for length 1341
    at weka.core.DenseInstance.value(DenseInstance.java:347)
    at mulan.transformations.BinaryRelevanceTransformation.transformInstance(BinaryRelevanceTransformation.java:126)
    at mulan.classifier.transformation.BinaryRelevance.makePredictionInternal(BinaryRelevance.java:83)
    at mulan.classifier.MultiLabelLearnerBase.makePrediction(MultiLabelLearnerBase.java:113)
    at mulan.classifier.transformation.COCOA.makePredictionforThreshold(COCOA.java:305)
    at mulan.classifier.transformation.COCOA.makePredictionInternal(COCOA.java:324)
    at mulan.classifier.MultiLabelLearnerBase.makePrediction(MultiLabelLearnerBase.java:113)
    at meka.classifiers.multilabel.MULAN.distributionForInstance(MULAN.java:263)
    at meka.classifiers.multilabel.Evaluation.testClassifier(Evaluation.java:617)
    at meka.classifiers.multilabel.Evaluation.evaluateModel(Evaluation.java:419)
    at meka.classifiers.multilabel.Evaluation.runExperiment(Evaluation.java:301)
    at meka.classifiers.multilabel.ProblemTransformationMethod.runClassifier(ProblemTransformationMethod.java:172)
    at meka.classifiers.multilabel.ProblemTransformationMethod.evaluation(ProblemTransformationMethod.java:152)
    at meka.classifiers.multilabel.MULAN.main(MULAN.java:273)

fracpete commented 1 year ago

Please provide a minimal example that replicates this problem.