Waikato / wekaDeeplearning4j

Weka package for the Deeplearning4j java library
https://deeplearning.cms.waikato.ac.nz/
GNU General Public License v3.0

Bug: cannot run textembeddings in 1.5.14 #58

Closed kyllohd closed 4 years ago

kyllohd commented 4 years ago

Describe the bug

After generating word embeddings with Weka 3.8 (weka.filters.unsupervised.attribute.Dl4jStringToWord2Vec), I tried to use these embeddings in Meka 1.9.2, but the evaluation fails.

To Reproduce

  1. Generate word embeddings from text in Weka 3.8
  2. Load these embeddings to run a classification in Meka 1.9.2
  3. Select the embeddings path
  4. See error

Expected behavior

You should be able to use Dl4jMlpClassifier with the Binary Relevance method in Meka, using embeddings generated in Weka.

Additional Information

Error

[INFO ] 16:21:57.131 [Thread-6] weka.classifiers.functions.Dl4jMlpClassifier - Building on 6296 training instances
meka.gui.explorer.ClassifyTab
Evaluation failed (train/test split):
weka.core.InvalidInputDataException: An ARFF is required with a string attribute and a class attribute
    at weka.dl4j.iterators.instance.sequence.text.rnn.RnnTextEmbeddingInstanceIterator.validate(RnnTextEmbeddingInstanceIterator.java:57)
    at weka.dl4j.iterators.instance.sequence.text.rnn.RnnTextEmbeddingInstanceIterator.getDataSetIterator(RnnTextEmbeddingInstanceIterator.java:80)
    at weka.dl4j.iterators.instance.AbstractInstanceIterator.getDataSetIterator(AbstractInstanceIterator.java:59)
    at weka.classifiers.functions.Dl4jMlpClassifier.getDataSetIterator(Dl4jMlpClassifier.java:1069)
    at weka.classifiers.functions.Dl4jMlpClassifier.getDataSetIterator(Dl4jMlpClassifier.java:1121)
    at weka.classifiers.functions.Dl4jMlpClassifier.getFirstBatchFeatures(Dl4jMlpClassifier.java:1449)
    at weka.classifiers.functions.Dl4jMlpClassifier.createModel(Dl4jMlpClassifier.java:1294)
    at weka.classifiers.functions.Dl4jMlpClassifier.finishClassifierInitialization(Dl4jMlpClassifier.java:957)
    at weka.classifiers.functions.Dl4jMlpClassifier.initializeClassifier(Dl4jMlpClassifier.java:899)
    at weka.classifiers.functions.Dl4jMlpClassifier.buildClassifier(Dl4jMlpClassifier.java:816)
    at meka.classifiers.multilabel.BR.buildClassifier(BR.java:75)
    at meka.classifiers.multilabel.Evaluation.evaluateModel(Evaluation.java:428)
    at meka.classifiers.multilabel.Evaluation.evaluateModel(Evaluation.java:326)
    at meka.gui.explorer.ClassifyTab$7.run(ClassifyTab.java:414)
    at java.lang.Thread.run(Unknown Source)
    at meka.gui.explorer.AbstractThreadedExplorerTab$WorkerThread.run(AbstractThreadedExplorerTab.java:78)
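The exception message says the iterator expects an ARFF with a string attribute and a class attribute. As a rough pre-check before loading a file, the header can be scanned for a string attribute plus at least one other attribute that could serve as the class. The sketch below uses only the Java standard library. `ArffHeaderCheck` and its methods are hypothetical names made up for this illustration and are not part of Weka, Meka, or wekaDeeplearning4j. The real `validate` method may apply stricter rules, for example on which attribute is set as the class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Weka, Meka, or wekaDeeplearning4j.
final class ArffHeaderCheck {

    private static final String KEYWORD = "@attribute";

    // Collects the lower-cased type declaration of each @attribute line in the header.
    // Simplification: escaped quotes inside quoted attribute names are not handled.
    static List<String> attributeTypes(String arffText) {
        List<String> types = new ArrayList<>();
        for (String raw : arffText.split("\\R")) {
            String line = raw.trim();
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith("@data")) {
                break; // the header ends where the data section starts
            }
            if (!lower.startsWith(KEYWORD)) {
                continue; // comments, @relation, blank lines
            }
            String rest = line.substring(KEYWORD.length()).trim();
            if (rest.isEmpty()) {
                continue; // malformed declaration without a name
            }
            int nameEnd;
            char first = rest.charAt(0);
            if (first == '\'' || first == '"') {
                int close = rest.indexOf(first, 1); // closing quote of the name
                if (close < 0) {
                    continue; // malformed quoted name
                }
                nameEnd = close + 1;
            } else {
                nameEnd = 0;
                while (nameEnd < rest.length()
                        && !Character.isWhitespace(rest.charAt(nameEnd))) {
                    nameEnd++;
                }
            }
            String type = rest.substring(nameEnd).trim().toLowerCase(Locale.ROOT);
            if (!type.isEmpty()) {
                types.add(type);
            }
        }
        return types;
    }

    // Assumption: "a string attribute and a class attribute" is approximated as
    // at least one string attribute plus at least one non-string attribute.
    static boolean hasStringAndClassCandidate(String arffText) {
        boolean hasString = false;
        boolean hasOther = false;
        for (String type : attributeTypes(arffText)) {
            if (type.equals("string")) {
                hasString = true;
            } else {
                hasOther = true;
            }
        }
        return hasString && hasOther;
    }

    public static void main(String[] args) {
        // Made-up header for demonstration only.
        String header = "@relation demo\n"
                + "@attribute text string\n"
                + "@attribute label {pos,neg}\n"
                + "@data\n";
        System.out.println(hasStringAndClassCandidate(header));
    }
}
```

A check like this only inspects the file header. It cannot tell which attribute Weka or Meka will treat as the class at runtime, so a passing result does not guarantee that the iterator accepts the data.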

kyllohd commented 4 years ago

I first ran into this in Meka, then switched to Weka 3.8.3 to test it standalone. The issue is in the text embedding iterators (I tried both the CNN and the RNN embedding iterator).

Here are 3 files:

tmdb-embeddings.arff - the file I was trying to use as embeddings in weka.classifiers.functions.Dl4jMlpClassifier
tmdb_dummytrain.arff - training dataset
tmdb_dummytest.arff - test set

https://www.mediafire.com/file/phq192elgynm7sz/embeddings.zip/file

Steps to reproduce (cnn):

  1. Open Weka 3.8.3
  2. Open Explorer
  3. Load tmdb_dummytrain.arff
  4. Select the Classify tab
  5. Select weka.classifiers.functions.Dl4jMlpClassifier
  6. Select simplecnn as the zoo model.
  7. For the iterator, choose sequence > text > CnnTextEmbeddingInstanceIterator.
  8. Select the tmdb-embeddings.arff file as the word vector location.

Actual results: You can't even start the classification.


Steps to reproduce (rnn):

  1. Open Weka 3.8.3
  2. Open Explorer
  3. Load tmdb_dummytrain.arff
  4. Select the Classify tab
  5. Select weka.classifiers.functions.Dl4jMlpClassifier
  6. Select simplecnn as the zoo model, then for the iterator choose sequence > text > RnnTextEmbeddingInstanceIterator.
  7. Click on RnnTextEmbeddingInstanceIterator and select the tmdb-embeddings.arff file.
  8. Close the settings window
  9. Click to start the classification

Actual results: You can start the classification, but you get an error saying it cannot handle string attributes:

08:33:34: Started weka.classifiers.functions.Dl4jMlpClassifier
08:33:34: Command: weka.classifiers.functions.Dl4jMlpClassifier -S 0 -cache-mode MEMORY -early-stopping "weka.dl4j.earlystopping.EarlyStopping -maxEpochsNoImprovement 0 -valPercentage 0.0" -normalization "Standardize training data" -iterator "weka.dl4j.iterators.instance.sequence.text.rnn.RnnTextEmbeddingInstanceIterator -stopWords \"weka.dl4j.text.stopwords.Dl4jRainbow \" -tokenPreProcessor \"weka.dl4j.text.tokenization.preprocessor.CommonPreProcessor \" -tokenizerFactory \"weka.dl4j.text.tokenization.tokenizer.factory.DefaultTokenizerFactory \" -truncationLength 100 -wordVectorLocation E:\mestrado\bases\embeddings\tmdb-embeddings.arff -bs 1" -iteration-listener "weka.dl4j.listener.EpochListener -eval true -n 5" -layer "weka.dl4j.layers.BatchNormalization -beta 0.0 -decay 0.9 -eps 1.0E-5 -gamma 1.0 -beta false -nOut 0 -activation \"weka.dl4j.activations.ActivationIdentity \" -name \"Batch normalization layer\"" -layer "weka.dl4j.layers.DenseLayer -nOut 0 -activation \"weka.dl4j.activations.ActivationReLU \" -name \"Dense layer\"" -layer "weka.dl4j.layers.SubsamplingLayer -mode Same -eps 1.0E-8 -rows 2 -columns 2 -paddingColumns 0 -paddingRows 0 -pnorm 0 -poolingType MAX -strideColumns 2 -strideRows 2 -name maxpool1" -layer "weka.dl4j.layers.DenseLayer -nOut 500 -activation \"weka.dl4j.activations.ActivationReLU \" -name ffn1" -layer "weka.dl4j.layers.RnnOutputLayer -lossFn \"weka.dl4j.lossfunctions.LossMCXENT \" -nOut 2 -activation \"weka.dl4j.activations.ActivationSoftmax \" -name \"RnnOutput layer\"" -logConfig "weka.core.LogConfiguration -append true -dl4jLogLevel WARN -logFile C:\Users\mansu\wekafiles\wekaDeeplearning4j.log -nd4jLogLevel INFO -wekaDl4jLogLevel INFO" -config "weka.dl4j.NeuralNetConfiguration -biasInit 0.0 -biasUpdater \"weka.dl4j.updater.Sgd -lr 0.001 -lrSchedule \\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\"\" -dist \"weka.dl4j.distribution.Disabled \" -dropout \"weka.dl4j.dropout.Disabled \" -gradientNormalization None -gradNormThreshold 1.0 -l1 NaN -l2 NaN -minimize -algorithm STOCHASTIC_GRADIENT_DESCENT -updater \"weka.dl4j.updater.Adam -beta1MeanDecay 0.9 -beta2VarDecay 0.999 -epsilon 1.0E-8 -lr 0.001 -lrSchedule \\"weka.dl4j.schedules.ConstantSchedule -scheduleType EPOCH\\"\" -weightInit XAVIER -weightNoise \"weka.dl4j.weightnoise.Disabled \"" -numEpochs 10 -numGPUs 1 -averagingFrequency 10 -prefetchSize 24 -queueSize 0 -zooModel "weka.dl4j.zoo.CustomNet " -output-debug-info -num-decimal-places 4
08:33:34: weka.classifiers.functions.Dl4jMlpClassifier: Cannot handle string attributes!

zahrashuaib commented 4 years ago

Hi, I tried to use RnnTextEmbeddingInstanceIterator for the IMDB dataset, but it threw the error "Dl4jMlpClassifier: Cannot handle string attributes!" Any help, please?

braun-steven commented 4 years ago

@kyllohd Sorry for the late reply. You are using the wrong model: Please choose RnnSequenceClassifier instead of Dl4jMlpClassifier.

@zahrashuaib The same goes for you.