Analyse differences in trained model depending on number of duplicates

GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through Pytorch and Keras.

https://gatenlp.github.io/gateplugin-LearningFramework/

GNU Lesser General Public License v2.1


Analyse differences in trained model depending on number of duplicates #108

Closed: johann-petrak closed this issue 5 years ago

johann-petrak commented 5 years ago

We get a slight difference in the trained model depending on how many duplicates are created. This was observed on sentence classification using the Mallet MaxEnt algorithm.

The number of documents and the number of dimensions are identical, but the model accuracy on a development set differs by a few thousandths (on the order of 0.001).

johann-petrak commented 5 years ago

When exporting the features (names only, taken from the ARFF header), they are absolutely identical.

johann-petrak commented 5 years ago

The data itself is impossible to compare directly, since the dimension index assigned to each feature in the sparse vectors differs depending on the order in which the alphabet gets created.
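One way around the index problem is to compare the instances by feature name rather than by dimension index. The following is a minimal sketch of that idea, not code from the plugin: it assumes each export is available as a list of sparse vectors (`{index: value}` dicts) plus an alphabet (a list mapping index to feature name). The function names `to_named` and `same_dataset` and the toy data are hypothetical.

```python
from collections import Counter

def to_named(instance, alphabet):
    """Re-key a sparse vector {index: value} by feature name, in sorted order."""
    return tuple(sorted((alphabet[i], v) for i, v in instance.items()))

def same_dataset(data_a, alphabet_a, data_b, alphabet_b):
    """Compare two exports as multisets of name-keyed instances.

    Both the dimension indices and the order of instances are ignored.
    """
    named_a = Counter(to_named(x, alphabet_a) for x in data_a)
    named_b = Counter(to_named(x, alphabet_b) for x in data_b)
    return named_a == named_b

# Hypothetical toy data: the same two instances, but the alphabets
# were built in different orders, so the indices do not line up.
alphabet_a = ["word=good", "word=bad", "len"]
alphabet_b = ["len", "word=bad", "word=good"]
data_a = [{0: 1.0, 2: 5.0}, {1: 1.0, 2: 3.0}]
data_b = [{1: 1.0, 0: 3.0}, {0: 5.0, 2: 1.0}]

print(data_a == data_b)                                      # raw comparison
print(same_dataset(data_a, alphabet_a, data_b, alphabet_b))  # name-based comparison
```

The raw comparison fails even though the content is the same, while the name-based comparison treats the two exports as equal. Using a multiset (`Counter`) keeps repeated instances significant while ignoring the order in which they were written.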

johann-petrak commented 5 years ago

Comparing the data from a tiny corpus in Weka shows that all features are identical as well, so the difference may result from the MaxEnt optimization depending on the order of instances. Closing for now.
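One possible mechanism for such an order dependence, which has not been verified for Mallet in this issue, is that floating-point addition is not associative: accumulating the same per-instance contributions in a different order can give a slightly different total. The sketch below only illustrates that general effect with made-up numbers; the `total` helper and the `contributions` list are hypothetical and are not part of Mallet or the plugin.

```python
def total(values):
    """Accumulate values left to right, the way a simple loop over instances would."""
    s = 0.0
    for v in values:
        s += v
    return s

# Hypothetical per-instance contributions to a single gradient component.
contributions = [0.1, 0.2, 0.3]

forward = total(contributions)             # instances in one order
backward = total(reversed(contributions))  # same instances, reversed order

print(forward == backward)      # the two sums are not bit-identical
print(abs(forward - backward))  # the gap is on the order of machine epsilon
```

A gap this small is far below the observed accuracy difference, but an iterative optimizer could in principle amplify it over many steps until a few borderline predictions flip. This is an assumption about the cause, not a confirmed diagnosis.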