Closed johann-petrak closed 5 years ago
When exporting the features (names only, from the ARFF header), they are exactly identical.
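For anyone wanting to repeat the header check, here is a minimal sketch of how the feature names from two ARFF exports could be compared regardless of declaration order. The function names are hypothetical and not part of any tool mentioned here. It assumes plain `@attribute` lines and does not handle escaped quotes inside attribute names.

```python
import re

# Matches "@attribute <name> ...", where <name> may be single-quoted,
# double-quoted, or a bare token without whitespace.
_ATTRIBUTE = re.compile(r'''@attribute\s+('[^']*'|"[^"]*"|\S+)''', re.IGNORECASE)


def arff_attribute_names(arff_text):
    """Return the attribute names declared in an ARFF header, in file order."""
    names = []
    for line in arff_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("@data"):
            break  # the header ends where the data section starts
        match = _ATTRIBUTE.match(stripped)
        if match:
            names.append(match.group(1).strip("'\""))
    return names


def same_feature_names(arff_a, arff_b):
    """True if both headers declare the same names, ignoring their order."""
    return sorted(arff_attribute_names(arff_a)) == sorted(arff_attribute_names(arff_b))
```

Sorting both lists before comparing means a difference in declaration order alone is not reported as a difference in features.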
The data itself cannot be compared directly, because the dimension indices of the sparse vectors differ depending on the order in which the alphabet gets created.
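One way around the index problem would be to re-key each sparse vector by feature name before comparing. The sketch below assumes each run is available as a list of `{index: value}` dicts plus an alphabet list mapping index to feature name. That representation and the helper names are illustrative assumptions, not the actual export format.

```python
from collections import Counter


def to_named(sparse_vector, alphabet):
    """Re-key a sparse vector {index: value} by feature name via the alphabet."""
    return {alphabet[index]: value for index, value in sparse_vector.items()}


def same_instances(vectors_a, alphabet_a, vectors_b, alphabet_b):
    """Compare two runs as multisets of name-keyed vectors.

    Neither the order of the instances nor the order of the alphabet
    entries affects the result.
    """
    def keys(vectors, alphabet):
        # frozenset makes each name-keyed vector hashable so it can be counted
        return Counter(frozenset(to_named(v, alphabet).items()) for v in vectors)

    return keys(vectors_a, alphabet_a) == keys(vectors_b, alphabet_b)


# Two runs with the same data but differently ordered alphabets and instances.
alphabet_a = ["x", "y", "z"]
alphabet_b = ["z", "x", "y"]
run_a = [{0: 1.0, 2: 2.0}, {1: 1.0}]
run_b = [{2: 1.0}, {1: 1.0, 0: 2.0}]
print(same_instances(run_a, alphabet_a, run_b, alphabet_b))
```

Values are compared exactly here; a tolerance would be needed if the two exports could differ in rounding.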
Comparing the data from a tiny corpus in Weka shows that all features are identical as well, so this may be the result of how the maxent optimization depends on the order of instances. Closing for now.
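As a generic illustration of how instance order alone can change a numeric result, the sketch below accumulates the same three values in two different orders. It only shows that floating-point addition is order-sensitive. Whether this is what actually happens inside the maxent optimization is an assumption and has not been verified.

```python
def accumulate(values):
    """Add values left to right, the way a per-instance accumulation loop would."""
    total = 0.0
    for value in values:
        total += value
    return total


forward = accumulate([0.1, 0.2, 0.3])
backward = accumulate([0.3, 0.2, 0.1])

# Same values, different order: the totals are not bit-identical.
print(forward == backward)
print(abs(forward - backward))
```

A difference this small in a single sum is negligible on its own, but an iterative optimizer repeats such accumulations many times, so the final weights can end up slightly different.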
We get a slight difference in the model depending on how many duplicates are created. This is observed on sentence classification using the Mallet MaxEnt algorithm.
The number of documents and the number of dimensions are identical, but the model accuracy on a development set differs by a few thousandths (on the order of 0.001).