Allow to limit maximum sequence length for ngram features

GateNLP / gateplugin-LearningFramework

A plugin for the GATE language technology framework for training and using machine learning models. Currently supports Mallet (MaxEnt, NaiveBayes, CRF and others), LibSVM, Scikit-Learn, Weka, and DNNs through Pytorch and Keras.

https://gatenlp.github.io/gateplugin-LearningFramework/

GNU Lesser General Public License v2.1


Allow to limit maximum sequence length for ngram features #110

Closed: johann-petrak closed this issue 5 years ago

johann-petrak commented 5 years ago

Limiting the sequence length can be crucial when we use the deep learning backend. Ideally it should be possible to set a limit in the feature specification (which reduces the initial dataset size) and then tighten it further in the PyTorch backend through a parameter (for further experimentation).
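A minimal sketch of the two-stage limit described above, in plain Python. The names `truncate_sequence`, `SPEC_MAX_LEN` and `BACKEND_MAX_LEN` are hypothetical and are not part of the LearningFramework or its PyTorch backend. Keeping the leading items is also an assumption, since the issue does not say which end of the sequence should be cut.

```python
from typing import List, Optional, Sequence


def truncate_sequence(tokens: Sequence[str], max_len: Optional[int]) -> List[str]:
    """Keep at most max_len leading items; None or a non-positive value means no limit."""
    if max_len is None or max_len <= 0:
        return list(tokens)
    return list(tokens[:max_len])


# Hypothetical limits: the first would come from the feature specification,
# the second from a backend parameter used for further experiments.
SPEC_MAX_LEN = 5
BACKEND_MAX_LEN = 3

ngram = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

# Stage 1: shorten when the dataset is written, which shrinks the stored data.
stored = truncate_sequence(ngram, SPEC_MAX_LEN)

# Stage 2: shorten again at training time without regenerating the dataset.
used = truncate_sequence(stored, BACKEND_MAX_LEN)
```

With this split, the backend limit can only tighten the limit already applied to the stored data; relaxing it would require exporting the dataset again.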

For single feature datasets, sorting the training set by sequence length would be a good alternative to avoid excessive padding.
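To illustrate why sorting helps, here is a small sketch in plain Python that pads each batch to its longest member and counts the padding cells with and without sorting. The helper names `padded_batches` and `padding_cells` are hypothetical, and the batching scheme is an assumption made for illustration, not what the backend actually does.

```python
from typing import List, Sequence


def padded_batches(seqs: Sequence[Sequence[int]], batch_size: int,
                   pad: int = 0, sort_by_length: bool = True) -> List[List[List[int]]]:
    """Split seqs into batches and pad every sequence to the longest one in its batch."""
    order = sorted(seqs, key=len) if sort_by_length else list(seqs)
    batches = []
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        width = max(len(s) for s in chunk)
        batches.append([list(s) + [pad] * (width - len(s)) for s in chunk])
    return batches


def padding_cells(seqs: Sequence[Sequence[int]], batches: List[List[List[int]]]) -> int:
    """Number of cells in the batches that are padding rather than data."""
    total = sum(len(row) for batch in batches for row in batch)
    return total - sum(len(s) for s in seqs)


# Four sequences of lengths 1, 9, 2 and 8, batched two at a time.
seqs = [[7] * 1, [7] * 9, [7] * 2, [7] * 8]

unsorted_pad = padding_cells(seqs, padded_batches(seqs, 2, sort_by_length=False))
sorted_pad = padding_cells(seqs, padded_batches(seqs, 2, sort_by_length=True))
```

In this example the unsorted batches need 14 padding cells and the sorted ones need 2. Sorting changes the order in which examples are seen, so shuffling at the batch level may still be desirable during training.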

See https://github.com/GateNLP/gate-lf-python-data/issues/23

johann-petrak commented 5 years ago

See https://github.com/GateNLP/gate-lf-python-data/issues/23. This has now been implemented so that an initial setting can be added to the feature specification file. However, the LF completely ignores this setting for sparse representations and does not shorten anything itself for dense representations.