dkpro / dkpro-tc

UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.
https://dkpro.github.io/dkpro-tc/

Numeric stability of floating-point features #436

Closed Horsmann closed 6 years ago

Horsmann commented 6 years ago

All numeric feature values should lie in the range 0..1 or -1..1, especially for SVMs. Most features are boolean and therefore already fall within the former range, but length/count features (document length, sentence length, number of tokens, etc.) are currently not normalized. This might lead to numeric instability when a few (length) features output large numbers while all other feature values are in the range 0..1.
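To make the scaling concrete, here is a minimal sketch of min-max scaling for a single count feature. The class name `MinMaxScaler` and its method are hypothetical and not part of DKPro TC; the sketch also assumes the minimum and maximum have already been observed on the training data.

```java
/**
 * Hypothetical helper (not a DKPro TC class): maps a raw count such as
 * "number of tokens" into the range [0, 1].
 */
final class MinMaxScaler {

    private MinMaxScaler() {
        // static helper, no instances
    }

    /**
     * Scales value into [0, 1] using the min and max observed on training data.
     * Values outside [min, max], e.g. from unseen test documents, are clamped.
     */
    static double scale(double value, double min, double max) {
        if (max <= min) {
            // Constant (or unobserved) feature: carries no information, map to 0.
            return 0.0;
        }
        double clamped = Math.max(min, Math.min(max, value));
        return (clamped - min) / (max - min);
    }
}
```

With an observed range of 0 to 100 tokens, a document of 50 tokens is mapped to 0.5, and a longer, unseen document of 150 tokens is clamped to 1.0.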

cf. https://stackoverflow.com/questions/15436367/svm-scaling-input-values

Two todos arise:

Horsmann commented 6 years ago

The problem is not so easy to solve. Document-relative feature extractors already do this, for instance the number of tokens per sentence relative to all sentences within the same document/CAS. It would probably still be better to normalize globally over all documents.

At the moment there is no way to know the maximum value of a feature, or whether it is numeric at all (i.e. whether it requires normalization).

zesch commented 6 years ago

Can you give an example? Usually a MetaCollector should provide the necessary information to the FE.

Horsmann commented 6 years ago

Ah, right, the meta collectors; I almost forgot about them. This probably works, but it has to be implemented/adapted for all features. This is some additional work. As a consequence, the unnormalized feature extractors should be deleted from the TC backend feature repository. The FEs provided by TC should all be properly normalized.

Horsmann commented 6 years ago

Furthermore, it is tricky to decide whether the majority of features that compute some ratio relative to the current document should be kept or not, for instance AdjectiveEndingFeatureExtractor. The meta-collector solution seems cleaner: it normalizes by the maximum value over all documents rather than only over the instances in the current document.
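The global variant can be sketched as a two-pass scheme: a first pass over all documents records the maximum of each feature, and a second pass divides by that maximum. The class `GlobalMaxNormalizer` below is purely illustrative and does not use the actual DKPro TC meta-collector API; it also assumes non-negative count features.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not the DKPro TC MetaCollector API): collects the
 * global maximum per feature in a first pass, then normalizes in a second.
 * Assumes non-negative values such as lengths and counts.
 */
final class GlobalMaxNormalizer {

    private final Map<String, Double> maxByFeature = new HashMap<>();

    /** First pass: remember the largest value seen for this feature. */
    void observe(String featureName, double value) {
        maxByFeature.merge(featureName, value, Double::max);
    }

    /** Second pass: divide by the global maximum, capping unseen larger values at 1. */
    double normalize(String featureName, double value) {
        Double max = maxByFeature.get(featureName);
        if (max == null || max <= 0.0) {
            // Feature never observed, or only zeros seen: nothing to scale by.
            return 0.0;
        }
        return Math.min(1.0, value / max);
    }
}
```

For example, after observing token counts of 120 and 480 across the collection, a document with 120 tokens is normalized to 0.25, independent of how the other values within that one document are distributed.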

zesch commented 6 years ago

It might make sense to have a feature that encodes a value between 0.0 and 1.0 for, e.g., the POS ratio within a single document rather than across the whole document collection.

The AdjectiveEndingFeatureExtractor is certainly not the best example, but there are others which should be kept.

There is also the question of why we should not keep things that make sense semantically. The behavior can be documented in the JavaDocs.
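As an illustration of such a document-relative feature, a POS ratio is bounded to 0.0..1.0 by construction and needs no collection-wide statistics. The following is a minimal sketch with a hypothetical `PosRatio` helper that works on a plain list of tag strings rather than on a CAS.

```java
import java.util.List;

/**
 * Hypothetical helper (not a DKPro TC feature extractor): share of tokens
 * in one document that carry a given POS tag.
 */
final class PosRatio {

    private PosRatio() {
        // static helper, no instances
    }

    /** Returns hits / total, which always lies in [0, 1]; 0 for an empty document. */
    static double ratio(List<String> posTags, String targetTag) {
        if (posTags.isEmpty()) {
            return 0.0;
        }
        long hits = posTags.stream().filter(targetTag::equals).count();
        return (double) hits / posTags.size();
    }
}
```

A document tagged `ADJ, NOUN, ADJ, VERB` yields an adjective ratio of 0.5, which says something about that document itself, whereas a globally normalized count says how it compares with the rest of the collection.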
