Closed Horsmann closed 6 years ago
The problem is not so easy to solve. Document-relative feature extractors do this already, for instance the number of tokens per sentence relative to all sentences within the same document/CAS. It would probably still be better to do the normalization globally over all documents.
At the moment there is no way to know the maximum value of a feature, or whether it is numeric at all (i.e. whether it requires normalization).
Can you give an example? Usually a MetaCollector should provide the necessary information to the FE.
Ah, right, the meta collectors; I almost forgot about them. This probably works, but it has to be implemented/adapted for all features. This is some additional work, and the unnormalized feature extractors should consequently be deleted from the TC backend feature repository. The FEs provided by TC should all be properly normalized.
Furthermore, it is tricky to decide whether the many features that compute some ratio relative to the current document should be kept, for instance AdjectiveEndingFeatureExtractor. The meta-collector solution seems cleaner: normalize against the maximum over all documents rather than just over the instances in the current document.
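A minimal sketch of the meta-collector idea discussed above. The class and method names here are hypothetical, not the actual DKPro TC MetaCollector API: a meta phase scans all documents once to record a global maximum, and the feature extractor then normalizes per-document counts against it.

```java
import java.util.List;

// Hypothetical sketch (not the actual DKPro TC API): a "meta collector"
// records the global maximum during a first pass over the collection;
// the feature extractor later divides by it, so values land in [0, 1]
// and are comparable across documents, not just within one document.
public class MaxTokenMetaSketch {

    // Meta phase: scan all documents once and remember the global maximum.
    static int collectGlobalMax(List<Integer> tokenCountsPerDocument) {
        int max = 0;
        for (int count : tokenCountsPerDocument) {
            max = Math.max(max, count);
        }
        return max;
    }

    // Extraction phase: normalize against the global maximum instead of a
    // document-local one.
    static double normalize(int tokenCount, int globalMax) {
        return globalMax == 0 ? 0.0 : (double) tokenCount / globalMax;
    }

    public static void main(String[] args) {
        int globalMax = collectGlobalMax(List.of(120, 480, 60));
        System.out.println(normalize(120, globalMax)); // prints 0.25
    }
}
```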
It might make sense to have a feature that encodes, between 0.0 and 1.0, e.g. the POS ratio within a document rather than over the whole document collection.
The AdjectiveEndingFeatureExtractor is certainly not the best example, but there are others which should be kept.
There is also the question of why not keep things that make sense semantically. The behavior can be documented in the JavaDocs.
All numeric feature values should be in the range 0..1 or -1..1, especially for SVMs. The boolean nature of most features satisfies the former range, but length/count features in particular are not normalized at the moment, i.e. document length, sentence length, number of tokens, etc. This might lead to numeric instability when a few (length) features output large numbers while all other feature values are in the range 0..1.
cf. https://stackoverflow.com/questions/15436367/svm-scaling-input-values
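The standard fix recommended in that thread is min-max scaling. A minimal sketch (the 0..10000 range below is an illustrative assumption, not a value from TC):

```java
public class MinMaxScale {
    // Standard min-max scaling: maps a raw value into [0, 1], the range
    // recommended for numeric SVM features.
    static double scale(double x, double min, double max) {
        return max == min ? 0.0 : (x - min) / (max - min);
    }

    public static void main(String[] args) {
        // A raw document length of 5000 tokens would dwarf boolean 0/1
        // features; after scaling it lies in the same 0..1 range.
        System.out.println(scale(5000, 0, 10000)); // prints 0.5
    }
}
```

The `min` and `max` would have to come from a global pass over the collection, which is exactly what the meta collectors discussed above could provide.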
Two todos arise:
In Weka's ARFF format, the class is declared as `@ATTRIBUTE class` with fixed values; it could be worth a try to encode boolean features as true/false rather than as numeric, which is an open class. The extra information might pay off.
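For illustration, the difference in ARFF declarations (attribute names here are made up, not taken from TC):

```text
@RELATION example

% numeric encoding: open-ended value range
@ATTRIBUTE hasAdjectiveEnding_numeric NUMERIC

% nominal encoding: closed set of values the learner can exploit
@ATTRIBUTE hasAdjectiveEnding_nominal {true,false}

% the class attribute is already declared nominally, with fixed values
@ATTRIBUTE class {positive,negative}
```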