handle duplicate features consistently across ML implementations

GoogleCodeExporter commented 9 years ago

As discussed on the mailing list, different feature encoders do different 
things when encountering duplicate features:

https://groups.google.com/d/topic/cleartk-users/B2cfZSUX7W0/discussion

For example, FeatureVectorFeaturesEncoder adds together the counts for 
identical feature names,
NameNumberFeaturesEncoder produces duplicate NameNumber pairs, and 
FeatureNodeArrayEncoder throws away all but the last value.
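
To make the difference concrete, here is a minimal sketch of the three behaviors applied to the same input. It does not use the ClearTK classes: `DuplicateBehaviors`, `Pair`, and the three method names are hypothetical stand-ins, and the real encoders may differ in detail.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of this is ClearTK code.
final class DuplicateBehaviors {

    // Minimal name/value pair standing in for an encoded feature.
    static final class Pair {
        final String name;
        final double value;

        Pair(String name, double value) {
            this.name = name;
            this.value = value;
        }
    }

    // Behavior described for FeatureVectorFeaturesEncoder: values sharing a name are summed.
    static Map<String, Double> sum(List<Pair> features) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Pair p : features) {
            out.merge(p.name, p.value, Double::sum);
        }
        return out;
    }

    // Behavior described for NameNumberFeaturesEncoder: every pair is kept, duplicates included.
    static List<Pair> keepAll(List<Pair> features) {
        return new ArrayList<>(features);
    }

    // Behavior described for FeatureNodeArrayEncoder: only the last value for a name survives.
    static Map<String, Double> lastWins(List<Pair> features) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Pair p : features) {
            out.put(p.name, p.value);
        }
        return out;
    }
}
```

Given two features both named `w` with values 1.0 and 2.0, `sum` yields a single entry of 3.0, `keepAll` yields two entries, and `lastWins` yields a single entry of 2.0, so the same input reaches three classifiers in three different forms.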

All the feature encoders should do the same thing. A few options:

* Add values together, as in FeatureVectorFeaturesEncoder, though this doesn't 
make much sense for Boolean-valued features.

* Throw an exception, requiring the annotator to de-duplicate. This might be 
conceptually the simplest thing to do, but might require substantially more 
work from the annotator.
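
For the second option, a rough sketch of what a shared check might look like, assuming each encoder could call it before encoding. `StrictDuplicateCheck` and `requireUniqueNames` are hypothetical names, not existing ClearTK API, and the check works on plain feature names rather than ClearTK `Feature` objects.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not part of ClearTK.
final class StrictDuplicateCheck {

    // Throws if any feature name appears more than once in the list.
    static void requireUniqueNames(List<String> featureNames) {
        Set<String> seen = new HashSet<>();
        for (String name : featureNames) {
            // Set.add returns false when the name was already present.
            if (!seen.add(name)) {
                throw new IllegalArgumentException("duplicate feature name: " + name);
            }
        }
    }
}
```

A single shared check like this would give every encoder the same behavior, at the cost of pushing de-duplication onto whoever builds the feature list.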

In addition to true duplicates, we also need to figure out what we should do 
when two features with the same name but *different* values are given.
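
Whatever policy is chosen, the two cases can at least be told apart cheaply. A minimal sketch, assuming features arrive as parallel lists of names and values; `DuplicateKinds`, `Kind`, and `classify` are hypothetical and not ClearTK API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper; not part of ClearTK.
final class DuplicateKinds {

    enum Kind { NONE, TRUE_DUPLICATE, CONFLICT }

    // names.get(i) is paired with values.get(i).
    // Returns CONFLICT if any name repeats with a different value,
    // TRUE_DUPLICATE if names repeat only with equal values, else NONE.
    static Kind classify(List<String> names, List<?> values) {
        if (names.size() != values.size()) {
            throw new IllegalArgumentException("names and values differ in length");
        }
        Map<String, Object> firstValue = new HashMap<>();
        Kind result = Kind.NONE;
        for (int i = 0; i < names.size(); i++) {
            String name = names.get(i);
            Object value = values.get(i);
            if (!firstValue.containsKey(name)) {
                firstValue.put(name, value);
            } else if (Objects.equals(firstValue.get(name), value)) {
                result = Kind.TRUE_DUPLICATE;
            } else {
                return Kind.CONFLICT;
            }
        }
        return result;
    }
}
```

This only detects the situation; what each encoder should do with a `CONFLICT` is still the open question.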

Original issue reported on code.google.com by steven.b...@gmail.com on 1 Mar 2013 at 9:24

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 3 May 2013 at 8:44

Added labels: Milestone-2.1
Removed labels: Milestone-1.4

GoogleCodeExporter commented 9 years ago

Original comment by phi...@ogren.info on 15 Mar 2014 at 5:41

Added labels: Milestone-2.2
Removed labels: Milestone-2.1

fangfangli / cleartk

handle duplicate features consistently across ML implementations #350