This issue reminds me a bit of Issue-396; it might be useful to consult that
fix for this one.
Original comment by phi...@ogren.info
on 15 Mar 2014 at 6:11
Alexey, I'm curious to know if you have thought any more about how you would
like this issue to be resolved. I am inclined to recommend that we modify
org.cleartk.ml.feature.transform.extractor.ZeroMeanUnitStddevExtractor.train()
so that it throws an exception if the stddev is 0 when it is writing out the
MeanVarianceRunningStat objects. It seems like a very poor choice of a feature
if it only occurs once or it is always the same value. I think it would be
better to make sure it fails during training rather than somehow try to make it
work when classifying. What is your thought?
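As a rough illustration of the fail-fast idea, here is a minimal sketch that is not ClearTK code. The `RunningStat` class is a hypothetical stand-in for MeanVarianceRunningStat (its real API is not shown in this thread), and `requireNonZeroStddev` is an invented helper name. It assumes the statistics are kept per feature name and checked once before being written out.

```java
import java.util.Map;

// Hypothetical sketch: reject zero-variance features at training time.
class TrainCheckSketch {

    // Stand-in for a running mean/variance accumulator (Welford's method).
    static final class RunningStat {
        private long n;
        private double mean;
        private double m2; // sum of squared deviations from the running mean

        void add(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        // Sample standard deviation; 0 when fewer than two values were seen.
        double stddev() {
            return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0;
        }
    }

    // Throws if any feature was seen only once or always had the same value.
    static void requireNonZeroStddev(Map<String, RunningStat> statsByFeature) {
        for (Map.Entry<String, RunningStat> entry : statsByFeature.entrySet()) {
            if (entry.getValue().stddev() == 0.0) {
                throw new IllegalStateException(
                    "Feature '" + entry.getKey() + "' has stddev 0 and cannot be normalized");
            }
        }
    }
}
```

Failing here would surface the problem while the model is being built, at the cost of rejecting training data that contains any constant or one-off feature.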
Original comment by phi...@ogren.info
on 12 Apr 2014 at 6:02
The feature I ran into this with was 3-grams, which will quite often be unique
in the training data. The proper thing to do would be to add a special feature
for unique (or rare) n-grams. Is there a simple way to do this in ClearTK? I
don't remember seeing one, but I am no longer using ClearTK actively.
Original comment by alexey.v...@gmail.com
on 12 Apr 2014 at 6:34
I'm not understanding the use case. Are you saying that you were training a
ZeroMeanUnitStddevExtractor for each unique 3-gram in your training data? For
starters, that feature extractor is meant for numeric features.
I would think that a TF-IDF feature for bag-of-ngrams might be a good starting
place.
Original comment by phi...@ogren.info
on 12 Apr 2014 at 6:47
Never mind my last comment. I just did a unit test on
MaxMinNormalizationExtractor and now I understand why you would count 3-grams
and submit them to the ZMUSE. I'll have to make a best guess as to the correct
behavior when I write the unit test here.
Original comment by phi...@ogren.info
on 12 Apr 2014 at 11:26
Ok - I think that if a feature only occurs once or it always has the same
value, then it is reasonable to return zero if the feature value being
transformed is the same as the mean. However, this is likely going to be a
pretty worthless feature. If the feature value being transformed is something
different than the mean, then I would consider that to be undefined. Even if
we came up with a reasonable estimate/default value it still isn't likely to be
a useful feature. I think a reasonable thing to do when transforming such
features is to return nothing and the list of returned features from the
extract method will be shorter.
Original comment by phi...@ogren.info
on 13 Apr 2014 at 1:16
If stddev = 0, or if a feature from the sub-extractor has never been seen
before, then we will not create a ZMUS feature for it.
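A minimal sketch of that rule, for illustration only and not the actual ClearTK change. The `ZmusSketch` class, its `Stats` holder, and the `ZMUS_` name prefix are all hypothetical. The sketch assumes the trained mean and stddev are available in a map keyed by feature name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the skip-on-undefined behavior described above.
class ZmusSketch {

    // Mean and standard deviation recorded for one feature during training.
    static final class Stats {
        final double mean;
        final double stddev;

        Stats(double mean, double stddev) {
            this.mean = mean;
            this.stddev = stddev;
        }
    }

    // Returns z-scored values; features that cannot be normalized are omitted,
    // so the result may be shorter than the input.
    static Map<String, Double> transform(Map<String, Double> features,
                                         Map<String, Stats> trained) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : features.entrySet()) {
            Stats stats = trained.get(entry.getKey());
            if (stats == null || stats.stddev == 0.0) {
                continue; // never seen in training, or zero variance: no feature
            }
            double z = (entry.getValue() - stats.mean) / stats.stddev;
            out.put("ZMUS_" + entry.getKey(), z);
        }
        return out;
    }
}
```

With this approach a classifier simply sees fewer features for such inputs, rather than a division by zero or an arbitrary default value.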
Original comment by phi...@ogren.info
on 13 Apr 2014 at 3:10
Yes, this seems very reasonable.
Original comment by alexey.v...@gmail.com
on 13 Apr 2014 at 7:06
Original issue reported on code.google.com by
alexey.v...@gmail.com
on 16 Jan 2014 at 6:24