Division by zero in ZeroMeanUnitStddevExtractor

GoogleCodeExporter commented 9 years ago

If a particular feature occurred only once while training a 
ZeroMeanUnitStddevExtractor (or always occurred with the same value), 
stats.stddev will be 0, so (value - stats.mean) / stats.stddev divides by zero 
and yields NaN (or an infinity when value differs from the mean), leading to 
problems down the line. I am not sure what the best solution would be here.
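
For illustration, here is a minimal sketch of the arithmetic described above. 
The class and the zScore helper are hypothetical names, not ClearTK code; the 
sketch only assumes the values are Java doubles, so a zero divisor gives NaN 
or an infinity rather than throwing an exception.

```java
class ZeroStddevDemo {
    // Hypothetical helper mirroring the expression quoted in the report.
    static double zScore(double value, double mean, double stddev) {
        return (value - mean) / stddev;
    }

    public static void main(String[] args) {
        // Feature seen once (or always with the same value): stddev is 0.
        System.out.println(zScore(3.0, 3.0, 0.0)); // 0.0 / 0.0, prints NaN
        System.out.println(zScore(5.0, 3.0, 0.0)); // 2.0 / 0.0, prints Infinity
    }
}
```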

Original issue reported on code.google.com by alexey.v...@gmail.com on 16 Jan 2014 at 6:24

GoogleCodeExporter commented 9 years ago

This issue reminds me a bit of Issue-396. It might be useful to consult that 
fix for this one.

Original comment by phi...@ogren.info on 15 Mar 2014 at 6:11

Changed state: Accepted
Added labels: Milestone-2.0

GoogleCodeExporter commented 9 years ago

Alexey, I'm curious to know if you have thought any more about how you would 
like this issue to be resolved. I am inclined to recommend that we modify 
org.cleartk.ml.feature.transform.extractor.ZeroMeanUnitStddevExtractor.train() 
so that it throws an exception if the stddev is 0 when it is writing out the 
MeanVarianceRunningStat objects. A feature that only occurs once, or that 
always has the same value, seems like a very poor choice of feature. I think it 
would be better to make sure it fails during training rather than somehow try 
to make it work when classifying. What are your thoughts?
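
A rough sketch of what such a fail-fast check might look like. Everything here 
is hypothetical: RunningStat is a stand-in accumulator written for this 
example (it is not ClearTK's MeanVarianceRunningStat, whose API is not shown 
in this thread), and the use of the population standard deviation is an 
assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TrainingCheckSketch {
    /** Hypothetical running mean/variance accumulator (Welford's method). */
    static class RunningStat {
        private long n;
        private double mean;
        private double m2; // sum of squared deviations from the running mean

        void add(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        /** Population standard deviation; 0 for one value or constant values. */
        double stddev() {
            return n > 0 ? Math.sqrt(m2 / n) : 0.0;
        }
    }

    /** Throws if any trained feature has a standard deviation of zero. */
    static void validate(Map<String, RunningStat> statsByFeature) {
        for (Map.Entry<String, RunningStat> e : statsByFeature.entrySet()) {
            if (e.getValue().stddev() == 0.0) {
                throw new IllegalStateException(
                    "Feature '" + e.getKey() + "' has zero standard deviation");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, RunningStat> stats = new LinkedHashMap<>();
        RunningStat once = new RunningStat();
        once.add(1.0); // a feature observed a single time
        stats.put("seen-once", once);
        validate(stats); // throws IllegalStateException
    }
}
```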

Original comment by phi...@ogren.info on 12 Apr 2014 at 6:02

GoogleCodeExporter commented 9 years ago

The feature I ran into it with was 3-grams, which will quite often be unique in 
the training data. The proper thing to do would be to add a special feature for 
unique (or rare) n-grams. Is there a simple way to do this in ClearTK? I don't 
remember seeing one, but I am no longer using ClearTK actively.

Original comment by alexey.v...@gmail.com on 12 Apr 2014 at 6:34

GoogleCodeExporter commented 9 years ago

I don't understand the use case. Are you saying that you were training a 
ZeroMeanUnitStddevExtractor for each unique 3-gram in your training data? For 
starters, that feature extractor is meant for numeric features.

I would think that a TF-IDF feature for bag-of-ngrams might be a good starting 
place.

Original comment by phi...@ogren.info on 12 Apr 2014 at 6:47

GoogleCodeExporter commented 9 years ago

Never mind my last comment. I just wrote a unit test for 
MaxMinNormalizationExtractor and now I understand why you would count 3-grams 
and submit them to the ZeroMeanUnitStddevExtractor (ZMUSE). I'll have to make a 
best guess as to the correct behavior when I write the unit test here.

Original comment by phi...@ogren.info on 12 Apr 2014 at 11:26

GoogleCodeExporter commented 9 years ago

Ok - I think that if a feature only occurs once or always has the same value, 
then it is reasonable to return zero if the feature value being transformed is 
the same as the mean. However, this is likely going to be a pretty worthless 
feature. If the feature value being transformed is different from the mean, 
then I would consider the result to be undefined. Even if we came up with a 
reasonable estimate/default value, it still isn't likely to be a useful 
feature. I think a reasonable thing to do when transforming such features is 
to return nothing, so the list of features returned from the extract method 
will simply be shorter.

Original comment by phi...@ogren.info on 13 Apr 2014 at 1:16

GoogleCodeExporter commented 9 years ago

If stddev = 0, or if a feature from the sub-extractor has never been seen 
before, then we will not create a ZMUS feature for it.
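
The rule above could be sketched roughly as follows. This is not the actual 
ClearTK change: the Stats class, the transform method, and the use of plain 
maps keyed by feature name are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SkipZeroStddevSketch {
    /** Hypothetical per-feature statistics recorded at training time. */
    static class Stats {
        final double mean;
        final double stddev;

        Stats(double mean, double stddev) {
            this.mean = mean;
            this.stddev = stddev;
        }
    }

    /**
     * Normalizes each raw feature value to (value - mean) / stddev.
     * Features that were never seen in training, or whose stddev is 0,
     * are omitted, so the result may be smaller than the input.
     */
    static Map<String, Double> transform(Map<String, Double> raw,
                                         Map<String, Stats> trained) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : raw.entrySet()) {
            Stats s = trained.get(e.getKey());
            if (s == null || s.stddev == 0.0) {
                continue; // unseen feature or zero stddev: create no feature
            }
            out.put(e.getKey(), (e.getValue() - s.mean) / s.stddev);
        }
        return out;
    }
}
```

With trained statistics for "a" (mean 3, stddev 2) and "b" (mean 7, stddev 0), 
transforming the raw values a=5, b=7, c=1 would keep only "a", with value 1.0.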

Original comment by phi...@ogren.info on 13 Apr 2014 at 3:10

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Yes, this seems very reasonable.

Original comment by alexey.v...@gmail.com on 13 Apr 2014 at 7:06

Division by zero in ZeroMeanUnitStddevExtractor #399