feature names can still be broken pretty easily

covartech / PRT

Pattern Recognition Toolbox for MATLAB

http://covartech.github.io/

MIT License

145 stars 70 forks source link

feature names can still be broken pretty easily #16

Open peterTorrione opened 11 years ago

peterTorrione commented 11 years ago

Code: yOut = kfolds(prtClassPlsda,ds,3); yOut.X = cat(2,yOut.X,randn(size(yOut.X))); yOut.featureNames

Index exceeds matrix dimensions.

Error in prtDataSetStandard/getFeatureNames (line 209) if ~isempty(obj.featureNameModificationFunction) && obj.featureNameModificationMask(iFeat)

Error in prtDataSetStandard/get.featureNames (line 58) fn = self.getFeatureNames;

patrickkwang commented 11 years ago

The following works as expected: yOut = kfolds(prtClassPlsda,ds,3); yOut = yOut.catFeatures(prtDataSetClass(randn(size(yOut.X)))); yOut.featureNames

I think I can make the sneakier yOut.X command work without errors, but there seems to be a deeper problem with it. getFeatureNames would be ignorant of the difference between yOut.X = cat(2,yOut.X,randn(size(yOut.X))); and yOut.X = cat(2,randn(size(yOut.X)),yOut.X);

Seems like the best behavior here would be for setData to wipe out any existing feature names. Maybe we should make a resetFeatureNames method?

peterTorrione commented 11 years ago

I'm not sure what the status of this is; we might need to re-factor feature names?

patrickkwang commented 11 years ago

I think the best solution would be to have setData erase any existing feature names (maybe only if the number of features changes). Does this sound reasonable?

kennethmorton commented 11 years ago

Erasing feature names seems like a cop out to me. I think one of our issues is that we haven't fully laid out tests and performance goals. A re-factor might be necessary.

peterTorrione commented 11 years ago

Agree. This bug (and the other EXPLORE problem) seem to point towards the need for a re-vamped unit-test framework...

The desired behavior of feature name and observation names are actually quite subtle when you start thinking about all the different ways you can modify data sets. I'm not sure what the long-term solution is. And I'm not even sure what the short-term solution is for this particular bug...

patrickkwang commented 11 years ago

The problem is that any time someone sets the data directly, we have no way of knowing how they've scrambled the features or observations. Short of comparing the old data to the new data and doing some sort of inference, I don't see how we can do anything more intelligent than throwing a warning to say "reset your feature and observation names."

peterTorrione commented 11 years ago

We definitely don't want to try and do inference...

But I think it's even more subtle than that. Even catFeatures or retainFeatures can act in ways that might confuse people:

ds1 = prtDataGenUnimodal; ds2 = ds1.retainFeatures(2); fprintf('%s = %s?\n',ds1.featureNames{2},ds2.featureNames{1})

Feature 2 = Feature 1?

Now, I know why we do that, but that decision was made from an implementation point of view; not from a user experience point of view. It's very tricky to get everything to work right.