Closed peterTorrione closed 11 years ago
Looks like the issue is with how featureNameModificationFunction is passed around. I think the base feature names after concatenation are ['','asdf','asdf2'], and yOut had a modifier that says "replace everything with sprintf('FLD Output_{%s}',featureInd)". This gets passed too, and so that's what you see. Try keyboarding inside prtDataSetStandard.getFeatureNames for the concatenated dataSet and look at obj.featureNameIntegerAssocArray. Maybe we want to somehow rasterize the featureNameModifiers before we catFeatures? I don't know why this feature name modification scheme is even here, so this may violate some other requirement.
Patrick - you're absolutely right. That's the source of the bug. Why is it written that way? The underlying logic is twofold:
1) We don't want to store feature names when we don't have to. They can be super time-consuming to make, so we evaluate them "lazily" whenever we can. Imagine LIBS data (which is what spurred this change) with 10,000 features and 1,000 observations. Every time we did anything we spent a few seconds making strings: feature_1, feature_2, ..., feature_10000. Which no one enjoyed.
2) We want each prtAction to be able to append a function to add some text, e.g., PCA + PLSDA --> PLSDA{PCA{feature#}}, and for the names to get evaluated lazily.
So, as you send data sets through actions, they all append their little functions onto the pre-existing function. As you saw, the problem is that catFeatures just takes the fnHandle from the first data set, and doesn't know that the second data set may have come from another pipeline with its own idea about feature names.
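To make the mechanism concrete, here's a small Python sketch of the scheme described above (the class and function names are illustrative only, not the prt API): each action wraps the existing modifier function, names are only built when asked for, and cat_features keeps only the first data set's modifier, which reproduces the bug.

```python
# Hypothetical sketch of the lazy feature-name scheme. Names here
# (DataSet, apply_action, cat_features) are illustrative, not prt API.

class DataSet:
    def __init__(self, n_features, modifier=None):
        self.n_features = n_features
        self.modifier = modifier            # single fn: (name, i) -> name

    def feature_name(self, i):
        # Lazy evaluation: the string "feature_i" is built only on request.
        name = f"feature_{i}"
        return self.modifier(name, i) if self.modifier else name

def apply_action(ds, tag):
    # Each action wraps the pre-existing modifier, so PCA then PLSDA
    # evaluates lazily to PLSDA{PCA{feature_i}}.
    old = ds.modifier
    def new_modifier(name, i, old=old):
        inner = old(name, i) if old else name
        return f"{tag}{{{inner}}}"
    return DataSet(ds.n_features, new_modifier)

def cat_features(a, b):
    # The buggy behavior: only the FIRST data set's modifier survives,
    # so b's features get renamed with a's function.
    return DataSet(a.n_features + b.n_features, modifier=a.modifier)
```

Running `cat_features(apply_action(DataSet(2), "FLD"), DataSet(2))` and asking for the name of feature 2 (which came from the plain second data set) yields `FLD{feature_2}`, which is exactly the wrong-name symptom reported above.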
There is not exactly a straightforward fix to this.
So that's the history / source of the current bug
Instead of a single featureNameModificationFunction, you could hold a small cell array of featureNameModificationFunctions and include with each one a binary vector indicating which features it applies to. The functions would have to be prioritized (sort of nested) in a clever way. The binary vectors would have to be passed around, copied, etc, but I don't think this would be as bad as using all the strings directly. If handling the binary vectors is sufficiently quick, this seems to solve the issue and meet the requirements for lazy evaluation.
Right. I think something like this is the right answer... Not sure who is going to do it. You interested? Or we can get Kenny on board.
I can take a swing at it.
If that sounds fun to you, then please!
Make sure that: catFeatures, removeFeatures, retainFeatures, getFeatureNames all act right... might need to make some unit tests for these.
Pushed fix. catFeatures, removeFeatures, retainFeatures, and getFeatureNames should all work. Not sure about speed when handling large data sets.
This seems to work great. I don't notice any performance hit. Great fix.
ds = prtDataGenFeatureSelection(100,1000);
algo = prtPreProcZmuv/prtPreProcPca('nComponents',8)/(prtPreProcPls('nComponents',3) + prtPreProcZmuv) + prtPreProcZmuv;
algo = train(algo,ds);
dsOut = algo.run(ds);
Does this red-line for anyone else?
ds = prtDataGenUnimodal;
dsPca = rt(prtPreProcPca,ds);
dsC = catFeatures(dsPca,ds);
dsC.featureNames
Yeah. That doesn't fly.
Patrick, are there unit tests for these?
I removed hasFeatures(). The above tests now both work for me.
I looked around, and the only place hasFeatures seemed to be used was in prtDataSetStandard.
Also, prtTest and prtTestSmoke both run with the new mods. So this seems to be OK.
ds = prtDataGenUnimodal;
ds.featureNames = {'asdf','asdf2'};
yOut = rt(prtClassFld,ds);
plot(catFeatures(yOut,ds))
All features are now called the wrong thing.
Not sure what to do with this. It seems to be a pretty fundamental problem in the way feature names are handled...