Fix to Issue #57 has unintended consequences

peterTorrione commented 6 years ago

Checkout this revision: 85b6087b2917014530096c4f65c78c15189f966a

then run:

ds = prtDataGenUnimodal;
class = prtClassFld;
ds = class.kfolds(ds,10);
disp(ds)

You will see ds has userData is a [10 x 1] struct. This bit me in the ass when I was running some cross-validation experiments, because:

ds = class.kfolds(ds,100);
ds = class.kfolds(ds,100);
ds = class.kfolds(ds,100);

Will make a very large userData structure. Because in the first instance we make a ds with a 1x100 userData, and then we make a 1 x (100100) userData, then (1 x (100100*100)) etc. That's... no good.

Note that, as far as I can tell, this does NOT happen in the parent commit: f441f1ce2b9ad1e763a1efebffb9686c158fc074

The problem is that userData was never intended to be used to hold information from different cross-validation folds - userData was always intended to be a single, global set of information for the dataSet, and mucking with it inside kfolds breaks that assumption.

The bug above only really matters if you run cross-validaton on the same data set a bunch of times - it is admittedly not the worst problem in the world, but it can be weird. If userData was actually BIG this problem can be significant - I don't necessarily want 1000 copies of userData for my 1000 folds.

I don't know what the right answer is. Usually I'd suggest putting whatever this information that you want to store in userData on every fold in the ACTION (and use the second output of kfolds or crossValidate) but that's clearly sub-optimal and confusing.

We could ADD a new field to dataSets - kfoldsUserData, which is an nFolds x 1 array of userData? Then in cross-Validate, we can change prtDataSetBase.crossValidateCombineFoldResults to do this:

    for i = 1:length(dsTestCell)
         dsOut.kfoldsUserData(i) = dsTestCell(i).userData;
    end

Thoughts?

peterTorrione commented 6 years ago

Actually, Miles just convinced me a better solution is to just provide the individual dataSets out as an optional 3 (crossValidate) or 4th (kfolds) output - then you can do:

   [dsAgg,ascts,dsFolds] = class.kfolds(ds,10);

where dsFolds(1).userData, dsFolds(2).userData, etc. have the right things.

patrickkwang commented 6 years ago

Then what should the userData for dsAgg look like?

peterTorrione commented 6 years ago

I think its the userData from the original ds? (the input)

cratto commented 6 years ago

So my workaround has been to index the very big userData @peterTorrione referenced originally. It gets really big because I am doing 10-fold xval as well as one-vs-all classification. So I have a double for look that goes something like this:

for iClass = 1:nClasses
     for iFold = 1:nFold
        userDataIdx = iClass*(iFold-1)+iClass;
        data_i_need = dsOut.userData(userDataIdx).fieldName
      end
end

I like Miles' suggestion to output the individual data sets from kfolds, and then I can use each data set's observationInfo() instead. We would still need to figure out how to fill observationInfo() in a one-vs-all scheme.

Ultimately I think it would be easiest if my snippet would look more like:

for iClass = 1:nClasses
     for iFold = 1:nFold
        data_i_need = dsOutPerFold(iFold).observationInfo(iClass).fieldName
      end
end

peterTorrione commented 6 years ago

Oh holy cow. prtClassBinaryToMary does the same thing! Wow. That's... interesting.

I think I understand the use case. I don't think observationInfo works here, b/c it has to be nObservations x 1. And, I think we always thought a "pure" action shouldn't muck with userData b/c you (the user) might put stuff in userData, and you want it to come out the same way you put it in. I propose one of the following -

1) The idea above, which basically results in:

[dsAgg,acts,dsOutPerFold] = ...
for iClass = 1:nClasses
     for iFold = 1:nFold
        data_i_need = dsOutPerFold(iFold).userData(iClass).fieldName
      end
end

Per here - https://github.com/covartech/PRT/issues/57 I am pretty sure we want to NOT use observationInfo here b/c observationInfo is "special" and gets handled properly, e.g., in ds.retainObservations(...), and has to be nObservations x 1.

To avoid smooshing other people's userData, I would actually do something like

   function self = runAction(self,ds)
        ....
        ds.userData.prtClassFld = struct(all my info I want to store)
   end

If I was doing that in prtClassFld. In general this is probably safe

   function self = runAction(self,ds)
        ....
        ds.userData.(class(self)) = struct(all my info I want to store)
   end

2) We can also make a new field of prtDataSet* - call it "actionData". actionData is intended to be mucked with by actions and do all of the above with actionData rather than userData. We have to be careful that whenever we make multiple dataSets (e.g., in cross-validate, in prtClassBinaryToMary) we always concatenate actionData arrays if they are not empty I guess?

Thoughts?

THen you would do basically

for iClass = 1:nClasses
     for iFold = 1:nFold
         data_i_need = dsOutPerFold(iFold).actionInfo.prtClassBinaryToMary(iClass).classifierName.fieldName
      end
end

But I'm not 100% sure that's right either, and I need to go pick up my family...

cratto commented 6 years ago

I'm on board with either of these. Option 1 satisfies my needs (as does my hack mentioned above), but if Option 2 allows for userData to be used as originally intended, that's fine too.

covartech / PRT

Fix to Issue #57 has unintended consequences #66