ardila / behavioralthor

Improving models of the ventral stream using crowd sourced behavioral measurements

Directions forward #2

Open ardila opened 11 years ago

ardila commented 11 years ago

@yamins81 There are two main goals:

1) A screening set that is representative of the difficulty in the 1000-way categorization task, for creating a challenge submission

2) A screening set, drawn from all of imagenet, that is representative of the difficulty that humans handle well, for getting better neural fits

re 1) We should use random L3 models (5 sets of features, one from each random model) and find a set of images that is hard to separate on average for the model class. This would mean extracting #N1 images from each synset, then getting margins for all 2-ways for each image. Then, we could just take the mean of the set of negative margins for each image as a score, and take the #N2 lowest scoring images.
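A minimal sketch of this scoring step, assuming a hypothetical `margins` array of shape (n_models, n_images, n_pairs) holding the signed 2-way margins from the 5 random L3 feature sets (the function name and array layout are illustrative, not the actual pipeline):

```python
import numpy as np

def worst_margin_screen(margins, n_keep):
    """Score each image by the mean of its negative 2-way margins
    (averaged over models) and return the n_keep lowest-scoring images."""
    # margins: hypothetical array (n_models, n_images, n_pairs) of signed
    # 2-way margins; negative means the image fell on the wrong side of
    # that pairwise decision boundary.
    mean_margins = margins.mean(axis=0)                    # average the 5 random L3 models
    negative_only = np.where(mean_margins < 0, mean_margins, np.nan)
    scores = np.nanmean(negative_only, axis=1)             # mean negative margin per image
    scores = np.nan_to_num(scores, nan=0.0)                # no negative margins -> score 0 (easy)
    return np.argsort(scores)[:n_keep]                     # most negative (hardest) first
```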

re 2) We should find the largest negative margins as above, but then test each of these margins in humans. This means we will have a list of (image, distractor_synset, margin) tuples ranked by margin, most negative first.

And we will search through this set of image tuples using psychophysics to find the first #N2 tuples (going down the ordered list) whose human performance is above some threshold.
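A sketch of that search, with `get_human_accuracy` standing in for whatever psychophysics measurement we end up using and the 0.8 threshold chosen purely for illustration:

```python
def screen_with_humans(ranked_tuples, get_human_accuracy, n_keep, threshold=0.8):
    """Walk the margin-ranked (image, distractor_synset, margin) list,
    most negative first, and keep the first n_keep tuples whose human
    2-way performance clears the threshold."""
    kept = []
    for image, distractor_synset, margin in ranked_tuples:
        accuracy = get_human_accuracy(image, distractor_synset)  # psychophysics measurement
        if accuracy >= threshold:
            kept.append((image, distractor_synset, margin))
            if len(kept) == n_keep:
                break
    return kept
```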

Here are some training curve results for MCC2 classification. The results for linearsvc are still being calculated (it takes about 210 minutes to generate one of these curves).

[screenshot: MCC2 training curve, 2013-09-30]

Immediate points of action: 1) deciding how many images per synset to extract (#N1), then extracting them; 2) deciding the size of the screening set (#N2).

N1 seems to be around 400 given the training curve (saturation around 300-350 training examples, plus 50-100 test examples).

If you agree with this decision for #N1, then I will create a new dataset called PixelHardSynsets, which you should then extract:

import imagenet
dataset = imagenet.dldataset.PixelHardSynsets
ardila commented 11 years ago

MCC confusion matrices / HMO confusion matrix: [screenshot 2013-09-30 at 6:06 pm]

Pixel confusion matrix: [screenshot 2013-09-30 at 6:08 pm]

yamins81 commented 11 years ago

Comments:

1) For both of the options above, we could also replace the "average-L3-hard" score with the "HMO-0" score, correct? By HMO-0, I mean the current HMO model as extracted so far. I haven't yet completely thought through what I think is best here. Or we could do V1-hard or HMAX-hard, right? I am kind of leaning toward HMO0-hard at the moment. What are your thoughts?

2) The method you described might generally be called "worst margin": you pick the images with the worst margin on a classifier. I think this should be amended in two ways:

a) First, we should make sure that any margins are averaged over a set of splits, so that the "bad images" are truly those that have stably bad margins, regardless of the specific distractors.

b) We should include a set of additional distractors that are randomly chosen with respect to margins. The reason is that I have often had the impression that "hard images" (or hard objects) are hard because of the "easier" distractors that are also in a given set. In other words, the presence of those "easier distractors" is important to expose the difficulty of a given "hard" image or object. If we remove all the easy ones, then it might suddenly look easy to solve the hard images, because they get "moved into place" on top of where the easy ones used to be. Then, once we try to combine the solution back in complementarily, it won't work. So, we'll want to keep at least some easy distractors around that are uniformly distributed in image space with regard to margins on the test algorithms (and classes).
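A minimal sketch of both amendments, assuming a hypothetical `margins_per_split` array of per-image worst-margin scores (one row per split); the uniformly random complement implements (b):

```python
import numpy as np

def stable_hard_set(margins_per_split, n_hard, n_random, seed=0):
    """margins_per_split: hypothetical (n_splits, n_images) array of per-image
    worst-margin scores, one row per train/test split."""
    rng = np.random.default_rng(seed)
    stable_scores = margins_per_split.mean(axis=0)          # (a) average over splits
    hard = np.argsort(stable_scores)[:n_hard]               # stably worst-margin images
    remaining = np.setdiff1d(np.arange(stable_scores.size), hard)
    random_fill = rng.choice(remaining, size=n_random, replace=False)  # (b) margin-agnostic sample
    return np.concatenate([hard, random_fill])
```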

3) I assume you think we should draw the images for this set from the pixel-hard synsets, as opposed to a random 250K images. That is why you're saying we'll start extracting the "PixelHardSynsets" set tomorrow, right? How many hard synsets are you thinking? Or will that be set by N1 to fix the size of the total set?

On a separate note, what we're doing here is basically stacking a hierarchical series of increasingly stringent tests to winnow down the set: starting with pixels and using that to cut down the set a lot, then cutting it down further with HMO0- or HMAX- or whatever we decide on point 1) above. We then run THAT through either the HMO procedure directly, or THEN through humans to re-weight it.
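A toy sketch of that stacked winnowing, with placeholder scoring functions standing in for the actual pixel/HMO0/HMAX models:

```python
def winnow(image_ids, stages):
    """stages: ordered list of (score_fn, keep_fraction) pairs, cheapest test
    first (e.g. pixels, then HMO-0/HMAX, then human re-weighting); lower
    score means harder, and each stage keeps only its hardest fraction."""
    surviving = list(image_ids)
    for score_fn, keep_fraction in stages:
        surviving.sort(key=score_fn)                        # hardest (lowest score) first
        surviving = surviving[:max(1, int(len(surviving) * keep_fraction))]
    return surviving
```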

4) From the plot you made for the HMO0 model, I don't agree that we're seeing saturation in performance as a function of training examples. In fact, it looks to me like a slow (approximately logarithmic) increase, much like in the case of HvM. I expect that performance will keep increasing slowly with the number of examples. But I think N1 = 400 is probably fine, since we don't need to push out to saturation; we just need a representative sample.

5) Does this plan relate clearly to the psychophysics plan you came up with a couple of months ago? Can you spell that out a little more explicitly now, again?

ardila commented 11 years ago

Some vocabulary:
challenge subset -> dataset for goal one
imagenet subset -> dataset for goal two

1)

The various options are:
Just pixels
V1 <- probably will require some engineering effort/setup time from me
Hmax <- probably will require some engineering effort/setup time from me
V1+HMax <- probably will require some engineering effort/setup time from me
Random L3
HMO

The problem with HMO-hard is that if we believe HMO is capturing key axes of difficulty, then we will be removing those from the dataset. This is ok if we have some principled way of combining the model we screen on the challenge subset with our existing model, but even if we do, at some point we should think about regularization (how many times is it fair to screen on a new dataset and add more components to the model?). If we are not combining models, then we want to remove only the axes of difficulty that will automatically be captured by almost any member of the model class, which is why I suggested random L3s.

2)

a) Agreed.

b) Once we have a set of tuples with high deltas, (image, distractor_synset, delta), where delta = model margin minus human performance expressed as a margin (using logistic regression), we can construct the imagenet subset in several ways. Here is one suggestion: if we think of the deltas as weights, then every distractor synset will have some amount of weight summed over all its tuples. We should take a random sample of images from each synset whose size is proportional to the synset's weight. There is now one free parameter, the ratio of hard images to images from distractor synsets, which can be set empirically to ensure that the screening set is actually difficult for HMO-0 (see the sketch below).
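A sketch of the delta-weighted sampling, assuming hypothetical `tuples` and `synset_images` containers; the hard-image/distractor ratio mentioned above is left to the caller through `n_total`:

```python
import random
from collections import defaultdict

def sample_by_delta_weight(tuples, synset_images, n_total, seed=0):
    """tuples: hypothetical (image, distractor_synset, delta) triples;
    synset_images: mapping from synset to its candidate image ids.
    Samples images per synset in proportion to its summed delta weight."""
    rng = random.Random(seed)
    weights = defaultdict(float)
    for _, synset, delta in tuples:
        weights[synset] += delta                      # total weight per distractor synset
    total = sum(weights.values())
    sample = []
    for synset, weight in weights.items():
        n_from_synset = int(round(n_total * weight / total))
        pool = synset_images[synset]
        sample.extend(rng.sample(pool, min(n_from_synset, len(pool))))
    return sample
```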

3)

It depends on N1. Since we've agreed N1 = 400 is ok, the number of synsets depends on the budget for extraction, which you said was 250,000 images (833 synsets). If that is correct, then you should begin extraction of PixelHardSynsets ASAP (it should be ready 15 minutes from the time I post this).

ardila commented 11 years ago

PixelHardSynsets is now available: e93d9e99547c2fe05e48d264bf9219589ca9bc54

Here are SVM results (not much different from MCC results): [screenshot 2013-10-01 at 2:58 pm]

I am also running the following classifiers using compute metric base: 5-NN and SGDClassifier
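For reference, a generic scikit-learn sketch of the two classifiers on hypothetical feature/label arrays (this is not the internal compute-metric-base call itself):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

def quick_baselines(features, labels):
    """Cross-validated accuracy for the two classifiers mentioned above."""
    knn = KNeighborsClassifier(n_neighbors=5)
    sgd = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3)  # linear SVM-style objective
    return {
        "5-NN": cross_val_score(knn, features, labels, cv=5).mean(),
        "SGDClassifier": cross_val_score(sgd, features, labels, cv=5).mean(),
    }
```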

ardila commented 11 years ago

[image: hmo conf_mat3 (HMO confusion matrix)]

ardila commented 10 years ago

@yamins81 In talking with Jim about priorities, I think we came to the conclusion that we need to take advantage of the work I've done so far in some way, instead of dropping it all to move to a new problem. Looking through what I have, I was wondering whether you still think that "finding the hard parts of imagenet" is a useful goal.

I'm pretty convinced that I've done this: I have all 2-way results for the best model I can run, and I have found the densest part of the space. I have measured human and model performance at just a few points in this space, and it looks like there is a significant gap with humans (just not in 2-ways, because humans and models are near ceiling there). If you are not convinced of this gap, what would it take to convince you?

Is it possible to run the HMO procedure again on a combination of however much of this dense space would be appropriate + the synthetic set from before?

At the very least, I want to run some sort of apples-to-apples comparison on Imagenet between HMO and the others, especially since I've found that:

  1. The gap on HvM is still significant, and here HMO is the most consistent with humans
  2. The consistency between humans and the convnet models is generally low on imagenet, especially in the dense subspace.