Explaining the logic behind uniform_sensor_testcases?

flowersteam / explauto

An autonomous exploration library

http://flowersteam.github.io/explauto

GNU General Public License v3.0


Explaining the logic behind uniform_sensor_testcases? #76

Open jgrizou opened 8 years ago

jgrizou commented 8 years ago

I am now looking into the testcases generation. I find the method to generate uniform testcases over the sensory space very appealing. Code is here: https://github.com/flowersteam/explauto/blob/master/explauto/environment/testcase.py

My understanding is that a grid of a given resolution is projected on the sensory space, and that each cell is associated with only one observation from within that cell. My questions concern the resolution parameter:

Is it the number of cuts per dimension?
What is the logic behind the automatic calculation of resolution: resolution = max(2, int((1.3*n)**(1.0/len(robot.s_feats)))) ?
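To make the second question concrete, here is a small sketch that evaluates the quoted formula for a few values. It assumes that resolution is the number of bins per dimension, so that the grid has resolution ** n_dims cells; the thread does not confirm this reading. The helper name auto_resolution and the argument n_dims (standing in for len(robot.s_feats)) are hypothetical.

```python
def auto_resolution(n, n_dims):
    """Formula quoted above; n_dims stands in for len(robot.s_feats)."""
    return max(2, int((1.3 * n) ** (1.0 / n_dims)))


# Print the resolution and the resulting cell count, ASSUMING that
# resolution means "bins per dimension" (not confirmed in the thread).
for n, n_dims in [(20, 2), (100, 2), (100, 3)]:
    res = auto_resolution(n, n_dims)
    print("n=%d dims=%d -> resolution=%d, cells=%d" % (n, n_dims, res, res ** n_dims))
```

Under that assumption the formula picks the largest integer resolution whose cell count does not exceed 1.3*n, so a grid with slightly more cells than requested testcases, with a floor of 2 bins per dimension.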

I also noticed this: # TODO : change obs only if nearer from center of coo.

From what I understand, the observation kept for each cell is the last one encountered in the _populate process. Is the TODO about replacing that with the observation closest to the center of the cell?
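The two behaviours discussed here can be sketched as follows. This is not the code from testcase.py: it is a minimal pure-Python illustration that assumes a regular grid over known bounds, and every name in it (cell_of, cell_center, grid_select, the keep argument) is hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cell_of(obs, lows, highs, resolution):
    """Grid cell (tuple of bin indices) that contains obs."""
    cell = []
    for x, lo, hi in zip(obs, lows, highs):
        idx = int((x - lo) / (hi - lo) * resolution)
        # Clamp so that a value on the upper bound falls in the last bin.
        cell.append(min(max(idx, 0), resolution - 1))
    return tuple(cell)


def cell_center(cell, lows, highs, resolution):
    """Coordinates of the center of a grid cell."""
    return [lo + (i + 0.5) * (hi - lo) / resolution
            for i, lo, hi in zip(cell, lows, highs)]


def grid_select(observations, lows, highs, resolution, keep="last"):
    """Keep one observation per populated cell.

    keep="last":    a later observation overwrites an earlier one.
    keep="nearest": overwrite only if the new observation is closer
                    to the cell center (the behaviour the TODO hints at).
    """
    selected = {}
    for obs in observations:
        cell = cell_of(obs, lows, highs, resolution)
        if keep == "last" or cell not in selected:
            selected[cell] = obs
        else:
            center = cell_center(cell, lows, highs, resolution)
            if sq_dist(obs, center) < sq_dist(selected[cell], center):
                selected[cell] = obs
    return selected


rng = random.Random(0)
observations = [(rng.random(), rng.random()) for _ in range(1000)]
last = grid_select(observations, (0.0, 0.0), (1.0, 1.0), 5, keep="last")
nearest = grid_select(observations, (0.0, 0.0), (1.0, 1.0), 5, keep="nearest")
print(len(last), len(nearest))  # both count the populated cells
```

Both variants return the same number of testcases, since that number only depends on which cells are populated; the "nearest" variant just makes the chosen points sit closer to the cell centers.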

jgrizou commented 8 years ago

One effect of the grid system is that you do not always have an observation in each cell, so when you ask for 100 test_cases, you often end up with fewer.

Another method could be to use k-means, with k = number of test_cases, to find the cell centers, and then to select the observation closest to each cluster center. This ensures you get 100 test_cases if you ask for 100. The result is not truly uniform, but it is a good approximation when k << n_samples. And the code already generates 100 times more samples than testcases: observations = uniform_motor_testcases(robot, 100*n).
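A minimal sketch of this k-means idea is given below. It uses a plain Lloyd's algorithm written with the standard library only, so that it runs without extra dependencies; a library implementation could be swapped in. The function names (kmeans, kmeans_select, nearest_index) are hypothetical, and removing each selected observation from the pool is an added assumption, made so that no observation can be picked twice.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(point, candidates):
    """Index of the candidate closest to point."""
    return min(range(len(candidates)),
               key=lambda i: sq_dist(point, candidates[i]))


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_index(p, centers)].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers


def kmeans_select(observations, k, seed=0):
    """Pick, for each cluster center, the closest remaining observation."""
    centers = kmeans(observations, k, seed=seed)
    remaining = list(observations)
    selected = []
    for center in centers:
        # ASSUMPTION: pop the chosen observation so it cannot be reused,
        # which guarantees k distinct testcases when len(observations) >= k.
        selected.append(remaining.pop(nearest_index(center, remaining)))
    return selected


rng = random.Random(0)
observations = [(rng.random(), rng.random()) for _ in range(1000)]
testcases = kmeans_select(observations, 20)
print(len(testcases))
```

Because the selected points are real observations rather than the cluster centers themselves, each testcase is a sensory outcome that was actually observed.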

Below is a small example: data in blue (1000 points), k-means centers in red (20 points), selected observations in green (20 points).

(Image: screenshot from 2016-08-09 16 23 15)

jgrizou commented 8 years ago

Here is a comparison between the two methods:

Dataset: 1000 points.

Grid: asked for 20 points, got 18. Selected observations in magenta (18 points).

(Image: screenshot from 2016-08-09 16 45 37)

K-means: asked for 20 points, got 20. K-means centers in red (20 points), selected observations in green (20 points).

(Image: screenshot from 2016-08-09 16 45 55)

There is a cluster of points in the bottom-left corner that corresponds to failed experiments, so it is normal that a sample is selected there.

The resolution was automatically computed with the formula from my first post, which gave 5 here. So I guess this is a 5x5 grid, i.e. 25 cells, of which only 18 were populated. K-means does look less uniform.

jgrizou commented 8 years ago

I think I will stick with k-means because it guarantees n points, even though it is not optimal.

What we really want here is a kind of SOM with a constraint that the vectrice should be of similar length. (Scaling the data between 0 and 1 in each dimension beforehand).