alextanski opened this issue 5 years ago
After some inspection, please try margin(weighted=False). There are no weighted counts, hence the failure of margin(), which assumes weighted=True. This is buggy, but you might be able to work around it with this hack. Please let me know if this works (it should) while I address the other issue properly.
Additionally, you can try the following:
```
>>> cube.slices[0]
CubeSlice(name='Shapes of pasta', dim_types='CAT', dims='Shapes of pasta')
                   N
-----------  -------
Bucatini     39.4727
Chitarra     47.873
Boccoli      46.7192
Orecchiette  49.671
Quadrefiore  50.7232
Fileja       38.5867
>>> cube.margin(weighted=False)
array([1658])
>>> cube.margin(weighted=False, include_missing=True)
array([1662])
```
because that should include the "missings" of the opposite dimension, and thus be "unconditional".
@malecki can you comment (if I'm right or wrong)?
P.S. - results are from a different dataset than the one you used, lest there be confusion about the numbers...
Thanks for this @slobodan-ilic. I am not at work today, but from what I see above I guess there is a misunderstanding (probably due to my initial phrasing of the issue) about the "margin result": what we are after in this case is the "margin mean", i.e. the mean across all cases that contain valid data for Shapes of pasta. I am pretty sure that the ScaleMeans implementation of margin() shows exactly that.
To add to my comment from above, here is the docstring from the ScaleMeans version:
```python
def margin(self, axis):
    """Return marginal value of the current slice scaled means.

    This value is the same as what you would get from a single variable
    (constituting a 2D cube/slice), when the "non-missing" filter of the
    opposite variable would be applied. This behavior is consistent with
    what is visible in the front-end client.
    """
```
We would need a numeric data equivalent here.
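For intuition, here is a minimal NumPy sketch of what such a "margin mean" amounts to conceptually (the numeric values of the categories weighted by the unconditional margin counts). The numbers are made up and this is not the crunch-cube implementation:

```python
import numpy as np

# Made-up numeric values assigned to the categories of a scale variable,
# plus the unconditional margin counts for those same categories.
numeric_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
margin_counts = np.array([120, 340, 510, 390, 298])

# The "margin mean" is then just the count-weighted mean of the numeric
# values, i.e. the mean over all cases with valid data on this variable.
margin_mean = (numeric_values * margin_counts).sum() / margin_counts.sum()
print(margin_mean)
```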
I don't think scale_means is the thing we should be looking at here, whether it's the values or the margin thereof. It's a property of the single variable alone (numeric values of the categories, combined with counts). The mean is calculated differently on the server side: it actually calculates the mean of a different variable (the one you select in the mean measure) and then returns those values for each category of the original categorical variable (presenting the resulting cube as "just that" categorical variable, even though the results are actually means of a different variable that you don't explicitly see in the cube result).
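Purely as an illustration of that description (not the server code), the per-category mean of a different variable looks like this, with made-up data:

```python
import numpy as np

# Made-up raw data: category codes of the categorical variable and the
# values of the (different) numeric variable selected in the mean measure.
category = np.array([0, 0, 1, 1, 1, 2, 2])
measured = np.array([39.0, 40.0, 47.5, 48.2, 48.0, 46.7, 46.7])

# For each category, the cube reports the mean of the *other* variable,
# presented under the labels of the categorical variable.
means_per_category = np.array(
    [measured[category == code].mean() for code in np.unique(category)]
)
print(means_per_category)
```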
So the solution here (and I've confirmed this with @malecki) is to just make an additional request to the server for what you want, the same way you'd have to do it in our web client. For the case that I've used, it would be something like this (just use it without the crossed_by part):
```
>>> cube = fetch_cube(ds, [], mean=mean)
>>> cube
CrunchCube(name='None', dim_types='')
>>> cube.as_array()
array(47.35198556)
```
Hm... I've just figured out that the weight argument in fetch_cube works a little bit oddly. If you apply it, it sets the weight for the entire dataset, and everything from there on is weighted. It even sets the weight for me in the web app, which I wouldn't expect.
Ha! I did not even know that this is possible (I tried to simply fetch an "empty" cube, but I think I passed None or something). This is a perfect solution for #157. It does not solve the particular issue outlined here, though, as the empty-cube mean is not the same as the crossed_by "marginal" one. The former is truly unconditional, while the latter should be restricted by the valid data for that dimension.
If this result is simply not obtainable server-side, this is perfectly fine and we should not dig any deeper. This issue came up from looking at the interface from a consistency perspective. #157 (which should be solved by the code above) constituted a real blocker. Thanks @slobodan-ilic and @malecki for looking at that so quickly!
@jamesrkg: Agree to simply close and let both issues rest for now, once I have checked against Rogo's deck / dataset?
We're planning some work to improve how numeric variables are dealt with, in particular in multitables, where the approach is already to use the unconditional row variable as the first subcube, followed by whatever other conditioning column variables. It is not currently possible to request the cube_mean measure of numerics via the multitable export endpoint at all, and the first task will be to remedy that.
Is there any way I can get the margin result for a 2D cube (the first dimension being a numerical variable, crossed by a categorical), i.e. the mean of all cases with non-empty data for both dimensions? I am unable to find something that works like the cube.measures.scale_means.ScaleMeans.margin() method for numerical data.

Example setup: setting up the measure for the mean, then using pycrunch.cubes.fetch_cube and the CrunchCube API to query the results from Crunch (sketched below).
I guess the 1D structure of that cube would cause a margin() result to fail anyway? Which leads to the related question of how I would get unconditional / 1D statistics on numerical data in general: #157