alextanski opened this issue 5 years ago
After some inspection, please try margin(weighted=False). There are no weighted counts, hence the failure of margin(), which assumes weighted=True. This is buggy, but you might be able to work around it with this hack. Please let me know if this works (it should) while I address the other issue properly.
Additionally, you can try the following:
```
>>> cube.slices[0]
CubeSlice(name='Shapes of pasta', dim_types='CAT', dims='Shapes of pasta')
                   N
-----------  -------
Bucatini     39.4727
Chitarra     47.873
Boccoli      46.7192
Orecchiette  49.671
Quadrefiore  50.7232
Fileja       38.5867
>>> cube.margin(weighted=False)
array([1658])
>>> cube.margin(weighted=False, include_missing=True)
array([1662])
```
because that should include the "missings" of the opposite dimension, and thus be "unconditional".
@malecki can you comment (if I'm right or wrong)?
P.S. - results are from a different dataset than the one you used, lest there be confusion about the numbers...
Thanks for this @slobodan-ilic. I am not at work today, but from what I see above I guess there is a misunderstanding (probably due to my initial phrasing of the issue) about the "margin result": what we are after in this case is the "margin mean", i.e. the mean across all cases that contain valid data for Shapes of pasta. I am pretty sure that the ScaleMeans implementation of margin() shows exactly that.
To add to my comment from above, here is the docstring from the ScaleMeans version:
```python
def margin(self, axis):
    """Return marginal value of the current slice scaled means.

    This value is the same as what you would get from a single variable
    (constituting a 2D cube/slice), when the "non-missing" filter of the
    opposite variable would be applied. This behavior is consistent with
    what is visible in the front-end client.
    """
```
We would need a numeric data equivalent here.
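For intuition, here is a minimal NumPy sketch of what such a "margin mean" amounts to conceptually (the numeric values of the categories weighted by the unconditional margin counts). The numbers are made up and this is not the crunch-cube implementation:

```python
import numpy as np

# Made-up numeric values assigned to the categories of a scale variable,
# plus the unconditional margin counts for those same categories.
numeric_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
margin_counts = np.array([120, 340, 510, 390, 298])

# The "margin mean" is then just the count-weighted mean of the numeric
# values, i.e. the mean over all cases with valid data on this variable.
margin_mean = (numeric_values * margin_counts).sum() / margin_counts.sum()
print(margin_mean)
```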
I don't think scale_means is the thing we should be looking at here, whether it's the values or the margin thereof. It's a property of the single variable alone (numeric values of the categories, combined with counts). The mean is calculated differently on the server side: it actually calculates the mean of a different variable (the one you select in the mean measure) and then returns those values for each category of the original categorical variable (presenting the resulting cube as "just that" categorical variable, even though the results are actually means of a different variable that you don't explicitly see in the cube result).
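Purely as an illustration of that description (not the server code), the per-category mean of a different variable looks like this, with made-up data:

```python
import numpy as np

# Made-up raw data: category codes of the categorical variable and the
# values of the (different) numeric variable selected in the mean measure.
category = np.array([0, 0, 1, 1, 1, 2, 2])
measured = np.array([39.0, 40.0, 47.5, 48.2, 48.0, 46.7, 46.7])

# For each category, the cube reports the mean of the *other* variable,
# presented under the labels of the categorical variable.
means_per_category = np.array(
    [measured[category == code].mean() for code in np.unique(category)]
)
print(means_per_category)
```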
So the solution here (and I've confirmed this with @malecki) is to just make an additional request to the server for what you want, the same way you'd have to do it in our web client. For the case that I've used, it would be something like this (just use it without the crossed_by part):
```
>>> cube = fetch_cube(ds, [], mean=mean)
>>> cube
CrunchCube(name='None', dim_types='')
>>> cube.as_array()
array(47.35198556)
```
Hm... I've just figured out that the weight argument in fetch_cube works a little bit oddly. If you apply it, it sets the weight for the entire dataset, and everything from there on is weighted. It even sets the weight for me in the web app, which I wouldn't expect.
Ha! I did not even know that this is possible (I tried to simply fetch an "empty" cube, but I think I passed None or something). This is a perfect solution for #157. It does not solve the particular issue outlined here, though, as the empty-cube mean is not the same as the crossed_by "marginal" one. The former is truly unconditional, while the latter should be restricted by the valid data for that dimension.
If this result is simply not obtainable server-side, this is perfectly fine and we should not dig any deeper. This issue came up from looking at the interface from a consistency perspective. #157 (which should be solved by the code above) constituted a real blocker. Thanks @slobodan-ilic and @malecki for looking at that so quickly!
@jamesrkg: Agree to simply close and let both issues rest for now, once I have checked against Rogo's deck / dataset?
We're planning some work to improve how numeric variables are dealt with, in particular in multitables, where the approach is already to use the unconditional row variable as the first subcube, followed by whatever other conditioning column variables. It is not currently possible to request the cube_mean measure of numerics via the multitable export endpoint at all, and the first task will be to remedy that.
Is there any way I can get the margin result for a 2D cube (the first dimension being a numerical variable, crossed by a categorical), i.e. the mean of all cases with non-empty data for both dimensions? I am unable to find something that works like the cube.measures.scale_means.ScaleMeans.margin() method for numerical data.

Example setup: setting up the measure for the mean, then using pycrunch.cubes.fetch_cube and the CrunchCube API to query the results from Crunch (sketched below).
I guess the 1D structure of that cube would cause a margin() result to fail anyway? Which leads to the related question of how I would get unconditional / 1D statistics on numerical data in general: #157