AI-SDC / ACRO

Tools for the Automatic Checking of Research Outputs. These are the tools for researchers to use as drop-in replacements for commands that produce outputs in Stata, Python and R.
MIT License

updating crosstab #167

Closed mahaalbashir closed 1 year ago

mahaalbashir commented 1 year ago

Solves the shape mismatch that occurs when there are two columns and the aggfunc is count or sum.

codecov[bot] commented 1 year ago

Codecov Report

Merging #167 (79e2852) into main (eb1a405) will increase coverage by 0.79%. Report is 6 commits behind head on main. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #167      +/-   ##
==========================================
+ Coverage   98.82%   99.61%   +0.79%     
==========================================
  Files           9        9              
  Lines        1020     1038      +18     
==========================================
+ Hits         1008     1034      +26     
+ Misses         12        4       -8     
| Files | Coverage Δ |
|---|---|
| acro/acro_tables.py | 99.76% <100.00%> (+1.98%) |

mahaalbashir commented 1 year ago

I now delete empty columns from the table regardless of the aggregation function. Although the errors only occurred when the aggfunc is count or sum, the masks always drop the all-zero columns, so I thought we would always want those columns dropped from the table as well.
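
As a rough illustration of what "deleting the empty columns" could look like in pandas, here is a minimal sketch. The helper name `drop_empty_columns` and the sample table are hypothetical and are not taken from `acro/acro_tables.py`; the sketch assumes "empty" means every cell in the column is zero or NaN.

```python
import pandas as pd


def drop_empty_columns(table: pd.DataFrame) -> pd.DataFrame:
    """Return `table` without the columns whose cells are all zero or NaN.

    Hypothetical helper for illustration only; not the ACRO implementation.
    """
    # Treat NaN like zero so that both kinds of "empty" cell are handled alike.
    filled = table.fillna(0)
    # Keep a column if at least one of its cells is non-zero.
    keep = (filled != 0).any(axis=0)
    return table.loc[:, keep]


# Column "B" is all zeros, as a count or sum aggregation might produce.
table = pd.DataFrame(
    {"A": [3, 0, 5], "B": [0, 0, 0], "C": [1, 2, 0]},
    index=["x", "y", "z"],
)
trimmed = drop_empty_columns(table)
print(trimmed)
```

Dropping the same columns from the table and from every mask keeps their shapes aligned, which is the mismatch this PR is concerned with.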

jim-smith commented 1 year ago

@mahaalbashir So it looks like your solution always deletes empty columns from tables, even if suppress==False.

A couple of comments:

  1. It looks like Stata does this by default for frequency tables (but not for interaction coefficients), so that won't be too unexpected for researchers.

  2. Can you confirm the circumstances under which this is already the default behaviour for crosstab, please?

    • I think from what you said it already does this for mean and standard deviation, but not for count and sum?
  3. Your code only does this for columns. Does this never apply to rows?

mahaalbashir commented 1 year ago

@jim-smith

  1. The solution always deletes empty columns from tables, even if suppress==False, because in the current version of the code the masks are applied to the table and the suppressed table is calculated even if suppress==False. Then, if suppress is True, the returned table is the suppressed table; otherwise it is the original table.

  2. The circumstances under which this is the default behaviour for crosstab:

    • What I noticed while running the pandas version of crosstab with different aggfuncs is that, when the survivor column is used and the aggfunc is mean or std, empty cells are represented as NaN. Therefore, a column with only empty values is deleted. However, if the aggfunc is count or sum, empty cells are represented as zeros. Therefore, an empty column is not deleted and all its values are zeros.
    • If the status column is used, empty cells are represented as NaN regardless of the aggfunc, and any columns with only empty values are deleted.
    • The difference between the status and the survivor columns is that the status column is of type object while the survivor column is of type category.
  3. It happens for rows as well. I will include that in the code.
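
The difference between category and object columns described above can be reproduced on a toy example. This sketch uses `groupby(..., observed=False)` as a stand-in for the aggregation that crosstab performs, because crosstab's own handling of unobserved categories may differ between pandas versions. The column names and data are made up for illustration and are not the dataset discussed in this thread.

```python
import pandas as pd

# Toy data: only groups "a" and "b" actually occur.
df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})

# Category dtype that also declares "c", a category with no rows.
as_category = df.assign(
    group=pd.Categorical(df["group"], categories=["a", "b", "c"])
)

# observed=False keeps unobserved categories in the result.
grouped = as_category.groupby("group", observed=False)["value"]
cat_sum = grouped.sum()      # the empty group "c" is filled with 0
cat_count = grouped.count()  # the empty group "c" is filled with 0
cat_mean = grouped.mean()    # the empty group "c" is NaN

# With object dtype there is no declared-but-unused category at all.
obj_sum = df.groupby("group")["value"].sum()

print(cat_sum, cat_count, cat_mean, obj_sum, sep="\n\n")
```

Because count and sum produce zeros rather than NaN for the unused category, a NaN-based drop leaves that column in place, which is consistent with the behaviour reported for the survivor column.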