facebookresearch / balance

The balance python package offers a simple workflow and methods for dealing with biased data samples when looking to infer from them to some target population of interest.
https://import-balance.org
GNU General Public License v2.0
686 stars, 42 forks

[BUG] method = 'rake' return AttributeError #73

Open geniusjenny opened 8 months ago

geniusjenny commented 8 months ago

Describe the bug

The same code runs without error with method='ipw' and method='cbps', but raises the error below with method='rake'. The call and the end of the traceback:

sample_with_target.adjust(method = "rake",variables = variables) 
table_current.loc[feature, weight_col]                      
AttributeError: 'numpy.int64' object has no attribute 'loc'

Update on 2023/03/08

This error occurs because some of the bins that appear in the sample never appear in the target. Once I added the sample rows to the target, so that every bin appears in the target, the error disappeared.
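A quick way to check for this situation before calling adjust is to compare, variable by variable, the levels present in the two dataframes. The sketch below uses plain pandas only; the helper name levels_missing_from_target and the toy data are hypothetical and are not part of balance.

```python
import pandas as pd


def levels_missing_from_target(sample_df, target_df, variables):
    """Return {variable: levels seen in the sample but never in the target}.

    Hypothetical helper for illustration; it is not part of the balance API.
    """
    missing = {}
    for var in variables:
        sample_levels = set(sample_df[var].dropna().unique())
        target_levels = set(target_df[var].dropna().unique())
        absent = sample_levels - target_levels
        if absent:
            # Sort by string form so mixed-type levels still sort deterministically.
            missing[var] = sorted(absent, key=str)
    return missing


# Toy data (made up): the sample contains an income bin that the target lacks.
sample_df = pd.DataFrame({"id": ["1", "2", "3"], "income_bin": ["low", "mid", "high"]})
target_df = pd.DataFrame({"id": ["4", "5", "6"], "income_bin": ["low", "mid", "mid"]})

missing = levels_missing_from_target(sample_df, target_df, ["income_bin"])
print(missing)
```

An empty result means every level in the sample also occurs in the target for the listed variables. A non-empty result points at the bins that would need to be merged, dropped, or otherwise handled before raking.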

Session information

Please paste here the output of running the following in your notebook/terminal:

# Sessions info
import session_info
session_info.show(html=False, dependencies=True)

balance 0.9.1 balance_functions NA boto3 1.28.28 dateutil 2.8.2 matplotlib 3.7.2 numpy 1.24.4 pandas 1.4.3 psutil 5.9.5 seaborn 0.12.2 session_info 1.0.0 tqdm 4.65.0

OpenSSL 23.2.0 PIL 10.0.0 anyio NA arrow 1.2.3 asttokens NA attr 23.1.0 attrs 23.1.0 babel 2.12.1 backcall 0.2.0 beta_ufunc NA binom_ufunc NA botocore 1.31.28 brotli NA certifi 2023.05.07 cffi 1.15.1 charset_normalizer 3.2.0 cloudpickle 2.2.1 colorama 0.4.4 comm 0.1.3 coxnet NA cryptography 41.0.2 cvcompute NA cvelnet NA cvfishnet NA cvglmnet NA cvglmnetCoef NA cvglmnetPredict NA cvlognet NA cvmrelnet NA cvmultnet NA cycler 0.10.0 cython_runtime NA debugpy 1.6.7 decorator 5.1.1 defusedxml 0.7.1 elnet NA executing 1.2.0 fastjsonschema NA fishnet NA fqdn NA fsspec 2023.6.0 glmnet NA glmnetCoef NA glmnetControl NA glmnetPredict NA glmnetSet NA glmnet_python NA google NA hypergeom_ufunc NA idna 3.4 ipfn NA ipykernel 6.24.0 ipython_genutils 0.2.0 ipywidgets 8.0.7 isoduration NA jedi 0.18.2 jinja2 3.1.2 jmespath 1.0.1 joblib 1.3.1 json5 NA jsonpointer 2.4 jsonschema 4.18.4 jsonschema_specifications NA jupyter_events 0.6.3 jupyter_server 2.7.0 jupyterlab_server 2.23.0 kiwisolver 1.4.4 loadGlmLib NA lognet NA markupsafe 2.1.3 matplotlib_inline 0.1.6 mpl_toolkits NA mrelnet NA nbformat 5.9.1 nbinom_ufunc NA ncf_ufunc NA overrides NA packaging 21.3 parso 0.8.3 patsy 0.5.3 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA platformdirs 3.9.1 plotly 5.15.0 prometheus_client NA prompt_toolkit 3.0.39 ptyprocess 0.7.0 pure_eval 0.2.2 pyarrow 12.0.1 pydev_ipython NA pydevconsole NA pydevd 2.9.5 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.15.1 pyparsing 3.0.9 pythonjsonlogger NA pytz 2023.3 referencing NA requests 2.31.0 rfc3339_validator 0.1.4 rfc3986_validator 0.1.1 rpds NA s3fs 0.4.2 scipy 1.9.1 send2trash NA six 1.16.0 sklearn 1.3.0 sniffio 1.3.0 socks 1.7.1 stack_data 0.6.2 statsmodels 0.14.0 tenacity NA threadpoolctl 3.2.0 tornado 6.3.2 traitlets 5.9.0 typing_extensions NA uri_template NA urllib3 1.26.14 wcwidth 0.2.6 webcolors 1.13 websocket 1.6.1 wtmean NA yaml 6.0 zmq 25.1.0

IPython 8.14.0 jupyter_client 8.3.0 jupyter_core 5.3.1 jupyterlab 4.0.3 notebook 6.5.4

Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] Linux-5.10.209-198.812.amzn2.x86_64-x86_64-with-glibc2.26

Session information updated at 2024-03-05 04:21

Screenshots

If applicable, add screenshots to help explain your problem. image (4) image (5)

Reproducible example

Please provide us with (any that apply):

  1. Code: code we can run to reproduce the issue (in terminal or python notebook)

     sample = Sample.from_frame(sample_df2[:50])
     target = Sample.from_frame(target_df2[:500])
     sample_with_target = sample.set_target(target)
     adjusted_ads_weight = sample_with_target.adjust(method = "rake",variables = variables_subset2)

     sample_df2 and target_df2 are dataframes with two numerical columns. image (6)

  2. Reference: If the issue is in a tutorial, please provide the link to it, and the exact place in which the code fails.

Additional context

Add any other context about the problem here that might help us solve it.

talgalili commented 8 months ago

Hey @geniusjenny

Thanks for the bug report!

Could you please try running the code from the rake tutorial (https://import-balance.org/docs/tutorials/quickstart_rake/) and see whether the error also occurs there?

What would help me is a fully self-contained reproducible example that I could run in my env to reproduce the error - that would allow me to more easily iterate to get a solution.

Thanks upfront!

geniusjenny commented 8 months ago

Thanks for the replies! For the sample code it runs smoothly with no error. image (7)

talgalili commented 8 months ago

Thanks for checking, @geniusjenny. Is there any way you could experiment and try to reproduce the issue? I suggest looking at sample.df.info() and checking the data types; the hint may be there.

Once you find a way to reproduce the issue, I'll be able to work on it. WDYT?

geniusjenny commented 8 months ago

Hi talgalili, I tried to reproduce the issue but couldn't. I tried using two numerical features ['income', 'happiness'], similar to what I have in my dataset, and the code runs smoothly. I have attached my sample data here so that you can reproduce the issue. Sorry that I couldn't be more helpful.

Thank you so much.

Attachments: sample_test2.csv, target_test2.csv

Code:

s2= pd.read_csv('sample_test2.csv',index_col=0)
t2= pd.read_csv('target_test2.csv',index_col=0)
sample = Sample.from_frame(s2)
target = Sample.from_frame(t2)
sample_with_target = sample.set_target(target)
adjusted_ads_weight1 = sample_with_target.adjust(method = "rake")

talgalili commented 8 months ago

Thanks @geniusjenny

Just to double check, could you please paste the full output you get from running the above code? Please also include the output of sample.df.info() and target.df.info().

Thanks!

geniusjenny commented 8 months ago

Sure! Full output: image (8) image (4) image (5)

df.info: image (9)

talgalili commented 8 months ago

Thanks! Could you please try to bucket the variables and try again?

I think rake should be defined on categorical variables, not numeric ones (how to handle this with a default is a good question - but I'd like to double check that this is indeed the issue).
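One way to bucket numeric variables so that the sample and the target share the same bin labels is sketched below. It uses plain pandas and NumPy; the column names, the toy values, and the choice of three equal-width bins are assumptions made for illustration and do not come from the reporter's data.

```python
import numpy as np
import pandas as pd

# Toy numeric data (made up for illustration).
sample_df = pd.DataFrame({"income": [12.0, 35.0, 58.0, 91.0]})
target_df = pd.DataFrame({"income": [5.0, 20.0, 40.0, 60.0, 80.0, 99.0]})

# Compute the bin edges once, from the pooled values, so that both frames
# are cut with identical edges and therefore get identical labels.
pooled = pd.concat([sample_df["income"], target_df["income"]])
edges = np.linspace(pooled.min(), pooled.max(), num=4)  # 3 equal-width bins
labels = ["low", "mid", "high"]

for df in (sample_df, target_df):
    binned = pd.cut(df["income"], bins=edges, labels=labels, include_lowest=True)
    # Store plain strings rather than the pandas 'category' dtype.
    df["income_bin"] = binned.astype(str)

print(sample_df)
print(target_df)
```

Shared edges only guarantee that the labels match. They do not guarantee that every bin is populated in the target, so it is still worth checking that each bin found in the sample also occurs in the target.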

geniusjenny commented 8 months ago

Hi talgalili, I just tried binning the numerical variables into categorical variables, but the code still returns the same error, while method='cbps' and method='ipw' run smoothly.

Here are the code and df.info: image (10) ERROR: image (11)

talgalili commented 8 months ago

Thanks @geniusjenny. Interesting! Could you please change the dtype of the bucketed variables from 'category' to 'object', and let me know if this resolves the error you get?
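For reference, converting a bucketed column from the pandas 'category' dtype to 'object' can be done with astype. This is a minimal sketch on a made-up column named income_bin, not on the reporter's data.

```python
import pandas as pd

# A made-up bucketed column stored with the 'category' dtype.
df = pd.DataFrame({"income_bin": pd.Categorical(["low", "mid", "low"])})
print(df["income_bin"].dtype)  # category

# Convert to plain Python objects (strings here); the values are unchanged.
df["income_bin"] = df["income_bin"].astype(object)
print(df["income_bin"].dtype)  # object
```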

geniusjenny commented 8 months ago

I also tried that. Still getting the same error.

image

geniusjenny commented 8 months ago

I think I may have found the issue. Some of the bins that appear in the sample never appear in the target, which causes this error. Once I add the sample rows to the target, the error disappears. I suggest the code take this edge case into consideration as well!

t2=pd.concat([s2,t2])
t2.reset_index(inplace=True)
t2['id']=t2.index.astype('str')

image

talgalili commented 8 months ago

Great catch - thanks a bunch @geniusjenny !

O.k., I'll leave this issue open - and we'll get to add a proper exception in the future.

Thanks again.

geniusjenny commented 8 months ago

Thank you!

EmanueleCeglia commented 6 months ago

I'm jumping into this issue because I have the same problem. In my case no data is missing from the target. I am trying to use marginal distributions with rake. If there is no weight column in the "sample" dataframe and in "target_df_from_marginals", one is created automatically with all values equal to 1. I then tried creating the "weight" column myself, both in "target_df_from_marginals" and in the dataframe used to create "sample", using 1.0 instead of 1 so that the dtype is float. This time the error message is: AttributeError: 'numpy.float64' object has no attribute 'loc'. Do you have any suggestions? @talgalili

talgalili commented 6 months ago

Hey @EmanueleCeglia , Do you want to share the code you used? My guess is that you need to add the weight column to the DataFrame of your data before using Sample.from_frame so it will inherit from pandas the relevant methods.

EmanueleCeglia commented 6 months ago

image

df_sorted is the dataframe whose weights I need to calibrate. It has two columns, ctrysize and ctrysect (sorted in alphabetical order). ctrysize combines each of 12 EU countries with the firm's size class (from 1 to 4); ctrysect combines each of the 12 EU countries with the firm's sector (from A to D).

For each of these combinations I have the real EU totals, and I want to use them as margins for the calibration. The pictures below show how I used the totals to create the dictionary "a_dict_with_marginal_distributions": image image then image image image Error: image. I hope that's clear enough; I can provide additional details if needed. Thanks @talgalili

talgalili commented 6 months ago

Hi @EmanueleCeglia

  1. Could you please open a new issue for this discussion? (This seems like a separate issue.)
  2. If you run this tutorial, does it work? https://import-balance.org/docs/tutorials/quickstart_rake/
  3. Notice that you have a huge number of tiny buckets. Regardless of this bug, are you sure you have values for each of them in your sample?

(please let's continue this discussion in the new bug you'll open - thanks)

EmanueleCeglia commented 6 months ago

Hi @talgalili yes the tutorial works perfectly

image

I am going to open a new issue so we can continue there