Open geniusjenny opened 8 months ago
Hey @geniusjenny
Thanks for the bug report!
Could you please try running the code from the rake tutorial (https://import-balance.org/docs/tutorials/quickstart_rake/) and see whether it runs correctly for you?
What would help me most is a fully self-contained, reproducible example that I can run in my own environment to reproduce the error. That would let me iterate toward a solution much more easily.
Thanks upfront!
Thanks for the replies! For the sample code it runs smoothly with no error.
Thanks for checking, @geniusjenny. Is there any way you could experiment and try to find a way to reproduce the issue? I suggest you look at sample.df.info() and check the data types; the hint may be there.
Once you find a way to reproduce the issue, I'll be able to work on it. WDYT?
Hi talgalili, I tried to reproduce the issue with the tutorial data but couldn't. I tried using two numerical features, ['income', 'happiness'], similar to what I have in my dataset, and the code runs smoothly. I have attached my sample data here so you can reproduce the issue. Sorry that I couldn't be more helpful.
Thank you so much. Attachments: sample_test2.csv, target_test2.csv. Code:
import pandas as pd
from balance import Sample

s2 = pd.read_csv('sample_test2.csv', index_col=0)
t2 = pd.read_csv('target_test2.csv', index_col=0)
sample = Sample.from_frame(s2)
target = Sample.from_frame(t2)
sample_with_target = sample.set_target(target)
adjusted_ads_weight1 = sample_with_target.adjust(method="rake")
Thanks @geniusjenny
Just to double check, could you please paste the full output you get when running the above code? Please also include the output of sample.df.info() and target.df.info().
Thanks!
Sure! Full output:
df.info:
Thanks! Could you please try to bucket the variables and try again?
I think rake should be defined on categorical variables and not numeric ones. (How to handle this with a sensible default is a good question, but first I'd like to double check that this is indeed the issue.)
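A minimal sketch of the bucketing suggested above, using pandas only. The helper name bin_with_shared_edges and the choice of quantile bins are illustrative assumptions, not part of balance. The point it shows is that the bin edges are computed once, on the combined data, so the sample and the target end up with comparable labels.

```python
import pandas as pd


def bin_with_shared_edges(sample_col, target_col, n_bins=4):
    """Bucket two numeric Series using one common set of bin edges.

    Hypothetical helper for illustration; it is not a balance function.
    """
    combined = pd.concat([sample_col, target_col], ignore_index=True)
    # Quantile edges from the combined data; duplicates="drop" merges
    # repeated edges that arise when many values are tied.
    _, edges = pd.qcut(combined, q=n_bins, retbins=True, duplicates="drop")
    # include_lowest=True keeps the overall minimum inside the first bin.
    sample_binned = pd.cut(sample_col, bins=edges, include_lowest=True).astype(str)
    target_binned = pd.cut(target_col, bins=edges, include_lowest=True).astype(str)
    return sample_binned, target_binned


sample_income = pd.Series([10.0, 20.0, 3.0, 4.0])
target_income = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
sample_bins, target_bins = bin_with_shared_edges(sample_income, target_income)
```

Shared edges make the labels comparable, but they do not by themselves guarantee that every bin present in the sample also occurs in the target.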
Hi talgalili, I just tried binning the numerical variables into categorical variables, but the code still returns the same error, while method='cbps' and method='ipw' run smoothly.
Here are the code and df.info: ERROR:
Thanks @geniusjenny, interesting! Could you please change the dtype of the bucketed variables from 'categorical' to 'object', and let me know if this resolves the error you get?
I also tried that. Still getting the same error.
I think I may have found the issue. Some of the bins that appear in the sample never appear in the target, and this causes the error. Once I add the sample to the target, the bug disappears. I suggest the code take this edge case into consideration as well!
t2=pd.concat([s2,t2])
t2.reset_index(inplace=True)
t2['id']=t2.index.astype('str')
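The edge case described above can be checked before calling adjust. The following is a rough sketch, not balance code: the function name levels_missing_from_target and the column names are made up for illustration. It only relies on column lookup and iteration, so it should accept plain dicts of lists as well as pandas DataFrames.

```python
def levels_missing_from_target(sample, target, columns):
    """Return, per column, the levels seen in the sample but not in the target.

    Hypothetical pre-check for illustration; it is not part of balance.
    """
    missing = {}
    for col in columns:
        extra = set(sample[col]) - set(target[col])
        if extra:
            # Sort by string form so mixed-type levels still get a stable order.
            missing[col] = sorted(extra, key=str)
    return missing


# Toy data: the "high" income bin occurs only in the sample.
sample_bins = {"income_bin": ["low", "mid", "high"], "happiness_bin": ["a", "b"]}
target_bins = {"income_bin": ["low", "mid"], "happiness_bin": ["a", "b", "c"]}

print(levels_missing_from_target(sample_bins, target_bins,
                                 ["income_bin", "happiness_bin"]))
# {'income_bin': ['high']}
```

An empty result means every sample level also occurs in the target for the listed columns. A non-empty result points at the bins that would need to be merged, dropped, or otherwise handled.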
Great catch - thanks a bunch @geniusjenny !
O.k., I'll leave this issue open - and we'll get to add a proper exception in the future.
Thanks again.
Thank you!
I'm jumping into this issue because I have the same problem. In my case there is no missing data in the target. I am trying to use marginal distributions with rake. If there is no weight column in the "sample" dataframe and in "target_df_from_marginals", one is created automatically with all values equal to 1. I then tried creating the "weight" column myself in both "target_df_from_marginals" and the dataframe used to create "sample", using 1.0 instead of 1 so that the dtype is float. This time the error message is: AttributeError: 'numpy.float64' object has no attribute 'loc'. Do you have any suggestions? @talgalili
Hey @EmanueleCeglia ,
Do you want to share the code you used?
My guess is that you need to add the weight column to the DataFrame of your data before using Sample.from_frame
so it will inherit from pandas the relevant methods.
df_sorted is a dataframe with two columns, ctrysize and ctrysect (sorted in alphabetical order); it is the dataframe whose weights I need to calibrate. ctrysize combines each of 12 EU countries with the firm's size class (from 1 to 4); ctrysect combines each of the 12 EU countries with the firm's sector (from A to D).
For each of these combinations I have the real EU totals, and I want to use them as margins for the calibration. In the picture below you can see how I used the totals to create the dictionary "a_dict_with_marginal_distributions", followed by the error. I hope it's clear enough; in any case I can provide additional details. Thanks @talgalili
Hi @EmanueleCeglia
(please let's continue this discussion in the new bug you'll open - thanks)
Hi @talgalili yes the tutorial works perfectly
I am going to open a new issue so we can continue there
Describe the bug
The same code runs without error with method='ipw' and method='cbps', but returns the error below when using raking. The code below returns the error.
Update on 2023/03/08
This error occurs because some of the bins that appear in the sample never appear in the target. Once I add the sample to the target, so that all bins appear in the target, the error disappears.
Session information
Please paste here the output of running the following in your notebook/terminal:
balance 0.9.1 balance_functions NA boto3 1.28.28 dateutil 2.8.2 matplotlib 3.7.2 numpy 1.24.4 pandas 1.4.3 psutil 5.9.5 seaborn 0.12.2 session_info 1.0.0 tqdm 4.65.0
OpenSSL 23.2.0 PIL 10.0.0 anyio NA arrow 1.2.3 asttokens NA attr 23.1.0 attrs 23.1.0 babel 2.12.1 backcall 0.2.0 beta_ufunc NA binom_ufunc NA botocore 1.31.28 brotli NA certifi 2023.05.07 cffi 1.15.1 charset_normalizer 3.2.0 cloudpickle 2.2.1 colorama 0.4.4 comm 0.1.3 coxnet NA cryptography 41.0.2 cvcompute NA cvelnet NA cvfishnet NA cvglmnet NA cvglmnetCoef NA cvglmnetPredict NA cvlognet NA cvmrelnet NA cvmultnet NA cycler 0.10.0 cython_runtime NA debugpy 1.6.7 decorator 5.1.1 defusedxml 0.7.1 elnet NA executing 1.2.0 fastjsonschema NA fishnet NA fqdn NA fsspec 2023.6.0 glmnet NA glmnetCoef NA glmnetControl NA glmnetPredict NA glmnetSet NA glmnet_python NA google NA hypergeom_ufunc NA idna 3.4 ipfn NA ipykernel 6.24.0 ipython_genutils 0.2.0 ipywidgets 8.0.7 isoduration NA jedi 0.18.2 jinja2 3.1.2 jmespath 1.0.1 joblib 1.3.1 json5 NA jsonpointer 2.4 jsonschema 4.18.4 jsonschema_specifications NA jupyter_events 0.6.3 jupyter_server 2.7.0 jupyterlab_server 2.23.0 kiwisolver 1.4.4 loadGlmLib NA lognet NA markupsafe 2.1.3 matplotlib_inline 0.1.6 mpl_toolkits NA mrelnet NA nbformat 5.9.1 nbinom_ufunc NA ncf_ufunc NA overrides NA packaging 21.3 parso 0.8.3 patsy 0.5.3 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA platformdirs 3.9.1 plotly 5.15.0 prometheus_client NA prompt_toolkit 3.0.39 ptyprocess 0.7.0 pure_eval 0.2.2 pyarrow 12.0.1 pydev_ipython NA pydevconsole NA pydevd 2.9.5 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.15.1 pyparsing 3.0.9 pythonjsonlogger NA pytz 2023.3 referencing NA requests 2.31.0 rfc3339_validator 0.1.4 rfc3986_validator 0.1.1 rpds NA s3fs 0.4.2 scipy 1.9.1 send2trash NA six 1.16.0 sklearn 1.3.0 sniffio 1.3.0 socks 1.7.1 stack_data 0.6.2 statsmodels 0.14.0 tenacity NA threadpoolctl 3.2.0 tornado 6.3.2 traitlets 5.9.0 typing_extensions NA uri_template NA urllib3 1.26.14 wcwidth 0.2.6 webcolors 1.13 websocket 1.6.1 wtmean NA yaml 6.0 zmq 25.1.0
IPython 8.14.0 jupyter_client 8.3.0 jupyter_core 5.3.1 jupyterlab 4.0.3 notebook 6.5.4
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0] Linux-5.10.209-198.812.amzn2.x86_64-x86_64-with-glibc2.26
Session information updated at 2024-03-05 04:21
Screenshots
If applicable, add screenshots to help explain your problem.
Reproducible example
Please provide us with (any that apply):
Code: code we can run to reproduce the issue (in terminal or python notebook)
sample = Sample.from_frame(sample_df2[:50])
target = Sample.from_frame(target_df2[:500])
sample_with_target = sample.set_target(target)
adjusted_ads_weight = sample_with_target.adjust(method="rake", variables=variables_subset2)
sample_df2 and target_df2 are dataframes with two numerical columns.
Reference: If the issue is in a tutorial, please provide the link to it, and the exact place in which the code fails.
Additional context
Add any other context about the problem here that might help us solve it.