facebookresearch / balance

The balance python package offers a simple workflow and methods for dealing with biased data samples when looking to infer from them to some target population of interest.
https://import-balance.org
GNU General Public License v2.0
687 stars, 42 forks

[BUG] marginal distribution with rake #80

Open EmanueleCeglia opened 6 months ago

EmanueleCeglia commented 6 months ago

While attempting to calibrate the margins of a sample derived from a survey (df dataframe), I encounter the error displayed at the end of the code flow.

The margins used for calibration are real totals of country x size and country x sector, in the same order as obtained through the command sorted(set(df['ctrysize/sect'])).

[Eight screenshots of the code flow and the resulting error; images not preserved]

talgalili commented 6 months ago

Hi @EmanueleCeglia, thanks for the report. Any chance you could prepare some sample data (self-contained code, no files) that reproduces your issue? I need to run it locally to be able to reproduce and fix it.

Thanks.

EmanueleCeglia commented 6 months ago

Hi @talgalili, I didn't know how to do that, so I created a public repository where you can run the code yourself and see the bug: https://github.com/EmanueleCeglia/marginal-distribution-with-rake.git I hope that works for you.

Thanks :)

EmanueleCeglia commented 6 months ago

@talgalili Hi, sorry to bother you. Do you have any news about the bug? If the repository doesn't work for you, we can find another solution. Best regards, Emanuele

talgalili commented 6 months ago

Hi @EmanueleCeglia, the simplest thing for me to work with would be code that I can run (without external files) and that reproduces the problem. You can use .to_dict() on a DataFrame to create such a piece of code, and then use pd.DataFrame(the_dict) to get it back into a DataFrame. The challenge for you is to find the smallest possible case that still reproduces the issue (so that the code you paste won't be too long). Could you try to do that?

Thanks!
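A minimal sketch of that round trip, using `DataFrame.to_dict(orient="list")` for the dump. The column names are taken from this thread, but the rows are made-up placeholders, not the reporter's data:

```python
import pandas as pd

# Hypothetical stand-in for the real survey data; the values are invented.
df = pd.DataFrame(
    {
        "ctrysize": ["DE1", "DE1", "IT1", "FR4"],
        "ctrysect": ["DEB", "DEC", "ITC", "FRC"],
    }
)

# Step 1 (run once on the real data): dump a small slice as a plain dict.
as_dict = df.head(4).to_dict(orient="list")
print(as_dict)  # paste the printed dict literal into the issue

# Step 2 (what goes into the issue): rebuild the DataFrame from that literal.
rebuilt = pd.DataFrame(as_dict)
```

Shrinking the slice until the error disappears is one way to find the smallest case that still fails.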

EmanueleCeglia commented 6 months ago

Hi @talgalili, I understand. I will try to do this as soon as possible and get back to you. Thanks a lot for your availability. Best, Emanuele

crispy-wonton commented 6 months ago

Hi @talgalili and @EmanueleCeglia, we ran into a similar issue recently. Ours stemmed from the ipfn package. We forked the ipfn repo with a fix; see here: https://github.com/Dirguis/ipfn/compare/master...nestauk:ipfn:master

It seems this error occurs when using rake with a pandas DataFrame in which a particular feature category appears only once in the sample. If a category has just 1 row, the result gets converted into a NumPy array when you .loc for that category. I think the error has something to do with this .loc step going wrong on the NumPy array because of some kind of recursiveness (?).
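One pandas behaviour that may be related is sketched below. This is an assumption: the thread does not confirm the exact mechanism inside ipfn. With a toy Series, `.loc` returns a different type depending on how many rows carry the label:

```python
import pandas as pd

# Toy weights indexed by category; "b" occurs exactly once.
weights = pd.Series([1.0, 2.0, 3.0], index=["a", "a", "b"])

many = weights.loc["a"]    # label with several rows -> Series
single = weights.loc["b"]  # label with a single row -> NumPy scalar, not a Series

print(type(many).__name__, type(single).__name__)

# Selecting with a list keeps the result a Series regardless of the row count.
always_series = weights.loc[["b"]]
```

Code that assumes a Series in both cases can fail only for the single-row category, which would match the symptom described here.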

talgalili commented 6 months ago

Thanks for this! Could you please propose a PR for me to review?

talgalili commented 6 months ago

Oh, I now see that this is a bug in ipfn (not in balance).

I think it's possible to fix this issue in balance using a monkey patch, as was done here: https://github.com/facebookresearch/balance/blob/cf22b9f2b008f9d57f98a376fb148ddcc888f060/balance/weighting_methods/ipw.py#L32 (until ipfn fixes the issue).

@crispy-wonton do you want to try a PR on adding this hack to balance? (or do you think it's easier to redirect the installation to just use your repo, WDYT?)

EmanueleCeglia commented 6 months ago

Hi @talgalili @crispy-wonton, thanks for your feedback. I tried these combinations:

1: remove the categories that have only one observation (and the related margins) -> usual error
2: update the ipfn.py file with the recommended changes (keeping all categories) -> usual error
3: update the ipfn.py file and remove the categories that have only one observation (and the related margins) -> works
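A minimal sketch of how the single-observation categories could be located and dropped with pandas. The helper name `singleton_levels` and the sample rows are illustrative assumptions; only the column names come from this thread:

```python
import pandas as pd


def singleton_levels(frame, columns):
    """Return, per column, the sorted levels that occur exactly once."""
    result = {}
    for col in columns:
        counts = frame[col].value_counts()
        result[col] = sorted(counts[counts == 1].index)
    return result


# Made-up rows; "IT1" and "ITC" each appear only once.
df = pd.DataFrame(
    {
        "ctrysize": ["DE1", "DE1", "DE2", "DE2", "IT1"],
        "ctrysect": ["DEB", "DEB", "DEC", "DEC", "ITC"],
    }
)

singles = singleton_levels(df, ["ctrysize", "ctrysect"])

# Drop every row that carries a singleton level in any of the columns.
keep = pd.Series(True, index=df.index)
for col, levels in singles.items():
    keep &= ~df[col].isin(levels)
filtered = df[keep]
```

The matching entries would also have to be removed from the target margins, as noted in options 1 and 3.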

Now the only thing that I have to explore is why some categories are grouped together and so at the end they are not balanced.

INFO (2024-05-21 16:30:13,119) [rake/rake (line 154)]: Final covariates and levels that will be used in raking: {'ctrysize': ['_lumped_other', 'DE4', 'DE3', 'DE2', 'FR4', 'DE1', 'IT1'], 'ctrysect': ['_lumped_other', 'ESC', 'DEB', 'FRC', 'ITC', 'DEC', 'DED']}.

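[image: https://github.com/facebookresearch/balance/assets/99983605/27680136-5a28-456f-a79b-9912fdecb8f0]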

talgalili commented 6 months ago

Thank you for the update! Your checks leave me confused. I don't understand why using both solutions is the only thing that works. Do you have any guesses?

EmanueleCeglia commented 6 months ago

Hi @talgalili, here I am with a few updates. The library no longer gives me any error, even when I keep the categories that have only one observation. The ipfn.py file is updated with the recommended changes explained in the previous messages. So maybe I was doing something wrong last time.

In order to avoid _lumped_other (categories grouped together in a generic one) I also changed other parameters inside the library:

I still have a problem: I need to balance two variables in my dataset, ctrysize and ctrysect, but after the calibration only the first one is correctly balanced by the final weights.

[Two plots, "newplot" and "newplot2"; images not preserved]
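One way to quantify this per variable is to compare the weighted share of each level with its target share. The sketch below uses made-up weights and targets chosen so that the first margin matches and the second does not; `weighted_share` is a hypothetical helper, not part of balance:

```python
import pandas as pd


def weighted_share(frame, column, weight_col):
    """Weighted proportion of each level of `column`."""
    totals = frame.groupby(column)[weight_col].sum()
    return totals / totals.sum()


# Invented sample with final weights.
df = pd.DataFrame(
    {
        "ctrysize": ["DE1", "DE1", "IT1", "IT1"],
        "ctrysect": ["DEB", "DEC", "DEB", "DEC"],
        "weight": [1.0, 3.0, 2.0, 2.0],
    }
)

# Invented target proportions for each margin.
targets = {
    "ctrysize": pd.Series({"DE1": 0.5, "IT1": 0.5}),
    "ctrysect": pd.Series({"DEB": 0.5, "DEC": 0.5}),
}

# Largest absolute gap between weighted and target share, per variable.
gaps = {}
for col, target in targets.items():
    gaps[col] = float((weighted_share(df, col, "weight") - target).abs().max())
    print(col, round(gaps[col], 3))
```

A gap near zero for ctrysize and a clearly non-zero gap for ctrysect would confirm in numbers what the plots show.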