almost-matching-exactly / DAME-FLAME-Python-Package

A Python Package providing two algorithms, DAME and FLAME, for fast and interpretable treatment-control matches of categorical data
https://almost-matching-exactly.github.io/DAME-FLAME-Python-Package/
MIT License

Parallelize the for loop for possible drops in flame_algorithm.py and fix an overflow issue in flame_grouped_by.py #69

Open wtc100 opened 6 months ago

nehargupta commented 6 months ago

Hi @wtc100, thanks so much for all your work on this!

I'm interested (and a bit concerned) to see that you mention an overflow issue. Can I ask a bit about your suggested change to flame_group_by? I see you calculate the required precision in each operation -- wouldn't this be costly in time? I also see you then select the data type as int64 or Decimal based on size -- could we instead always use the larger one? My concern is just whether introducing extra operations/if statements for the size check might degrade time performance, but I'm not sure whether you have a reason to preserve space when possible? Also -- I noticed you changed my plain Python list to np.arange, which I thought defaulted to a particular data type. I think that's why I stopped using it in the first place, to avoid overflow.
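
For concreteness, here is a tiny illustration (not taken from the package code) of the distinction I have in mind: a plain Python list holds arbitrary-precision ints, while np.arange and other NumPy integer containers commit to a fixed-width dtype that can wrap around:

```python
import numpy as np

# Python ints are arbitrary precision, so values like 4**40 are exact:
exact = 4 ** 40
print(exact)          # 1208925819614629174706176 (well above 2**63 - 1)

# np.arange defaults to a fixed-width integer dtype (int64 on most platforms):
idx = np.arange(3)
print(idx.dtype)      # int64 (platform dependent)

# Elementwise arithmetic in that dtype silently wraps once it exceeds int64:
a = np.full(2, 4, dtype=np.int64)
print(a ** 40)        # wrapped values, not 4**40
```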

Secondly, did you have the chance to check out our database version of these algorithms? Those are much, much faster and wouldn't require the parallelization changes. They're not yet integrated with the main branch due to documentation needs, but they're very well tested for accuracy. The code is here: https://github.com/almost-matching-exactly/DAME-FLAME-Python-Package/tree/2d941bcfa76d7bcd33d58cbf4657202e62cc5b0c (flame_db folder) and the documentation is here: https://github.com/almost-matching-exactly/DAME-FLAME-Python-Package?tab=readme-ov-file#a-tutorial-to-flame-database-version

Thanks again, happy for your interest and engagement with the package!

wtc100 commented 5 months ago

Hi @nehargupta,

Thank you for the response!

The overflow issue comes up when computing the bit vectors with many covariates. For example, with a 64-bit integer data type, a dataset of h-ary categorical data (say h = 4) with 40 covariates can produce bit vectors as large as 4^40 ≈ 1.2 × 10^24, well beyond the int64 maximum of about 9.2 × 10^18. The overflow then causes problems for the pandas groupby in grouped_mr.py: when it happens, groupby('b_i') followed by aggregation somehow gives different results on every run. To handle this, I introduced the Decimal module, with a small overhead for computing the required precision. The extra computations should take essentially no time since they are just simple arithmetic. However, if we always used the larger type, then datasets with fewer covariates, where no overflow happens, would be slower, because computation in Decimal is slower than in int64.
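
To make the idea concrete, here is a simplified sketch of the int64-vs-Decimal selection (not the actual flame_group_by.py code; the function name and details are illustrative):

```python
from decimal import Decimal, getcontext

def encode_unit(values, h):
    """Illustrative only: encode one unit's h-ary covariate values as a single
    base-h number, using plain integer arithmetic when the result fits in the
    int64 range and falling back to Decimal when it would overflow."""
    num_covs = len(values)
    max_value = h ** num_covs - 1          # largest possible encoded value

    if max_value <= 2**63 - 1:
        # Fits in int64, so cheap integer arithmetic is fine.
        return sum(int(v) * h ** i for i, v in enumerate(values))

    # Would overflow int64: set a safe upper bound on the number of decimal
    # digits needed (h**num_covs has at most num_covs * digits(h) digits),
    # then switch to Decimal.
    getcontext().prec = num_covs * len(str(h)) + 2
    return sum((Decimal(int(v)) * Decimal(h) ** i for i, v in enumerate(values)),
               Decimal(0))

# With h = 4 and 40 covariates, the encoded value no longer fits in int64,
# so the Decimal branch is taken:
print(encode_unit([3] * 40, h=4))
```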

Since the overflow issue is not caused by the np.arange function, I believe it is not a concern. But please feel free to fall back to the original array.

I have only tried the main version, not the database version. For now, I am satisfied with the performance of the pandas + Decimal + parallel version. I will leave it to you to decide whether to merge the PR, and how much of it to merge. In the meantime, I am interested in trying out the database version when I have the resources, and I look forward to its integration into the main branch!
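
For reference, one way to picture the parallelization is the sketch below. It is not the exact PR code: joblib and the placeholder scoring function are just for illustration. The point is that the iterations of the loop over candidate covariate drops are independent, so they can be scored on separate workers and the best drop kept.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def evaluate_drop(cov, df, cur_covs):
    """Placeholder for the per-drop work: score the covariate set obtained by
    dropping `cov`.  Here the score is simply the number of units that match
    exactly on the remaining covariates, standing in for the real
    match-quality computation in flame_algorithm.py."""
    remaining = [c for c in cur_covs if c != cov]
    sizes = df.groupby(remaining).size()
    return cov, int(sizes[sizes > 1].sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cur_covs = [f"x{i}" for i in range(6)]
    df = pd.DataFrame(rng.integers(0, 4, size=(500, 6)), columns=cur_covs)

    # The serial `for cov in cur_covs:` loop becomes a parallel map;
    # each candidate drop is evaluated independently on its own worker.
    results = Parallel(n_jobs=-1)(
        delayed(evaluate_drop)(cov, df, cur_covs) for cov in cur_covs
    )
    best_cov, best_score = max(results, key=lambda r: r[1])
    print(best_cov, best_score)
```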

Thanks again for providing such a useful package!