DataResponsibly / DataSynthesizer


Recommendations for categorical variables with many distinct values? #9

Open DrAndiLowe opened 6 years ago

DrAndiLowe commented 6 years ago

I have data that contains a dozen or so columns with categorical variables that have many distinct values. When I try to run in correlated attribute mode, line 162 in PrivBayes.py raises a MemoryError: I run out of RAM. Here's the line:

```python
full_space = pd.DataFrame(columns=attributes, data=list(product(*stats.index.levels)))
```

Without knowing much about the workings of the code, it looks like it's taking the cartesian product over all the columns in my data so that DataDescriber can learn the joint distributions of the subsets of (categorical) attributes selected by the Bayesian network finding routine. Some of those columns have thousands of distinct values, so the line above tries to generate massive tables that max out RAM. That's just a guess.
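For illustration, here's a minimal sketch of why that product explodes (the domain sizes below are hypothetical, not from my actual data):

```python
from math import prod

# Hypothetical domain sizes for a few categorical attributes. The
# full_space table has one row per combination of values, i.e. the
# product of all the domain sizes.
domain_sizes = {
    'PlaceOfBirth': 5000,  # (city, country) combinations
    'Citizenship': 200,
    'Passport': 200,
}
print(f'{prod(domain_sizes.values()):,} rows')  # 200,000,000 rows
```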

Other than running in independent attribute mode, do you have any recommendations for how to proceed when the data contains many categorical variables with many distinct values? It's often not possible to merge values by binning them into a smaller number of distinct values. For example, my PlaceOfBirth column contains (city, country) combinations that result in a huge number of values; it's neither meaningful to separate the city and country information nor to bin the data, and I'd like to retain the relationships to other columns, such as Passport or Citizenship, because they are meaningful. This is just one instance of a general problem I've encountered; one might often see data like this.

Is there a technical workaround, or is there a best practice you can recommend for dealing with this situation? How can I use your software optimally? Is anything documented?

haoyueping commented 6 years ago

Thanks for giving us your feedback! This MemoryError is raised while constructing a large table of conditional distributions for some attribute. It can probably be avoided by setting a very small value for degree_of_bayesian_network. May I ask what value you currently use for this parameter? Setting it to 1 or 2 will probably avoid the out-of-memory error.
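For reference, this is the k argument of describe_dataset_in_correlated_attribute_mode; a minimal sketch along the lines of the README example (the file paths and category threshold here are placeholders):

```python
from DataSynthesizer.DataDescriber import DataDescriber

describer = DataDescriber(category_threshold=20)  # placeholder threshold
describer.describe_dataset_in_correlated_attribute_mode(
    dataset_file='input_data.csv',  # placeholder path
    epsilon=1.0,                    # differential privacy budget
    k=1,                            # degree_of_bayesian_network
)
describer.save_dataset_description_to_file('description.json')
```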

DrAndiLowe commented 6 years ago

My current value for degree_of_bayesian_network is 2. Changing it to 1 probably isn't going to help: I realise that I just have too many distinct values in my categorical variables to use correlated attribute mode. I'll have to compromise and use independent attribute mode for these columns only.

haoyueping commented 6 years ago

The code does not generate one big table over all categorical attributes. Instead, it generates a conditional probability table for each parents-child attribute combination.

Assume all categorical attributes in your dataset have the same domain size, say 200. When an attribute has two parents, its conditional probability table has 200^3 = 8M rows. If each attribute has only one parent, the table has only 200^2 = 40K rows.
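As a quick back-of-the-envelope check:

```python
# Rows in the conditional probability table of one attribute with k
# parents, assuming every attribute has the same domain size.
domain_size = 200

for k in (1, 2):
    rows = domain_size ** (k + 1)  # the child attribute plus k parents
    print(f'k={k}: {rows:,} rows')
# k=1: 40,000 rows
# k=2: 8,000,000 rows
```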

So degree_of_bayesian_network = 1 will probably work. If so, it will be much better than independent attribute mode.

DrAndiLowe commented 6 years ago

Thanks! I'll try it, but it turns out that for some categorical columns the number of distinct values is almost equal to the number of rows. I'm testing on 10k rows, but in production the full data has 3M rows.
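In case it helps anyone hitting the same problem, this is a quick way to spot the problematic columns up front (a sketch; the path is a placeholder):

```python
import pandas as pd

df = pd.read_csv('input_data.csv')  # placeholder path

# Ratio of distinct values to rows: values near 1.0 mean the column is
# effectively unique and a poor candidate for correlated attribute mode.
cardinality = (df.nunique() / len(df)).sort_values(ascending=False)
print(cardinality.head(10))
```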