DataResponsibly / DataSynthesizer

MIT License

keyerror with higher degrees #21

Open lpkoh opened 4 years ago

lpkoh commented 4 years ago

Hi,

Thank you so much for this! It's been a life saver. I got your model to run on one of my datasets, but I ran into a problem with higher degrees. With k = 2 and k = 3 models on my dataset, the code ran without bugs at several epsilons up to 2.5, but with k = 4 and higher, for all epsilons, I get the following output:

    ================ Constructing Bayesian Network (BN) ================
    Adding ROOT accrued_holidays
    Adding attribute org
    Adding attribute office
    Adding attribute start_date
    Adding attribute bonus
    Adding attribute birth_date
    Adding attribute salary
    Adding attribute title
    Adding attribute gender
    ========================== BN constructed ==========================

But then the cell just freezes there until a KeyError (6, 5, 0, 0) occurs.

haoyueping commented 4 years ago

Hi, thanks for your feedback. Can you provide more information about the KeyError? For example, is it still raised when epsilon=0? Could you also share the complete error report from the Python interpreter, or any other information that may be helpful?

lpkoh commented 4 years ago

Hmm, very strange. When I try to recreate the error now, it seems to run. However, it takes several hours to build a single degree-4 Bayesian network. Is this normal?

Here is another error I faced when trying to recreate it, with epsilon = 1 and a Bayesian network of degree 5:

    ================ Constructing Bayesian Network (BN) ================
    Adding ROOT accrued_holidays
    Adding attribute org
    Adding attribute office
    Adding attribute birth_date
    Adding attribute bonus
    Adding attribute start_date
    Adding attribute salary
    Adding attribute title
    Adding attribute gender
    ========================== BN constructed ==========================

    TypeError                                 Traceback (most recent call last)

    in
          6     k=degree_of_bayesian_network,
          7     attribute_to_is_categorical=categorical_attributes,
    ----> 8     attribute_to_is_candidate_key=candidate_keys)
          9 describer.save_dataset_description_to_file(description_file + '_' + \
         10     str(epsilon) + '_' +\

    ~\Desktop\Tonic\CodeAndData\CTGAN_TGAN_PB_tests\DataSynthesizer-master/DataSynthesizer\DataDescriber.py in describe_dataset_in_correlated_attribute_mode(self, dataset_file, k, epsilon, attribute_to_datatype, attribute_to_is_categorical, attribute_to_is_candidate_key, categorical_attribute_domain_file, numerical_attribute_ranges, seed)
        178         self.data_description['bayesian_network'] = self.bayesian_network
        179         self.data_description['conditional_probabilities'] = construct_noisy_conditional_distributions(
    --> 180             self.bayesian_network, self.df_encoded, epsilon / 2)
        181
        182     def read_dataset_from_csv(self, file_name=None):

    ~\Desktop\Tonic\CodeAndData\CTGAN_TGAN_PB_tests\DataSynthesizer-master/DataSynthesizer\lib\PrivBayes.py in construct_noisy_conditional_distributions(bayesian_network, encoded_dataset, epsilon)
        271     else:
        272         for parents_instance in product(*stats.index.levels[:-1]):
    --> 273             dist = normalize_given_distribution(stats.loc[parents_instance]['count']).tolist()
        274             conditional_distributions[child][str(list(parents_instance))] = dist
        275

    ~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
       1760             except (KeyError, IndexError, AttributeError):
       1761                 pass
    -> 1762             return self._getitem_tuple(key)
       1763         else:
       1764             # we by definition only have the 0th axis

    ~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
       1270     def _getitem_tuple(self, tup: Tuple):
       1271         try:
    -> 1272             return self._getitem_lowerdim(tup)
       1273         except IndexingError:
       1274             pass

    ~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_lowerdim(self, tup)
       1419                     return section
       1420                 # This is an elided recursive call to iloc/loc/etc'
    -> 1421                 return getattr(section, self.name)[new_key]
       1422
       1423         raise IndexingError("not applicable")

    ~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
       1760             except (KeyError, IndexError, AttributeError):
       1761                 pass
    -> 1762             return self._getitem_tuple(key)
       1763         else:
       1764             # we by definition only have the 0th axis

    ~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
       1270     def _getitem_tuple(self, tup: Tuple):
       1271         try:
    -> 1272             return self._getitem_lowerdim(tup)
       1273         except IndexingError:
       1274             pass

    ~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_lowerdim(self, tup)
       1371         # we may have a nested tuples indexer here
       1372         if self._is_nested_tuple_indexer(tup):
    -> 1373             return self._getitem_nested_tuple(tup)
       1374
       1375         # we maybe be using a tuple to represent multiple dimensions here

    ~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_nested_tuple(self, tup)
       1451
       1452             current_ndim = obj.ndim
    -> 1453             obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
       1454             axis += 1
       1455

    ~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
       1962
       1963         # fall thru to straight lookup
    -> 1964         self._validate_key(key, axis)
       1965         return self._get_label(key, axis=axis)
       1966

    ~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _validate_key(self, key, axis)
       1829
       1830         if not is_list_like_indexer(key):
    -> 1831             self._convert_scalar_indexer(key, axis)
       1832
       1833     def _is_scalar_access(self, key: Tuple) -> bool:

    ~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexing.py in _convert_scalar_indexer(self, key, axis)
        739         ax = self.obj._get_axis(min(axis, self.ndim - 1))
        740         # a scalar
    --> 741         return ax._convert_scalar_indexer(key, kind=self.name)
        742
        743     def _convert_slice_indexer(self, key: slice, axis: int):

    ~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexes\base.py in _convert_scalar_indexer(self, key, kind)
       2885         elif kind in ["loc"] and is_integer(key):
       2886             if not self.holds_integer():
    -> 2887                 self._invalid_indexer("label", key)
       2888
       2889         return key

    ~\Anaconda3\envs\myenv\lib\site-packages\pandas\core\indexes\base.py in _invalid_indexer(self, form, key)
       3074         """
       3075         raise TypeError(
    -> 3076             f"cannot do {form} indexing on {type(self)} with these "
       3077             f"indexers [{key}] of {type(key)}"
       3078         )

    TypeError: cannot do label indexing on with these indexers [5] of
haoyueping commented 4 years ago

Hi, the error is traced to normalize_given_distribution in utils.py. Can you modify this function as follows, and let me know the input frequencies value that raises this error?

def normalize_given_distribution(frequencies):
    try:
        distribution = np.array(frequencies, dtype=float)
        distribution = distribution.clip(0)  # replace negative values with 0
        summation = distribution.sum()
        if summation > 0:
            if np.isinf(summation):
                # infinite total: keep only the infinite entries, weighted equally
                return normalize_given_distribution(np.isinf(distribution))
            else:
                return distribution / summation
        else:
            # all-zero input: fall back to a uniform distribution
            return np.full_like(distribution, 1 / distribution.size)
    except Exception:
        raise Exception(f'An error happens when frequencies={frequencies}')
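For reference, the function above behaves as follows on a few representative inputs (a self-contained sketch that simply reproduces the function with its numpy import):

```python
import numpy as np

def normalize_given_distribution(frequencies):
    distribution = np.array(frequencies, dtype=float)
    distribution = distribution.clip(0)  # negative (noisy) counts become 0
    summation = distribution.sum()
    if summation > 0:
        if np.isinf(summation):
            # infinite total: keep only the infinite entries, weighted equally
            return normalize_given_distribution(np.isinf(distribution))
        return distribution / summation
    # all-zero input: fall back to a uniform distribution
    return np.full_like(distribution, 1 / distribution.size)

print(normalize_given_distribution([2, -1, 2]))     # negatives are clipped first
print(normalize_given_distribution([0, 0]))         # uniform fallback
print(normalize_given_distribution([np.inf, 1.0]))  # inf entries dominate
```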
oregonpillow commented 4 years ago

I'm also getting key errors.

    ================ Constructing Bayesian Network (BN) ================
    Adding ROOT workclass
    Adding attribute race
    Adding attribute sex
    Adding attribute education
    Adding attribute capital-gain
    Adding attribute education-num
    Adding attribute marital-status
    Adding attribute occupation
    Adding attribute relationship
    Adding attribute age
    Adding attribute fnlwgt
    Adding attribute hours-per-week
    Adding attribute capital-loss
    Adding attribute native-country
    Adding attribute income
    ========================== BN constructed ==========================

    KeyError                                  Traceback (most recent call last)

    in ()
          4     k=degree_of_bayesian_network,
          5     attribute_to_is_categorical=categorical_attributes,
    ----> 6     attribute_to_is_candidate_key=candidate_keys)
          7 describer.save_dataset_description_to_file(description_file)

    11 frames
    /content/gdrive/My Drive/DataSynthesizer/DataSynthesizer/DataDescriber.py in describe_dataset_in_correlated_attribute_mode(self, dataset_file, k, epsilon, attribute_to_datatype, attribute_to_is_categorical, attribute_to_is_candidate_key, categorical_attribute_domain_file, numerical_attribute_ranges, seed)
        178         self.data_description['bayesian_network'] = self.bayesian_network
        179         self.data_description['conditional_probabilities'] = construct_noisy_conditional_distributions(
    --> 180             self.bayesian_network, self.df_encoded, epsilon / 2)
        181
        182     def read_dataset_from_csv(self, file_name=None):

    /content/gdrive/My Drive/DataSynthesizer/DataSynthesizer/lib/PrivBayes.py in construct_noisy_conditional_distributions(bayesian_network, encoded_dataset, epsilon)
        271     else:
        272         for parents_instance in product(*stats.index.levels[:-1]):
    --> 273             dist = normalize_given_distribution(stats.loc[parents_instance]['count']).tolist()
        274             conditional_distributions[child][str(list(parents_instance))] = dist
        275

    /usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in __getitem__(self, key)
       1760             except (KeyError, IndexError, AttributeError):
       1761                 pass
    -> 1762             return self._getitem_tuple(key)
       1763         else:
       1764             # we by definition only have the 0th axis

    /usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
       1270     def _getitem_tuple(self, tup: Tuple):
       1271         try:
    -> 1272             return self._getitem_lowerdim(tup)
       1273         except IndexingError:
       1274             pass

    /usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _getitem_lowerdim(self, tup)
       1378         # instead of checking it as multiindex representation (GH 13797)
       1379         if isinstance(ax0, ABCMultiIndex) and self.name != "iloc":
    -> 1380             result = self._handle_lowerdim_multi_index_axis0(tup)
       1381             if result is not None:
       1382                 return result

    /usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _handle_lowerdim_multi_index_axis0(self, tup)
       1358             # else IndexingError will be raised
       1359             if len(tup) <= self.obj.index.nlevels and len(tup) > self.ndim:
    -> 1360                 raise ek
       1361
       1362         return None

    /usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _handle_lowerdim_multi_index_axis0(self, tup)
       1350         try:
       1351             # fast path for series or for tup devoid of slices
    -> 1352             return self._get_label(tup, axis=axis)
       1353         except TypeError:
       1354             # slices are unhashable

    /usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _get_label(self, label, axis)
        623             raise IndexingError("no slices here, handle elsewhere")
        624
    --> 625         return self.obj._xs(label, axis=axis)
        626
        627     def _get_loc(self, key: int, axis: int):

    /usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
       3533         index = self.index
       3534         if isinstance(index, MultiIndex):
    -> 3535             loc, new_index = self.index.get_loc_level(key, drop_level=drop_level)
       3536         else:
       3537             loc = self.index.get_loc(key)

    /usr/local/lib/python3.6/dist-packages/pandas/core/indexes/multi.py in get_loc_level(self, key, level, drop_level)
       2816                     raise KeyError(key) from e
       2817             else:
    -> 2818                 return partial_selection(key)
       2819         else:
       2820             indexer = None

    /usr/local/lib/python3.6/dist-packages/pandas/core/indexes/multi.py in partial_selection(key, indexer)
       2803         def partial_selection(key, indexer=None):
       2804             if indexer is None:
    -> 2805                 indexer = self.get_loc(key)
       2806             ilevels = [
       2807                 i for i in range(len(key)) if key[i] != slice(None, None)

    /usr/local/lib/python3.6/dist-packages/pandas/core/indexes/multi.py in get_loc(self, key, method)
       2683
       2684         if start == stop:
    -> 2685             raise KeyError(key)
       2686
       2687         if not follow_key:

    KeyError: (3, 1, 12, 1, 1)
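The failing line in PrivBayes.py iterates over the Cartesian product of the parents' index levels, so `stats.loc` raises a KeyError for any parent combination that never occurs in the encoded data. A minimal pandas sketch of the same situation (the column names are made up for illustration):

```python
import pandas as pd
from itertools import product

# Toy encoded dataset: parent combination (1, 0) never occurs.
df = pd.DataFrame({'p1': [0, 0, 1], 'p2': [0, 1, 1], 'child': [0, 1, 0]})
stats = df.groupby(['p1', 'p2', 'child']).size().to_frame('count')

# Mirror the loop in construct_noisy_conditional_distributions:
# product over all level values includes combinations absent from the data.
missing = []
for parents_instance in product(*stats.index.levels[:-1]):
    try:
        stats.loc[parents_instance]['count']
    except KeyError:
        missing.append(parents_instance)  # combination absent from the data

print(missing)
```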
haoyueping commented 4 years ago

Please check out the latest code (commit 9f476eb00c492ad7af7da78fbc606bc776d11840), and see if this KeyError is fixed.

hamzanaeem1999 commented 3 years ago

@haoyueping How do I choose the value of k for constructing a Bayesian network? My CSV file contains 40 attributes.

haoyueping commented 3 years ago

@hamzanaeem1999 In theory, a higher value of k makes the Bayesian network more accurate, while a lower value of k reduces the time and space complexity. In practice, you can start with k = 1 and gradually increase k until you find a proper k.
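To get a feel for why large k is expensive: each attribute's conditional table has one entry per combination of its k parents' values, so its size grows exponentially in k. A back-of-envelope sketch (the value d = 20 distinct values per attribute is illustrative, not from the dataset above):

```python
# Rough cell count of one conditional probability table:
# d^k parent combinations, each holding a distribution over d child values.
def table_cells(d: int, k: int) -> int:
    return d ** k * d

# With 20 distinct values per attribute, the table grows 20x per unit of k.
for k in range(1, 6):
    print(k, table_cells(20, k))
```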

hamzanaeem1999 commented 3 years ago

Thanks! Kindly answer one more question: if my data is already in numerical form, is there still a need to specify categorical attributes? And what about epsilon, should I increase it too?


haoyueping commented 3 years ago

@hamzanaeem1999 DataSynthesizer works best for categorical attributes. When it handles numerical values, it uses histograms to model the distribution, so it won't be accurate within each bin of the histogram.
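A small numpy sketch of the histogram point: once numeric values are reduced to per-bin counts, any within-bin detail is gone (the values and bin count here are arbitrary):

```python
import numpy as np

values = np.array([1.0, 1.2, 1.4, 9.0, 9.5])
counts, edges = np.histogram(values, bins=2)

# Only the per-bin counts survive: 1.0, 1.2 and 1.4 are now
# indistinguishable because they fall into the same bin.
print(counts)  # counts == [3 2]
print(edges)   # edges == [1.0, 5.25, 9.5]
```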

A greater epsilon value corresponds to less noise. So you need to try different epsilon values to make a tradeoff between privacy and utility.
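To make the epsilon/noise relationship concrete: mechanisms of this kind inject Laplace noise whose scale is inversely proportional to epsilon (the exact sensitivity constant DataSynthesizer uses may differ; the 1.0 below is illustrative):

```python
def laplace_scale(sensitivity: float, epsilon: float) -> float:
    # Laplace mechanism: noise ~ Laplace(0, sensitivity / epsilon)
    return sensitivity / epsilon

# Increasing epsilon shrinks the noise scale proportionally.
for eps in [0.1, 1.0, 10.0]:
    print(eps, laplace_scale(1.0, eps))
```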

hamzanaeem1999 commented 3 years ago

Then why does your repo use only education as a categorical attribute, when other columns are categorical too?


haoyueping commented 3 years ago

@hamzanaeem1999 DataSynthesizer identifies categorical attributes by the parameter category_threshold. You don't need to explicitly specify each categorical attribute whose domain size is smaller than category_threshold.
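In other words, the rule is roughly "treat an attribute as categorical when its number of distinct values is at most category_threshold". A hypothetical sketch of that rule (not DataSynthesizer's actual implementation; the helper name and threshold are made up):

```python
import pandas as pd

# Hypothetical threshold-based detection: an attribute with few
# distinct values is treated as categorical.
def is_categorical(series: pd.Series, category_threshold: int = 20) -> bool:
    return series.nunique() <= category_threshold

df = pd.DataFrame({'education': ['HS', 'BS', 'MS', 'BS'],
                   'fnlwgt': [77516, 83311, 215646, 234721]})
for col in df.columns:
    print(col, is_categorical(df[col], category_threshold=3))
```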