DataResponsibly / DataSynthesizer

MIT License

describe_dataset_in_correlated_attribute_mode doesn't work in Python 3.11 #40

Open artemgur opened 1 year ago

artemgur commented 1 year ago

Description

In Python 3.11, describe_dataset_in_correlated_attribute_mode raises a ValueError. In Python 3.10, the same code with the same dependency versions works correctly.

At the same time, describe_dataset_in_independent_attribute_mode and describe_dataset_in_random_mode work correctly in Python 3.11.

The Pandas version is 1.5.3 rather than the latest 2.0.3, because describe_dataset_in_correlated_attribute_mode also fails with Pandas 2.0.3 (I will open a separate issue for that later).

What I Did

from DataSynthesizer.DataDescriber import DataDescriber

describer = DataDescriber()
# input_data and description_file are file paths defined earlier in the notebook (not shown)
describer.describe_dataset_in_correlated_attribute_mode(dataset_file=input_data, k=2, epsilon=0)
describer.save_dataset_description_to_file(description_file)

When the code is run, the following happens:
1) "================ Constructing Bayesian Network (BN) ================" is printed (at least in Jupyter Notebook).
2) The following exception is raised: "ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."

Traceback:

ValueError                                Traceback (most recent call last)
Cell In[22], line 8
      6 describer = DataDescriber()
      7 #TODO k parameter
----> 8 describer.describe_dataset_in_correlated_attribute_mode(dataset_file=input_data,
      9                                                         k=2,
     10                                                         epsilon=0)
     11                                                         #seed=random_state,
     12                                                         #attribute_to_is_categorical=categorical_attributes)
     13 describer.save_dataset_description_to_file(description_file)

File ~\.virtualenvs\DataSynthesizerTest311\Lib\site-packages\DataSynthesizer\DataDescriber.py:177, in DataDescriber.describe_dataset_in_correlated_attribute_mode(self, dataset_file, k, epsilon, attribute_to_datatype, attribute_to_is_categorical, attribute_to_is_candidate_key, categorical_attribute_domain_file, numerical_attribute_ranges, seed)
    174 if self.df_encoded.shape[1] < 2:
    175     raise Exception("Correlated Attribute Mode requires at least 2 attributes(i.e., columns) in dataset.")
--> 177 self.bayesian_network = greedy_bayes(self.df_encoded, k, epsilon / 2, seed=seed)
    178 self.data_description['bayesian_network'] = self.bayesian_network
    179 self.data_description['conditional_probabilities'] = construct_noisy_conditional_distributions(
    180     self.bayesian_network, self.df_encoded, epsilon / 2)

File ~\.virtualenvs\DataSynthesizerTest311\Lib\site-packages\DataSynthesizer\lib\PrivBayes.py:145, in greedy_bayes(dataset, k, epsilon, seed)
    142 attr_to_is_binary = {attr: dataset[attr].unique().size <= 2 for attr in dataset}
    144 print('================ Constructing Bayesian Network (BN) ================')
--> 145 root_attribute = random.choice(dataset.columns)
    146 V = [root_attribute]
    147 rest_attributes = list(dataset.columns)

File C:\Python311\Lib\random.py:369, in Random.choice(self, seq)
    367 def choice(self, seq):
    368     """Choose a random element from a non-empty sequence."""
--> 369     if not seq:
    370         raise IndexError('Cannot choose from an empty sequence')
    371     return seq[self._randbelow(len(seq))]

File ~\.virtualenvs\DataSynthesizerTest311\Lib\site-packages\pandas\core\indexes\base.py:3188, in Index.__nonzero__(self)
   3186 @final
   3187 def __nonzero__(self) -> NoReturn:
-> 3188     raise ValueError(
   3189         f"The truth value of a {type(self).__name__} is ambiguous. "
   3190         "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   3191     )

ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
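
Note on the traceback: the last two frames show random.choice evaluating `if not seq:`, which asks the pandas Index for a truth value, and Index responds by raising ValueError. Below is a minimal sketch of one possible workaround, assuming it is acceptable to copy the column labels into a plain list before choosing. `AmbiguousIndex` and `pick_root_attribute` are hypothetical names used only for illustration. `AmbiguousIndex` is a pandas-free stand-in for dataset.columns, the column labels are made up, and none of this is taken from the DataSynthesizer source.

```python
import random


class AmbiguousIndex:
    """Hypothetical stand-in for pandas.Index: sized and indexable,
    but any truth test raises ValueError."""

    def __init__(self, labels):
        self._labels = list(labels)

    def __len__(self):
        return len(self._labels)

    def __getitem__(self, position):
        return self._labels[position]

    def __iter__(self):
        return iter(self._labels)

    def __bool__(self):
        raise ValueError("The truth value of a AmbiguousIndex is ambiguous.")


def pick_root_attribute(columns, rng=random):
    """Choose one label without truth-testing `columns` itself."""
    labels = list(columns)  # a plain list can safely be truth-tested
    return rng.choice(labels)


# Made-up column labels, for illustration only.
columns = AmbiguousIndex(["age", "education", "income"])

# Truth-testing the stand-in fails the same way `not seq` fails in the traceback.
try:
    bool(columns)
    truth_test_raised = False
except ValueError:
    truth_test_raised = True

# Converting to a list first sidesteps the truth test entirely.
root_attribute = pick_root_attribute(columns)
print(truth_test_raised, root_attribute)
```

In greedy_bayes the equivalent change would presumably be `random.choice(list(dataset.columns))`, though that has not been tested here against DataSynthesizer.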

haoyueping commented 1 year ago

Hi @artemgur, I cannot reproduce this error. In your error message, line 145 only raises an error when dataset.columns is empty, i.e., when there are no categorical or numerical columns in the input dataset.

--> 145 root_attribute = random.choice(dataset.columns)

Please double-check whether this is the case.

DataSynthesizer has just been updated to 0.1.12. Please feel free to test it out.