Numpy error when synthesising data with unique identifiers

raids commented 4 years ago

DataSynthesizer version: 0.1.0
Python version: Python 3.8.2
Operating System: MacOS with pyenv

Description

I have a CSV with ~20 columns, 3 of which are unique identifiers. DataSynthesizer seems to trip up on these 3 columns with the error below. What's the expected behaviour when including UUIDs (or similar) in the synthesis? The fields are not labelled as categorical and are of datatype String.

What I Did

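from DataSynthesizer.DataDescriber import DataDescriber
from DataSynthesizer.DataGenerator import DataGenerator

describer = DataDescriber()
generator = DataGenerator()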

describer.describe_dataset_in_random_mode(
    dataset_file='input.csv',
    attribute_to_datatype=attribute_to_datatype,
    attribute_to_is_categorical=attribute_is_categorical,
    )
describer.save_dataset_description_to_file('description.csv')
generator.generate_dataset_in_random_mode(
    n,
    'output.csv',
    )

Traceback (most recent call last):
  File "synthesise/synthesise.py", line 106, in <module>
    main()
  File "synthesise/synthesise.py", line 86, in main
    generator.generate_dataset_in_correlated_attribute_mode(
  File "/Users/raids/.pyenv/versions/data-synthesizer/lib/python3.8/site-packages/DataSynthesizer/DataGenerator.py", line 72, in generate_dataset_in_correlated_attribute_mode
    self.synthetic_dataset[attr] = column.generate_values_as_candidate_key(n)
  File "/Users/raids/.pyenv/versions/data-synthesizer/lib/python3.8/site-packages/DataSynthesizer/datatypes/StringAttribute.py", line 52, in generate_values_as_candidate_key
    length = np.random.randint(self.min, self.max)
  File "mtrand.pyx", line 745, in numpy.random.mtrand.RandomState.randint
  File "_bounded_integers.pyx", line 1254, in numpy.random._bounded_integers._rand_int64
ValueError: low >= high

Let me know if you need further info or want me to try anything out.

Thanks

haoyueping commented 4 years ago

Thanks for your feedback. This bug should be fixed by commit 9c3fb3508cddfc1cbbfeee76eccf41a672d92fad, which is now in the latest release, DataSynthesizer 0.1.1.
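
For readers who land here with the same traceback, the failure can probably be reproduced outside the library. np.random.randint(low, high) samples from the half-open interval [low, high), so it raises when low equals high. That is presumably what happens when every value in a string column has the same length, as UUIDs do. The sketch below is only an illustration under that assumption: safe_random_length is a hypothetical helper, and it is not necessarily how the commit above fixes the problem.

```python
import uuid

import numpy as np

# Every canonical UUID string is 36 characters long, so the observed
# minimum and maximum lengths of such a column are identical.
lengths = [len(str(uuid.uuid4())) for _ in range(5)]
min_len, max_len = min(lengths), max(lengths)

# Same call shape as in the traceback: fails because low == high.
try:
    np.random.randint(min_len, max_len)
    error_message = None
except ValueError as exc:
    error_message = str(exc)


def safe_random_length(low, high):
    """Hypothetical helper: pick a length in the closed interval [low, high].

    Adding 1 to the upper bound makes it inclusive, so low == high is valid.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    return int(np.random.randint(low, high + 1))


fixed_length = safe_random_length(min_len, max_len)
```

With identical bounds the helper simply returns that length, while columns with varying lengths still get a random length within their observed range.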

raids commented 4 years ago

Looking good, thanks!

DataResponsibly / DataSynthesizer

Numpy error when synthesising data with unique identifiers #23
