diffix / syndiffix

Python implementation of the SynDiffix synthetic data generation mechanism.

math.sqrt failure in _random_normal() #136

Closed. yoid2000 closed this 2 months ago

yoid2000 commented 7 months ago

I'm getting a failure at https://github.com/diffix/syndiffix/blob/4dcdbfa375d7c8bb707cf17335a8741845a60c80/syndiffix/anonymizer.py#L44

The problem is that u1 = 0, which causes math.log(u1) to raise a ValueError (math domain error).

u1 is 0 because the seed at https://github.com/diffix/syndiffix/blob/4dcdbfa375d7c8bb707cf17335a8741845a60c80/syndiffix/anonymizer.py#L42

is 81,486,552,638,685,184 = 0x1217F9680000000.

Since the trailing 31 bits are zero, this causes u1 to be 0.
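
The arithmetic can be checked with a minimal sketch. It assumes, as the description above implies, that u1 is the low 31 bits of the seed scaled into [0, 1]. The mask and the helper name u1_from_seed are illustrative and are not copied from anonymizer.py.

```python
import math

SEED = 0x1217F9680000000  # seed value reported in this issue
MASK_31 = 0x7FFFFFFF      # assumption: u1 is taken from the low 31 bits


def u1_from_seed(seed: int) -> float:
    """Hypothetical helper: scale the low 31 bits of the seed into [0, 1]."""
    return (seed & MASK_31) / MASK_31


# Count trailing zero bits: isolate the lowest set bit, then take its index.
trailing_zero_bits = (SEED & -SEED).bit_length() - 1

u1 = u1_from_seed(SEED)

try:
    math.log(u1)
    log_failed = False
except ValueError:
    # 0.0 is outside the domain of log, hence "math domain error".
    log_failed = True

print(trailing_zero_bits, u1, log_failed)
```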

This seems too improbable to be happening by chance (2^31 is about 2.1 billion), so there might be some upstream cause, but I didn't see anything obvious.
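
For a rough sense of scale, the sketch below computes the chance of seeing at least one all-zero 31-bit value among n draws. It assumes the draws are independent and uniformly distributed, and the draw counts are made-up illustrations, not measurements from SynDiffix.

```python
import math

P_ZERO = 2.0 ** -31  # chance that 31 uniformly random bits are all zero


def prob_at_least_one_zero(n_draws: int) -> float:
    """P(at least one all-zero draw) = 1 - (1 - p)^n, computed stably."""
    return -math.expm1(n_draws * math.log1p(-P_ZERO))


for n in (10**6, 10**8, 2**31):  # hypothetical draw counts
    print(f"{n:>12} draws: {prob_at_least_one_zero(n):.6f}")
```

Under these assumptions a single draw almost never hits the value, but the chance grows roughly linearly with the number of noise samples a run generates.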

My inclination is to just add a check for the failing value and replace it with something safe. But maybe @cristianberneanu you have a better idea?

Here is the dump:

Traceback (most recent call last):
  File "/INS/syndiffix/work/paul/syndiffix-paper-tests/buildSdxDataset.py", line 24, in <module>
    tm.synthesize(columns=job['cols'], also_save_stats=True)
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix_tools/tables_manager.py", line 133, in synthesize
    syn = Synthesizer(self.df_orig[columns], pids=df_pid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/synthesizer.py", line 100, in __init__
    self.clusters, self.entropy_1dim = clustering.build_clusters(self.forest)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/clustering/strategy.py", line 89, in build_clusters
    clustering_context = _clustering_context(main_column=main_column, forest=sampled_forest)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/clustering/strategy.py", line 10, in _clustering_context
    scores = measures.measure_all(forest)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/clustering/measures.py", line 186, in measure_all
    score = measure_dependence(forest, col_x, col_y)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/clustering/measures.py", line 160, in measure_dependence
    walk(root_xy, root_x, root_y)
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/clustering/measures.py", line 152, in walk
    walk(child_xy, child_x, child_y)
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/clustering/measures.py", line 152, in walk
    walk(child_xy, child_x, child_y)
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/clustering/measures.py", line 152, in walk
    walk(child_xy, child_x, child_y)
  [Previous line repeated 4 more times]
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/clustering/measures.py", line 98, in walk
    actual_2dim_count = node_xy.noisy_count() if node_xy else 0.0
                        ^^^^^^^^^^^^^^^^^^^^^
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/tree.py", line 121, in noisy_count
    self._noisy_count_cache = max(row_counter.noisy_count(anon_context), min_count)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/counters.py", line 74, in noisy_count
    result = count_multiple_contributions(context, self.contributions_list)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/anonymizer.py", line 268, in count_multiple_contributions
    flattened_contributions = [_flatten_contributions(contributions, context) for contributions in contributions_list]
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/anonymizer.py", line 268, in <listcomp>
    flattened_contributions = [_flatten_contributions(contributions, context) for contributions in contributions_list]
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/anonymizer.py", line 170, in _flatten_contributions
    noise = _generate_noise(anon_params.salt, "noise", noise_sd, (context.bucket_seed, pid_seed))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/anonymizer.py", line 73, in _generate_noise
    noise += _random_normal(sd, _mix_seed(step_name, _crypto_hash_salted_seed(salt, layer_seed)))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/INS/syndiffix/work/paul/sdx_tests/sdx_venv/lib/python3.11/site-packages/syndiffix/anonymizer.py", line 44, in _random_normal
    normal = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
                              ^^^^^^^^^^^^
ValueError: math domain error

cristianberneanu commented 7 months ago

Adding a check there like

u1 = max(u1, sys.float_info.epsilon)

makes most sense to me. Even if the probability of that specific value happening is low, that case still needs to be handled.
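
A minimal sketch of how that clamp could sit inside a Box-Muller transform. The way u1 and u2 are carved out of the seed is an assumption made for illustration, and random_normal_guarded is a hypothetical name, not the function in anonymizer.py.

```python
import math
import sys

MASK_31 = 0x7FFFFFFF  # assumption: each uniform value uses 31 bits of the seed


def random_normal_guarded(sd: float, seed: int) -> float:
    """Hypothetical Box-Muller draw that tolerates u1 == 0."""
    u1 = (seed & MASK_31) / MASK_31
    u2 = ((seed >> 32) & MASK_31) / MASK_31
    # Clamp u1 away from zero so math.log never receives 0.0.
    u1 = max(u1, sys.float_info.epsilon)
    normal = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
    return sd * normal


# The seed from this issue now yields a finite value instead of raising.
noise = random_normal_guarded(1.0, 0x1217F9680000000)
print(noise)
```

With the clamp, the square-root term is bounded by sqrt(-2 * ln(epsilon)), roughly 8.5, so a zero u1 maps to a finite (if extreme) sample scaled by sd.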

Right now, I don't see a reason why such values should be suspicious. The hash itself is not empty, so maybe there weren't enough bytes in the seed materials to produce anything more complex than that.

yoid2000 commented 7 months ago

Ok. That is what I had in mind (except that I didn't know about sys.float_info.epsilon).

Thanks.

yoid2000 commented 2 months ago

Closed by #141