DataResponsibly / DataSynthesizer

Dates before 1970-01-01 cause crash #10

Closed DrAndiLowe closed 6 years ago

DrAndiLowe commented 6 years ago

In DateTimeAttribute.py, line 65:

timestamps = self.data_dropna.map(lambda x: parse(x).timestamp())

timestamp() results in a crash for dates earlier than 1970:

Traceback (most recent call last):
  File "C:\Users\ANDREW~1\AppData\Local\Temp\RtmpuMDjSt\chunk-code-2b143b894699.txt", line 131, in <module>
    describer.describe_dataset_in_correlated_attribute_mode(input_data, epsilon = epsilon, k = degree_of_bayesian_network, attribute_to_is_categorical = categorical_attributes, attribute_to_is_candidate_key = candidate_keys)
  File ".\DataSynthesizer\DataDescriber.py", line 123, in describe_dataset_in_correlated_attribute_mode
    seed)
  File ".\DataSynthesizer\DataDescriber.py", line 88, in describe_dataset_in_independent_attribute_mode
    self.infer_domains()
  File ".\DataSynthesizer\DataDescriber.py", line 242, in infer_domains
    column.infer_domain(self.input_dataset[column.name])
  File ".\DataSynthesizer\datatypes\DateTimeAttribute.py", line 56, in infer_domain
    timestamps = self.data_dropna.map(lambda x: parse(x).timestamp())
  File "C:\Users\Andrew_Lowe\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\series.py", line 2354, in map
    new_values = map_f(values, arg)
  File "pandas/_libs/src/inference.pyx", line 1521, in pandas._libs.lib.map_infer
  File ".\DataSynthesizer\datatypes\DateTimeAttribute.py", line 56, in <lambda>
    timestamps = self.data_dropna.map(lambda x: parse(x).timestamp())
OSError: [Errno 22] Invalid argument

This is apparently a known Python bug: see this Stack Overflow post.

If the timestamp is out of the range of values supported by the platform C localtime() or gmtime() functions, datetime.fromtimestamp() may raise an exception like you're seeing. On Windows platform, this range can sometimes be restricted to years in 1970 through 2038. I have never seen this problem on a Linux system.

The same problem seems to occur with timestamp(); I tried this from a Python command prompt:

>>> from dateutil.parser import parse
>>> parse('19/04/1979').timestamp()
293320800.0
>>> parse('19/04/1969').timestamp()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument

If you can't reproduce this, it may be because you're on Linux: the SO post hints that Windows systems are affected, but not Linux.

Is there a way to replace the translation from dates to timestamps, and vice versa, with code that works for dates earlier than 1970-01-01?

DrAndiLowe commented 6 years ago

Also, see here for info on this bug: https://bugs.python.org/issue29097

DrAndiLowe commented 6 years ago

I'm not too sure how datetime values are treated. I tried a workaround of converting all dates to Unix timestamps, negative values included, but the results were terrible: converting back to dates gives me a much smaller range of dates than in the input data. So that's not a solution. Somehow I need dates before 1970 to be treated properly. Any ideas?

haoyueping commented 6 years ago

As a workaround, if the datetime values are dates, you can first convert them to integers. After generating the synthetic dataset, convert the integers back to dates.

>>> from dateutil.parser import parse
>>> date0 = parse('01/01/1970')
>>> date1 = parse('19/04/1979')
>>> date2 = parse('19/04/1969')
>>> date3 = parse('06/08/2018')
>>> (date1-date0).days
3395
>>> (date2-date0).days
-257
>>> (date3-date0).days
17690
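
The round trip described above can be sketched as a pair of helpers. This is only a minimal sketch using the standard library: the names `EPOCH`, `date_to_days` and `days_to_date` are hypothetical and not part of DataSynthesizer, and the reference date is assumed to be 1970-01-01 as in `date0`.

```python
from datetime import date, timedelta

# Assumed reference date, matching date0 in the session above.
EPOCH = date(1970, 1, 1)


def date_to_days(d, epoch=EPOCH):
    """Return the signed number of days from epoch to d (negative before epoch)."""
    return (d - epoch).days


def days_to_date(n, epoch=EPOCH):
    """Invert date_to_days; n is truncated to an int because synthetic values may be floats."""
    return epoch + timedelta(days=int(n))


# Convert before synthesis ...
days = [date_to_days(d) for d in (date(1979, 4, 19), date(1969, 4, 19))]
# ... and convert the synthetic integers back afterwards.
restored = [days_to_date(n) for n in days]
```

Only date subtraction and `timedelta` addition are involved, so nothing here calls `timestamp()`. Note that dateutil reads `'06/08/2018'` month-first by default, so the 17690 above corresponds to 8 June 2018.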

DrAndiLowe commented 6 years ago

With regard to my previous comment about the distributions of datetime attributes being synthesised poorly: implementing #11 resolves this issue after converting to integers as you suggested in your reply. That is, converting to integers wasn't the source of the behaviour I saw. Replacing timestamp() with a simple count of seconds from a user-defined epoch start will probably be sufficient to close this issue.
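
A count of seconds from a user-defined epoch could look roughly like the sketch below. It is not the fix that was applied in DataSynthesizer; `EPOCH`, `to_seconds` and `from_seconds` are hypothetical names, and naive (timezone-unaware) datetimes are assumed throughout.

```python
from datetime import datetime, timedelta

# Hypothetical user-defined epoch start; any naive datetime would do.
EPOCH = datetime(1970, 1, 1)


def to_seconds(dt, epoch=EPOCH):
    """Seconds from epoch to dt, computed by plain timedelta arithmetic."""
    # No call to datetime.timestamp(), so the platform's C time functions are not used.
    return (dt - epoch).total_seconds()


def from_seconds(seconds, epoch=EPOCH):
    """Invert to_seconds without going through datetime.fromtimestamp()."""
    return epoch + timedelta(seconds=seconds)


before_1970 = to_seconds(datetime(1969, 4, 19))  # negative for dates before the epoch
round_trip = from_seconds(before_1970)
```

Because the inputs are treated as naive datetimes, the values differ from those of `timestamp()` by the local UTC offset: `to_seconds(datetime(1979, 4, 19))` gives 293328000.0, whereas the session above returned 293320800.0 on a machine two hours ahead of UTC. The conversion stays consistent as long as the same epoch is used in both directions.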