MIT-LCP / wfdb-python

Native Python WFDB package
MIT License
738 stars 300 forks source link

fixing rounding errors in to_dataframe in wfdb records #456

Open lamawmouk opened 1 year ago

lamawmouk commented 1 year ago

This pull request fixes issue #444.

This PR refactors the index generation logic in the to_dataframe method in BaseRecord class

The current code uses the freq argument in the pd.date_range and pd.timedelta_range functions to generate the index for the DataFrame. However, using end argument improves accuracy of the time without being off by 0.287sec

>>>r = wfdb.rdrecord('81739927', pn_dir='mimic4wdb/0.1.0/waves/p100/p10014354/81739927')`
>>> str(r.base_datetime)
'2148-08-16 09:00:17.566000'
>>> r.to_dataframe()
                                I     II    III      V  aVR     Pleth      Resp
2148-08-16 09:00:17.566000000 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.582007043 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.598014086 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.614021129 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.630028172 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
...                            ..    ...    ...    ...  ...       ...       ...
2148-08-17 14:37:22.320232945 NaN -0.220 -0.285 -0.025  NaN  0.404297  0.487477
2148-08-17 14:37:22.336239988 NaN -0.030  0.005  0.025  NaN  0.396484  0.530238
2148-08-17 14:37:22.352247031 NaN -0.065 -0.030 -0.015  NaN  0.386475  0.574832
2148-08-17 14:37:22.368254074 NaN -0.265 -0.255 -0.125  NaN  0.375977  0.621258
2148-08-17 14:37:22.384261117 NaN -0.550 -0.610 -0.355  NaN  0.366211  0.664020
[6661120 rows x 7 columns]
>>> str(r.get_absolute_time(6661119)
'2148-08-17 14:37:22.384920'

The get_absolute_time is correct to the nearest microsecond 14:37:22.384920 and the to_dataframe() is correct to the nearest nanosecond 14:37:22.384261117 Please review the PR at your convenience.

bemoody commented 1 year ago

Thanks, Lama!

I have to nitpick here, though.

It looks like what this is actually doing is rounding the sample interval to the nearest nanosecond and then using fixed N-nanosecond steps (whereas previously it was rounding to the nearest microsecond.)

So this code is much more accurate than before but still has the same fundamental problem.

Here's how I figure the timestamp:

$ bc -l
9 * 3600 + 17.566 + 6661119 / 62.4725
139042.38492032494297490895
last - (24 + 14) * 3600 - 37 * 60
22.38492032494297490895

so the end time in nanseconds should be 14:37:22.384920325.

(On the other hand, 9 3600 + 17.566 + 6661119 0.016007043 - (24 + 14) 3600 - 37 60 = 22.384261117 exactly.)

I think this is a bug in pandas. But still this is an improvement and good enough for many practical purposes.

bemoody commented 1 year ago

This also fails in the case where base_datetime is None. Try it out with a record that doesn't have a base time, such as one of the records from mitdb. If base_datetime is None then we would want to use get_elapsed_time instead of get_absolute_time.

lamawmouk commented 1 year ago

Thank you Benjamin!

I have updated the changes. Please merge at your convenience.

bemoody commented 1 year ago

It looks like what this is actually doing is rounding the sample interval to the nearest nanosecond and then using fixed N-nanosecond steps (whereas previously it was rounding to the nearest microsecond.)

I get different results (on pandas main branch as well as on pandas 1.1.5), so perhaps this issue has been fixed?

Lama, what do you get if you run the following?

import pandas, platform
print(platform.python_implementation(), platform.python_version(), platform.platform())
print(pandas.__version__)
print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7))
lamawmouk commented 1 year ago

Lama, what do you get if you run the following?

import pandas, platform
print(platform.python_implementation(), platform.python_version(), platform.platform())
print(pandas.__version__)
print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7))

Below is my output:

>> print(platform.python_implementation(), platform.python_version(), platform.platform())
CPython 3.8.10 Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29
>> print(pandas.__version__)
2.0.2
>> print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7))
TimedeltaIndex([          '0 days 00:00:00', '0 days 00:00:00.000000016',
'0 days 00:00:00.000000033', '0 days 00:00:00.000000050',
'0 days 00:00:00.000000066', '0 days 00:00:00.000000083',
'0 days 00:00:00.000000100'],
dtype='timedelta64[ns]', freq=None)

My pandas version is 2.0.2 and has timedelta64[ns]

bemoody commented 1 year ago

Are you sure that's the same version you used before (that gave an end time of 2148-08-17 14:37:22.384261117)?

Here's what I get:

>>> r = wfdb.rdrecord('81739927', pn_dir='mimic4wdb/0.1.0/waves/p100/p10014354/81739927')
>>> r.to_dataframe()
                                I     II    III      V  aVR     Pleth      Resp
2148-08-16 09:00:17.566000000 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.582007043 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.598014086 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.614021129 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.630028172 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
...                            ..    ...    ...    ...  ...       ...       ...
2148-08-17 14:37:22.320891827 NaN -0.220 -0.285 -0.025  NaN  0.404297  0.487477
2148-08-17 14:37:22.336898870 NaN -0.030  0.005  0.025  NaN  0.396484  0.530238
2148-08-17 14:37:22.352905913 NaN -0.065 -0.030 -0.015  NaN  0.386475  0.574832
2148-08-17 14:37:22.368912956 NaN -0.265 -0.255 -0.125  NaN  0.375977  0.621258
2148-08-17 14:37:22.384920000 NaN -0.550 -0.610 -0.355  NaN  0.366211  0.664020

[6661120 rows x 7 columns]
lamawmouk commented 1 year ago

Are you sure that's the same version you used before (that gave an end time of 2148-08-17 14:37:22.384261117)? Yes, I reran it in a new environment.


>> r = wfdb.rdrecord('81739927', pn_dir='mimic4wdb/0.1.0/waves/p100/p10014354/81739927')
>> r.to_dataframe()
I     II    III      V  aVR     Pleth      Resp
2148-08-16 09:00:17.566000000 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.582007043 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.598014086 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.614021129 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
2148-08-16 09:00:17.630028172 NaN    NaN    NaN    NaN  NaN       NaN -0.751374
...                            ..    ...    ...    ...  ...       ...       ...
2148-08-17 14:37:22.320232945 NaN -0.220 -0.285 -0.025  NaN  0.404297  0.487477
2148-08-17 14:37:22.336239988 NaN -0.030  0.005  0.025  NaN  0.396484  0.530238
2148-08-17 14:37:22.352247031 NaN -0.065 -0.030 -0.015  NaN  0.386475  0.574832
2148-08-17 14:37:22.368254074 NaN -0.265 -0.255 -0.125  NaN  0.375977  0.621258
2148-08-17 14:37:22.384261117 NaN -0.550 -0.610 -0.355  NaN  0.366211  0.664020

[6661120 rows x 7 columns]

import pandas, platform print(platform.python_implementation(), platform.python_version(), platform.platform()) CPython 3.9.13 Windows-10-10.0.19045-SP0 print(pandas.version) 2.0.2 print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7)) TimedeltaIndex([ '0 days 00:00:00', '0 days 00:00:00.000000016', '0 days 00:00:00.000000033', '0 days 00:00:00.000000050', '0 days 00:00:00.000000066', '0 days 00:00:00.000000083', '0 days 00:00:00.000000100'], dtype='timedelta64[ns]', freq=None)

bemoody commented 1 year ago

Thanks for checking. It is really strange that we're seeing different results. While I'd like to get to the bottom of this, I don't really have time to debug it right now.

It looks like the sequence of values originates from np.linspace and therefore might depend on the version of numpy or how it was compiled.