Open lamawmouk opened 1 year ago
Thanks, Lama!
I have to nitpick here, though.
It looks like what this is actually doing is rounding the sample interval to the nearest nanosecond and then using fixed N-nanosecond steps (whereas previously it was rounding to the nearest microsecond.)
So this code is much more accurate than before but still has the same fundamental problem.
Here's how I figure the timestamp:
$ bc -l
9 * 3600 + 17.566 + 6661119 / 62.4725
139042.38492032494297490895
last - (24 + 14) * 3600 - 37 * 60
22.38492032494297490895
so the end time in nanseconds should be 14:37:22.384920325.
(On the other hand, 9 3600 + 17.566 + 6661119 0.016007043 - (24 + 14) 3600 - 37 60 = 22.384261117 exactly.)
I think this is a bug in pandas. But still this is an improvement and good enough for many practical purposes.
This also fails in the case where base_datetime is None. Try it out with a record that doesn't have a base time, such as one of the records from mitdb
. If base_datetime
is None then we would want to use get_elapsed_time
instead of get_absolute_time
.
Thank you Benjamin!
I have updated the changes. Please merge at your convenience.
It looks like what this is actually doing is rounding the sample interval to the nearest nanosecond and then using fixed N-nanosecond steps (whereas previously it was rounding to the nearest microsecond.)
I get different results (on pandas main branch as well as on pandas 1.1.5), so perhaps this issue has been fixed?
Lama, what do you get if you run the following?
import pandas, platform
print(platform.python_implementation(), platform.python_version(), platform.platform())
print(pandas.__version__)
print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7))
Lama, what do you get if you run the following?
import pandas, platform print(platform.python_implementation(), platform.python_version(), platform.platform()) print(pandas.__version__) print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7))
Below is my output:
>> print(platform.python_implementation(), platform.python_version(), platform.platform()) CPython 3.8.10 Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29 >> print(pandas.__version__) 2.0.2 >> print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7)) TimedeltaIndex([ '0 days 00:00:00', '0 days 00:00:00.000000016', '0 days 00:00:00.000000033', '0 days 00:00:00.000000050', '0 days 00:00:00.000000066', '0 days 00:00:00.000000083', '0 days 00:00:00.000000100'], dtype='timedelta64[ns]', freq=None)
My pandas version is 2.0.2 and has timedelta64[ns]
Are you sure that's the same version you used before (that gave an end time of 2148-08-17 14:37:22.384261117)?
Here's what I get:
>>> r = wfdb.rdrecord('81739927', pn_dir='mimic4wdb/0.1.0/waves/p100/p10014354/81739927')
>>> r.to_dataframe()
I II III V aVR Pleth Resp
2148-08-16 09:00:17.566000000 NaN NaN NaN NaN NaN NaN -0.751374
2148-08-16 09:00:17.582007043 NaN NaN NaN NaN NaN NaN -0.751374
2148-08-16 09:00:17.598014086 NaN NaN NaN NaN NaN NaN -0.751374
2148-08-16 09:00:17.614021129 NaN NaN NaN NaN NaN NaN -0.751374
2148-08-16 09:00:17.630028172 NaN NaN NaN NaN NaN NaN -0.751374
... .. ... ... ... ... ... ...
2148-08-17 14:37:22.320891827 NaN -0.220 -0.285 -0.025 NaN 0.404297 0.487477
2148-08-17 14:37:22.336898870 NaN -0.030 0.005 0.025 NaN 0.396484 0.530238
2148-08-17 14:37:22.352905913 NaN -0.065 -0.030 -0.015 NaN 0.386475 0.574832
2148-08-17 14:37:22.368912956 NaN -0.265 -0.255 -0.125 NaN 0.375977 0.621258
2148-08-17 14:37:22.384920000 NaN -0.550 -0.610 -0.355 NaN 0.366211 0.664020
[6661120 rows x 7 columns]
Are you sure that's the same version you used before (that gave an end time of 2148-08-17 14:37:22.384261117)? Yes, I reran it in a new environment.
>> r = wfdb.rdrecord('81739927', pn_dir='mimic4wdb/0.1.0/waves/p100/p10014354/81739927') >> r.to_dataframe() I II III V aVR Pleth Resp 2148-08-16 09:00:17.566000000 NaN NaN NaN NaN NaN NaN -0.751374 2148-08-16 09:00:17.582007043 NaN NaN NaN NaN NaN NaN -0.751374 2148-08-16 09:00:17.598014086 NaN NaN NaN NaN NaN NaN -0.751374 2148-08-16 09:00:17.614021129 NaN NaN NaN NaN NaN NaN -0.751374 2148-08-16 09:00:17.630028172 NaN NaN NaN NaN NaN NaN -0.751374 ... .. ... ... ... ... ... ... 2148-08-17 14:37:22.320232945 NaN -0.220 -0.285 -0.025 NaN 0.404297 0.487477 2148-08-17 14:37:22.336239988 NaN -0.030 0.005 0.025 NaN 0.396484 0.530238 2148-08-17 14:37:22.352247031 NaN -0.065 -0.030 -0.015 NaN 0.386475 0.574832 2148-08-17 14:37:22.368254074 NaN -0.265 -0.255 -0.125 NaN 0.375977 0.621258 2148-08-17 14:37:22.384261117 NaN -0.550 -0.610 -0.355 NaN 0.366211 0.664020
[6661120 rows x 7 columns]
import pandas, platform print(platform.python_implementation(), platform.python_version(), platform.platform()) CPython 3.9.13 Windows-10-10.0.19045-SP0 print(pandas.version) 2.0.2 print(pandas.timedelta_range(pandas.Timedelta(0), pandas.Timedelta(100), periods=7)) TimedeltaIndex([ '0 days 00:00:00', '0 days 00:00:00.000000016', '0 days 00:00:00.000000033', '0 days 00:00:00.000000050', '0 days 00:00:00.000000066', '0 days 00:00:00.000000083', '0 days 00:00:00.000000100'], dtype='timedelta64[ns]', freq=None)
Thanks for checking. It is really strange that we're seeing different results. While I'd like to get to the bottom of this, I don't really have time to debug it right now.
It looks like the sequence of values originates from np.linspace
and therefore might depend on the version of numpy or how it was compiled.
This pull request fixes issue #444.
This PR refactors the index generation logic in the
to_dataframe
method inBaseRecord class
The current code uses the
freq
argument in thepd.date_range
andpd.timedelta_range
functions to generate the index for the DataFrame. However, usingend
argument improves accuracy of the time without being off by 0.287secThe get_absolute_time is correct to the nearest microsecond 14:37:22.384920 and the to_dataframe() is correct to the nearest nanosecond 14:37:22.384261117 Please review the PR at your convenience.