datajoint / datajoint-matlab

Relational data pipelines for the science lab
MIT License
42 stars 39 forks source link

Mutation of floating-point values stored in the database #12

Closed ahoenselaar closed 10 years ago

ahoenselaar commented 11 years ago

Let me preface this by saying that the issue I describe here is not specific to datajoint or mym. However, I think that anybody who uses MySQL as a storage backend for scientific data should be aware of certain issues regarding floating-point columns.

Expected behavior

When storing a value x in a database column of an appropriate datatype, we expect to be able to retrieve the same value later. In particular we hope that a 32-bit float ("single") value with a certain binary representation in memory will have that same binary representation when retrieved from a "float" column in the database. The same should hold for a 64-bit floats ("double") and "double" columns.

Potential problem sources

Data exchange between MySQL and datajoint or other connectors is entirely text based. Hence, all floating-point numbers passed to and from datajoint will go through repeated sprintf-style formatting into a string and will later be parsed back into a binary representation. Whether the original value is retained or not depends on the number of digits and algorithms used for conversion.

Observed behavior for 32-bit floats

32-bit floats are returned from the database with significant deviations from the original values. For this experiment, only values that had an exact representation as 32-bit floats were inserted into the table. MySQL formats the values in float columns very aggressively using no more than six digits in the output:

MariaDB [lfp]> SELECT fval FROM tests LIMIT 5;
+----------+
| fval     |
+----------+
| -9162.98 |
| -894.557 |
| -8368.18 |
|  3223.23 |
|  -464.57 |
+----------+
5 rows in set (0.00 sec)

This level of precision is not sufficient to recover the original float value. Most deviations are within 50_ULP(float) but there are outliers of up to 90_ULP(float) gh1 As indicated in the datajoint documentation, the float datatype should not be used when high precision is required. Furthermore, the MySQL documentation is quite devoid of any guarantees regarding floats or their conversion from and to strings or the decimal data type. However, scrambling the last six bits of the mantissa of an already imprecise value just a bit more seems unnecessary.

Observed behavior for 64-bit floats

MySQL uses the dtoa library for conversion of double-precision floating points to and from strings. However, "correct" retrieval from the database as specified above is only possible if the decimal representation uses 15 digits or less. Again, this causes slight mutations of values returned from database queries, most in the range of 3_ULP(double), a few reach 4_ULP(double). While also quite unnecessary - a single additional digit could have made all the difference - this strikes me as acceptable for most scenarios.

Work-arounds

FlorianFranzen commented 11 years ago

I guess we can not do anything about this. As far as I know this is a weakness of all Structured Query Language implementations.

You created a nice overview though, so we probably should move this to the wiki.

FlorianFranzen commented 10 years ago

I added this as a page to the wiki: https://github.com/datajoint/datajoint-matlab/wiki/Floating-point-mutations

I guess there is nothing more to do, so I will close this ticket.