Mutation of floating-point values stored in the database

ahoenselaar commented 11 years ago

Let me preface this by saying that the issue I describe here is not specific to datajoint or mym. However, I think that anybody who uses MySQL as a storage backend for scientific data should be aware of certain issues regarding floating-point columns.

Expected behavior

When storing a value x in a database column of an appropriate datatype, we expect to be able to retrieve the same value later. In particular we hope that a 32-bit float ("single") value with a certain binary representation in memory will have that same binary representation when retrieved from a "float" column in the database. The same should hold for a 64-bit floats ("double") and "double" columns.

Potential problem sources

Data exchange between MySQL and datajoint or other connectors is entirely text based. Hence, all floating-point numbers passed to and from datajoint will go through repeated sprintf-style formatting into a string and will later be parsed back into a binary representation. Whether the original value is retained or not depends on the number of digits and algorithms used for conversion.

Observed behavior for 32-bit floats

32-bit floats are returned from the database with significant deviations from the original values. For this experiment, only values that had an exact representation as 32-bit floats were inserted into the table. MySQL formats the values in float columns very aggressively using no more than six digits in the output:

MariaDB [lfp]> SELECT fval FROM tests LIMIT 5;
+----------+
| fval     |
+----------+
| -9162.98 |
| -894.557 |
| -8368.18 |
|  3223.23 |
|  -464.57 |
+----------+
5 rows in set (0.00 sec)

This level of precision is not sufficient to recover the original float value. Most deviations are within 50_ULP(float) but there are outliers of up to 90_ULP(float) gh1 As indicated in the datajoint documentation, the float datatype should not be used when high precision is required. Furthermore, the MySQL documentation is quite devoid of any guarantees regarding floats or their conversion from and to strings or the decimal data type. However, scrambling the last six bits of the mantissa of an already imprecise value just a bit more seems unnecessary.

Observed behavior for 64-bit floats

MySQL uses the dtoa library for conversion of double-precision floating points to and from strings. However, "correct" retrieval from the database as specified above is only possible if the decimal representation uses 15 digits or less. Again, this causes slight mutations of values returned from database queries, most in the range of 3_ULP(double), a few reach 4_ULP(double). While also quite unnecessary - a single additional digit could have made all the difference - this strikes me as acceptable for most scenarios.

Work-arounds

Use "decimal" columns whenever the value range of your data is predictable.
Avoid the "float" data type, unless precision is really not a concern
Float-values can be widened to double precision by multiplying with the numeric literal 1.0 in SELECT statements. This produces more digits in the output and completely abolishes the problem for 32-bit floats.
Use blobs to store data that can remain opaque to the database, i.e. won't be used in restrictions
NEVER restrict Relvars using equality with a floating-point value. This is a bad practice for floating point numbers in any setting. In MySQL, things get even worse because "float" columns are widened to double before performing comparisons. So you can insert the a tuple with the float value 0.3, then query for it and get zero hits.

FlorianFranzen commented 11 years ago

I guess we can not do anything about this. As far as I know this is a weakness of all Structured Query Language implementations.

You created a nice overview though, so we probably should move this to the wiki.

FlorianFranzen commented 10 years ago

I added this as a page to the wiki: https://github.com/datajoint/datajoint-matlab/wiki/Floating-point-mutations

I guess there is nothing more to do, so I will close this ticket.

datajoint / datajoint-matlab