cytomining / CytoTable

Transform CellProfiler and DeepProfiler data for processing image-based profiling readouts with Pycytominer and other Cytomining tools.
https://cytomining.github.io/CytoTable/
BSD 3-Clause "New" or "Revised" License
7 stars 5 forks source link

Verify how large floating-point numbers are exported #187

Open d33bs opened 6 months ago

d33bs commented 6 months ago

While working on cytominer-database data comparisons for #30 I noticed variations in how floating point numbers are treated. As an example, the number 3.5215257120407011 may be found within .tests/data/cytominer-database/data_b/A01-2/Cytoplasm.csv. This number is extracted as 3.521525712040701 (one decimal place less, through automatic CSV settings in DuckDB interpreted as a DOUBLE). Similarly, cyotminer-database appears to change this number to 3.5215257120407006 (three additional decimal places, I believe this is done through Pandas CSV and/or pd.Series.as_type() casts).

I found that DuckDB can correctly interpret this number automatically as a DECIMAL but that this option doesn't appear to be available through the CSV reader yet (planning to open an issue here). As it stands, this number is interpreted as a DOUBLE, which truncates the data.

I did some work to try and research how this number is interpreted in Python, NumPy, Pandas, DuckDB, and Parquet. Both Python (through Decimal) and NumPy (through longdouble) were able to interpret an extended version of the number but not the precise number itself (similar to how cytominer-database operated above, adding decimal places). Pandas is able to accurately interpret the number through PyArrow types (specifically decimal128). DuckDB seems to follow the same decimal-style formatting as PyArrow and is able to inference the width + scale when reading the number alone. From an Arrow decimal128 type, Parquet is able to write and read the number accurately through I believe the decimal logical type.

See here for a Google Colab notebook with findings (and a gist backup).

d33bs commented 6 months ago

Opened https://github.com/duckdb/duckdb/issues/11639 in reference to this issue.

d33bs commented 6 months ago

Just a quick update to mention I thought to try PyArrow arrays and the built-in CSV reader. The CSV reader turned out to have similar challenges (truncating the number one decimal place). When attempting to create an array of pa.decimal128(17, 16) type with the number from a Python List I saw an error (though I may be using it incorrectly). I updated the notebook and gist as a reference point.