cytomining / CytoTable

Transform CellProfiler and DeepProfiler data for processing image-based profiling readouts with Pycytominer and other Cytomining tools.
https://cytomining.github.io/CytoTable/
BSD 3-Clause "New" or "Revised" License

Explore value sorting determinism (and possible changes) #175

Closed (d33bs closed 3 months ago)

d33bs commented 6 months ago

From #174:

I found that duckdb==0.10.1 sorts the values of joined data results less deterministically than before. We have previously relied on PyArrow to consistently sort all columns and their values for testing, and this may also need adjustment. As a result, I've added SQL ORDER BY ALL (which orders rows by all columns from left to right) to one test which consistently failed on comparisons. As a to-do, we should explore why PyArrow is unable to perform the same level of data organization for testing, and possibly raise an issue with that project if the results don't align with its design. Alternatively, moving to ORDER BY ALL or some equivalent may be necessary to replace the existing PyArrow-based data sorting for testing (otherwise we may see sporadic failures over time).

Because of the importance of this issue, I'm adding that we need example cases where the fix has been validated with datasets larger than those used for testing.

d33bs commented 6 months ago

After thinking on this more and exploring a bit, I'm outlining my thoughts and findings so far below. This appeared again in #181, so I'm focusing on figuring out more of the reasons why this might occur.

Patterns

When this happens, the following appear to be consistent patterns:

Possible explanations

As a quick check I tried verifying that PyArrow sorting works the way it should. It seems that it does properly sort all values by all columns when implemented the way it is in CytoTable tests. See here for code demonstrating this.

I feel there are several other possible explanations, which I'll work through in order to verify what's happening.