d33bs commented 5 months ago

Description

This PR refines #175 by improving sorting performance through generated CytoTable metadata columns, which mainly help during large join operations. Anecdotally, I noticed that memory consumption for ORDER BY ALL on joined tables becomes very high when working with a larger dataset. Before this change, large join operations attempted to sort by every column included in the join. After this change, only CytoTable metadata columns are used for sorting, which reduces the processing required to create deterministic datasets.
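To illustrate the idea in the description, here is a minimal sketch in plain Python rather than DuckDB. The column names `meta_source` and `meta_rownum` are hypothetical stand-ins for the generated CytoTable metadata columns, and the sample rows are made up. It only shows that sorting on a small unique key is enough to make the output deterministic, without comparing every feature column.

```python
import random

# Hypothetical column names; the real CytoTable metadata columns may differ.
META_COLUMNS = ("meta_source", "meta_rownum")


def sort_by_metadata(rows, meta_columns=META_COLUMNS):
    """Order rows using only the metadata key, not every column."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in meta_columns))


# Made-up example rows: two metadata columns plus two feature columns.
rows = [
    {"meta_source": "plate_a.csv", "meta_rownum": 2, "area": 10.5, "intensity": 0.3},
    {"meta_source": "plate_a.csv", "meta_rownum": 1, "area": 11.0, "intensity": 0.9},
    {"meta_source": "plate_b.csv", "meta_rownum": 1, "area": 9.75, "intensity": 0.1},
]

# Simulate rows arriving in a different order (for example, after a join).
shuffled = rows[:]
random.Random(0).shuffle(shuffled)

# Because the metadata key is unique per row, both arrival orders
# produce the same output order.
ordered = sort_by_metadata(rows)
ordered_from_shuffled = sort_by_metadata(shuffled)
```

The sort key here has two fields regardless of how many feature columns the joined table carries, which is the property the description relies on.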

I hope to further refine this work through #193 and #176, which I feel would provide additional insights into performance and best-practice recommendations. I can also see how these might be required to validate the changes here, but I didn't want to hold up review comments (as these might also further inform efforts within those issues).

Closes #175

What is the nature of your change?

[ ] Bug fix (fixes an issue).
[x] Enhancement (adds functionality).
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected).
[ ] This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

[x] I have read the CONTRIBUTING.md guidelines.
[x] My code follows the style guidelines of this project.
[x] I have performed a self-review of my own code.
[x] I have commented my code, particularly in hard-to-understand areas.
[ ] I have made corresponding changes to the documentation.
[x] My changes generate no new warnings.
[x] New and existing unit tests pass locally with my changes.
[x] I have added tests that prove my fix is effective or that my feature works.
[x] I have deleted all non-relevant text in this pull request template.

gwaybio commented 3 months ago

(Some additional context, @falquaddoomi: we need to solve this for an upcoming project that will use CytoTable heavily. Thanks!)

d33bs commented 3 months ago

Thanks @gwaybio and @falquaddoomi for the reviews! I like the idea of an optional setting for this sorting mechanism, with a possible backup method which doesn't leverage CytoTable metadata.

Generally, I still feel that sorting should be required to guarantee no data loss with LIMIT and OFFSET, because this aligns with both DuckDB's docs and general SQL guidance. A hypothesis about what allowed this to succeed in earlier work: DuckDB may have retained all data with LIMIT and OFFSET queries because there was little competition among system processes and threads. I believe the failing LIMIT and OFFSET tests nearly always involved multithreaded behavior in moto, meaning procedures may have been subject to system scheduler decisions about which tasks to delay versus execute (or perhaps there were system thread or memory leaks of some kind).
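As a rough illustration of the point about LIMIT and OFFSET, the sketch below pages through a table with an explicit ORDER BY so that the pages cannot overlap or skip rows. It uses Python's built-in `sqlite3` purely as a stand-in for DuckDB, and the `cells` table, its columns, and the `read_chunks` helper are all hypothetical and not part of CytoTable.

```python
import sqlite3


def read_chunks(conn, table, key_column, chunk_size):
    """Yield successive pages of `table`, ordered by `key_column`.

    `table` and `key_column` are interpolated into the SQL text, so this
    sketch assumes they come from trusted code, not user input.
    """
    offset = 0
    while True:
        page = conn.execute(
            f"SELECT * FROM {table} ORDER BY {key_column} LIMIT ? OFFSET ?",
            (chunk_size, offset),
        ).fetchall()
        if not page:
            break
        yield page
        offset += chunk_size


# Hypothetical example data, inserted out of order on purpose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (object_number INTEGER, area REAL)")
conn.executemany(
    "INSERT INTO cells VALUES (?, ?)",
    [(n, n * 1.5) for n in (5, 3, 1, 4, 2)],
)

chunks = list(read_chunks(conn, "cells", "object_number", chunk_size=2))
# Concatenating the pages recovers every row exactly once.
recovered = [row for page in chunks for row in page]
```

Without the ORDER BY, SQL leaves the row order of each page unspecified, so nothing guarantees that separate LIMIT and OFFSET queries partition the table cleanly.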

While we plan to remove moto as a dependency by addressing #198, it still feels fuzzy to me whether these challenges are all the same. For example, it could be that moto triggered a coincidental mutation test with regard to DuckDB thread behavior (giving us further software visibility through a mutated test state). It could also have been a "perfect storm" of a bug in DuckDB >0.10.x,<1.0.0 combined with moto's behavior in tests. Then again, this could all just be my imagination, I'm not sure!

d33bs commented 3 months ago

Note: The initially failing tests for 4ffe9c1 appeared to stem from a Poetry (not CytoTable) dependency failure, possibly fixed through a deploy by the time of the third re-run. I don't think they are related to CytoTable code, as they occurred at the Poetry installation layer.

Errors were: AttributeError: '_CountedFileLock' object has no attribute 'thread_safe' from virtualenv and filelock site-packages.

Update: appears related to https://github.com/tox-dev/filelock/issues/337

d33bs commented 3 months ago

Thanks again @gwaybio and @falquaddoomi! I've added some updates which make sorting optional through a parameter called sort_output. These changes retain the ability to keep output sorted and also add an option to skip sorting altogether (reverting to earlier CytoTable behavior). I've kept the default as sort_output=True because I feel this is the safest option for the time being, but I understand there may be reasons to avoid it depending on the data or the performance desired.
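For readers who want a concrete picture of an optional sort, here is a hedged sketch of how a `sort_output` flag could toggle the ORDER BY clause in a chunked query. Only the name `sort_output` and its default of `True` come from this thread. The `build_chunk_query` function, its other parameters, and the column names are hypothetical and are not CytoTable's actual API.

```python
def build_chunk_query(
    table,
    chunk_size,
    offset,
    sort_output=True,
    sort_columns=("meta_source", "meta_rownum"),  # hypothetical column names
):
    """Build a chunked SELECT, adding ORDER BY only when sort_output is True."""
    order_clause = ""
    if sort_output:
        order_clause = " ORDER BY " + ", ".join(sort_columns)
    # int() guards the numeric parts; `table` is assumed to be trusted.
    return (
        f"SELECT * FROM {table}{order_clause} "
        f"LIMIT {int(chunk_size)} OFFSET {int(offset)}"
    )


# Default: deterministic, sorted chunks.
sorted_query = build_chunk_query("cells", 1000, 2000)

# Opt out: no ORDER BY, so chunk contents depend on the engine's row order.
unsorted_query = build_chunk_query("cells", 1000, 2000, sort_output=False)
```

Defaulting to the sorted form mirrors the reasoning above: it is the safer choice, and callers who accept the trade-off can disable it explicitly.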

d33bs commented 3 months ago

Cheers, thanks @falquaddoomi ! Agreed on comparisons; it will be interesting to see the contrast, excited to learn more!

cytomining / CytoTable

Increase sorting scalability via CytoTable metadata columns #204
