The changes in this PR address #198 by removing `moto` and related tests to avoid existing and future challenges with mocked S3 resources (in addition to the failing tests, a short justification can be found here). Moving forward, CytoTable will now be tested against a CSV and a SQLite resource from the `cellpainting-gallery` (as outlined in the fixtures). As a result, I believe we now address #193 within reason because of the SQLite addition (please don't hesitate to let me know if you feel otherwise and we should keep that issue open).
In the process of developing this fix I added a new preset which enables compatibility with JUMP data (`cpg0016-jump`) from the `cellpainting-gallery` in order to effectively perform a test against an S3 SQLite object (no other presets appeared to exactly match this need). CC @jenna-tomkinson
Some notes:
I noticed functions within the `sources` module experiencing challenges from Parsl app decoration due to how the `cloudpathlib` client is constructed and how logging operates with HTEX executor tasks (which sometimes makes errors less visible). Thinking this through, I recognized there didn't seem to be a major benefit to decorating these functions. I therefore changed them to be un-decorated, which simplified troubleshooting and feels like a good way to reduce complexity in the future.
I moved the logic found within `convert` for checking that a file has at least one row into `sources` through `_file_is_more_than_one_line`, to help keep `cloudpathlib.CloudPath` usage consistent and to filter data sources earlier in the flow, before we begin processing them.
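A minimal sketch of such a check (assuming a local `pathlib.Path` here for simplicity; the real helper would also accept a `cloudpathlib.CloudPath`, and this implementation is illustrative rather than CytoTable's actual code):

```python
import tempfile
from pathlib import Path

def file_is_more_than_one_line(path: Path) -> bool:
    """Return True if the file holds more than one line,
    i.e. a header plus at least one data row."""
    with path.open("r") as handle:
        # Read at most two lines; no need to scan the whole file.
        for count, _ in enumerate(handle, start=1):
            if count > 1:
                return True
    return False

# Example usage with a small temporary CSV file (header + one data row):
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("col_a,col_b\n1,2\n")
print(file_is_more_than_one_line(Path(tmp.name)))
```

Filtering on a check like this early keeps empty sources from reaching the more expensive processing stages.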
I noticed `cytominer-database` was consistently experiencing challenges within Python 3.12 environments (tests couldn't find the `ingest` command line option). As a result, I've moved to static files produced from `cytominer-database`, which increase the size of this repository but should help us save time during tests and avoid the error I was seeing during development.
After removing the `cytominer-database` processing step, I noticed that some CSVs were missing from the `cytominer-database` test data; these have been added in this PR.
A chunk size of 30000 was used after trying many different iterations with the S3 SQLite file test. GitHub Actions runner containers appear to experience resource challenges at sizes above this. GitHub Actions doesn't appear to provide exact details on resource use overages upfront, so I expect the chunk size may need tinkering over time.
Sorting results at any chunk size with the S3 SQLite file appeared to trigger GitHub Actions runner container challenges, so I've turned sorting off for that particular test.
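As a rough illustration of why the chunk size matters here (a generic sketch using Python's built-in `sqlite3`, not CytoTable's actual reader): pulling rows in bounded batches keeps peak memory proportional to the chunk size rather than to the table size, which is what the resource-constrained runners need:

```python
import sqlite3
from typing import Iterator, List

CHUNK_SIZE = 30_000  # the value settled on for the S3 SQLite test

def iter_chunks(conn: sqlite3.Connection, table: str,
                chunk_size: int = CHUNK_SIZE) -> Iterator[List]:
    """Yield rows from `table` in batches of at most `chunk_size`,
    so peak memory stays bounded regardless of table size."""
    cursor = conn.execute(f"SELECT * FROM {table}")  # illustrative only
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows

# Example with a tiny in-memory database and a small chunk size:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (id INTEGER, value REAL)")
conn.executemany("INSERT INTO cells VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(5)])
for chunk in iter_chunks(conn, "cells", chunk_size=2):
    print(len(chunk))  # batches of 2, 2, then 1
```

Sorting, by contrast, generally requires materializing (or spilling) the full result set, which is consistent with it failing at every chunk size.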
I've split the `large_data_tests`-marked tests into a separate job. Running them together with the other tests and the test matrix appeared to strain the resources available in GitHub Actions runner containers. While the time involved in running the single test is low for the time being, we may need to split it into another scheduled test (or something similar) if/when the mark's tests grow. This may be related to challenges with #211, Python garbage collection, or GitHub Actions runner container storage availability (more investigation is needed).
A minor fix to address an issue related to https://github.com/duckdb/duckdb/issues/3243 was applied to the `duckdb` connection setup (the issue seems to sporadically appear, disappear, and reappear within GitHub Actions runner environments).
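Workarounds for this kind of intermittent failure typically follow a retry pattern; the sketch below is a generic illustration rather than the exact fix applied here, and the flaky `connect` stand-in is hypothetical (standing in for a `duckdb` connection or extension-install call that sporadically fails):

```python
import time

def with_retries(operation, attempts: int = 3, delay: float = 0.1):
    """Call `operation`, retrying up to `attempts` times on RuntimeError.
    Intended for intermittent failures such as extension-install races.
    (In real code, narrow RuntimeError to the actual exception raised.)"""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except RuntimeError as error:
            last_error = error
            time.sleep(delay)
    raise last_error

# Hypothetical flaky stand-in: fails twice, then succeeds.
calls = {"count": 0}

def connect():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("sporadic failure")
    return "connection"

print(with_retries(connect))  # succeeds on the third attempt
```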
Closes #198
Closes #193
What is the nature of your change?
[x] Bug fix (fixes an issue).
[ ] Enhancement (adds functionality).
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected).
[ ] This change requires a documentation update.
Checklist
Please ensure that all boxes are checked before indicating that a pull request is ready for review.
Thank you @gwaybio ! After making some changes I feel good about how this now looks. I plan to create a new issue to explore memory resource reductions for sorted joins.