The changes in this PR address #198 by removing `moto` and related tests to avoid existing and future challenges with mocked S3 resources (in addition to the failing tests, a short justification can be found here). Moving forward, CytoTable will now be tested against a CSV and a SQLite resource from the `cellpainting-gallery` (as outlined in the fixtures). As a result, I believe we now address #193 within reason because of the SQLite addition (please don't hesitate to let me know if you feel otherwise and we should keep that issue open).
In the process of developing this fix I added a new preset which enables compatibility with JUMP data (`cpg0016-jump`) from the `cellpainting-gallery` in order to effectively perform a test against an S3 SQLite object (no other presets appeared to exactly match this need). CC @jenna-tomkinson
Some notes:
I noticed functions within the `sources` module experiencing challenges from Parsl app decoration due to how the `cloudpathlib` client is constructed and how logging operates with HTEX executor tasks (which sometimes makes errors less visible). Thinking this through, I recognized there didn't seem to be a major benefit to decorating these functions. I therefore changed them to be un-decorated, which simplified troubleshooting and feels like a good way to reduce complexity in the future.
I moved the logic found within `convert` for checking that a file has at least one row into `sources` through `_file_is_more_than_one_line`, to help keep `cloudpathlib.CloudPath` usage consistent and to filter data sources earlier in the flow, before we begin processing them.
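A minimal sketch of such a check (assuming a local `pathlib.Path` here for simplicity; the real helper would also accept a `cloudpathlib.CloudPath`, and this implementation is illustrative rather than CytoTable's actual code):

```python
import tempfile
from pathlib import Path

def file_is_more_than_one_line(path: Path) -> bool:
    """Return True if the file holds more than one line,
    i.e. a header plus at least one data row."""
    with path.open("r") as handle:
        # Read at most two lines; no need to scan the whole file.
        for count, _ in enumerate(handle, start=1):
            if count > 1:
                return True
    return False

# Example usage with a small temporary CSV file (header + one data row):
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("col_a,col_b\n1,2\n")
print(file_is_more_than_one_line(Path(tmp.name)))
```

Filtering on a check like this early keeps empty sources from reaching the more expensive processing stages.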
I noticed `cytominer-database` was consistently experiencing challenges within Python 3.12 environments (tests couldn't find the `ingest` command line option). As a result, I've moved to static files produced from `cytominer-database`, which increase the size of this repository but should help us save time during tests and avoid the error I was seeing during development.
After removing the `cytominer-database` processing step, I noticed that some CSVs were missing from the `cytominer-database` test data; these have been added in this PR.
A chunk size of 30000 was used after trying many different iterations with the S3 SQLite file test. GitHub Actions runner containers appear to experience resource challenges at sizes above this. GitHub Actions doesn't appear to provide exact details on resource use overages upfront, so I expect the chunk size may need tinkering over time.
Sorting results at any chunk size with the S3 SQLite file appeared to trigger GitHub Actions runner container challenges, so I've turned sorting off for that particular test.
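As a rough illustration of why the chunk size matters here (a generic sketch using Python's built-in `sqlite3`, not CytoTable's actual reader): pulling rows in bounded batches keeps peak memory proportional to the chunk size rather than to the table size, which is what the resource-constrained runners need:

```python
import sqlite3
from typing import Iterator, List

CHUNK_SIZE = 30_000  # the value settled on for the S3 SQLite test

def iter_chunks(conn: sqlite3.Connection, table: str,
                chunk_size: int = CHUNK_SIZE) -> Iterator[List]:
    """Yield rows from `table` in batches of at most `chunk_size`,
    so peak memory stays bounded regardless of table size."""
    cursor = conn.execute(f"SELECT * FROM {table}")  # illustrative only
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows

# Example with a tiny in-memory database and a small chunk size:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (id INTEGER, value REAL)")
conn.executemany("INSERT INTO cells VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(5)])
for chunk in iter_chunks(conn, "cells", chunk_size=2):
    print(len(chunk))  # batches of 2, 2, then 1
```

Sorting, by contrast, generally requires materializing (or spilling) the full result set, which is consistent with it failing at every chunk size.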
I've split the `large_data_tests`-marked tests into a separate job. Running them together with the other tests and the test matrix appeared to strain the resources available in GitHub Actions runner containers. While the time involved in running the single test is low for the time being, we may need to split it into another scheduled test (or something similar) if/when the mark's tests grow. This may be related to challenges with #211, Python garbage collection, or GitHub Actions runner container storage availability (more investigation is needed).
A minor fix to address an issue related to https://github.com/duckdb/duckdb/issues/3243 was applied to the `duckdb` connection setup (the issue seems to sporadically appear, disappear, and reappear within GitHub Actions runner environments).
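Workarounds for this kind of intermittent failure typically follow a retry pattern; the sketch below is a generic illustration rather than the exact fix applied here, and the flaky `connect` stand-in is hypothetical (standing in for a `duckdb` connection or extension-install call that sporadically fails):

```python
import time

def with_retries(operation, attempts: int = 3, delay: float = 0.1):
    """Call `operation`, retrying up to `attempts` times on RuntimeError.
    Intended for intermittent failures such as extension-install races.
    (In real code, narrow RuntimeError to the actual exception raised.)"""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except RuntimeError as error:
            last_error = error
            time.sleep(delay)
    raise last_error

# Hypothetical flaky stand-in: fails twice, then succeeds.
calls = {"count": 0}

def connect():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("sporadic failure")
    return "connection"

print(with_retries(connect))  # succeeds on the third attempt
```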
Closes #198
Closes #193
What is the nature of your change?
[x] Bug fix (fixes an issue).
[ ] Enhancement (adds functionality).
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected).
[ ] This change requires a documentation update.
Checklist
Please ensure that all boxes are checked before indicating that a pull request is ready for review.
Thank you @gwaybio ! After making some changes I feel good about how this now looks. I plan to create a new issue to explore memory resource reductions for sorted joins.