Error on processing `moto`-based files within scheduled, un-locked dependency tests

d33bs commented 5 months ago

An error occurred with two of the tests during a scheduled and un-locked dependency test earlier today. The only failing test in these cases is: tests/test_convert_threaded.py::test_convert_s3_path_csv. This seems to happen due to a missing file from the mocked moto bucket to a local file, resulting in a failed data query from DuckDB.

Reference to the job: https://github.com/cytomining/CytoTable/actions/runs/8812487779/job/24204265980

duckdb.duckdb.InvalidInputException: Invalid Input Error: Attempting to execute an unsuccessful or closed pending query result
Error: IO Error: No files found that match the pattern "/tmp/tmpv8ak2gtt/example/2/cells.csv"
...
FAILED tests/test_convert_threaded.py::test_convert_s3_path_csv - parsl.dataflow.errors.DependencyError: <exception str() failed>

d33bs commented 5 months ago

This turned out to be a cloudpathlib caching issue, which appears to have changed enough in recent versions enough that the garbage collector removes cached files before they can be referenced by CytoTable. I believe this was a bug present in the code prior to the updates, rather than being caused by the dependencies themselves (though cloudpathlib or related dependencies may have impacted the results).

On resolving the caching issue, there appears to be a credentialing issue from botocore.

d33bs commented 4 months ago

Looking further into this I feel that there's likely something happening which doesn't play nice between moto and cloudpathlib. There are alternatives such as localstack which we could explore, but generally I think we're following a path towards a possible anti-pattern with mocked resources which is consuming too much time for the focus of the project. A reference from Software Engineering at Google in Chapter 13 under the "Real Implementations" section somewhat backs this up:

Although test doubles can be invaluable testing tools, our first choice for tests is to use the real implementations of the system under test’s dependencies; that is, the same implementations that are used in production code. Tests have higher fidelity when they execute code as it will be executed in production, and using real implementations helps accomplish this. ... Using real implementations for dependencies makes the system under test more realistic given that all code in these real implementations will be executed in the test.

I chatted with @gwaybio about this briefly, and based on our discussion feel that removing moto to rely on a real-world test would have better outcomes. This would allow us to avoid a mocked resource dependency for testing (and related issues), as well as provide us an avenue for a larger data test (data which we don't want to store in the repository)(possibly addressing https://github.com/cytomining/CytoTable/issues/193). Likely it'd be good to follow the patterns and suggestions on a testing resource from cytomining/pycytominer#375 .

Rough steps:

[ ] Remove moto dependency
[ ] Remove pytest fixutres related to moto
[ ] Modify moto-related tests to read from cellpainting-gallery or similar S3 resource.
[ ] Limit these renewed tests to a certain time limit within reason (possibly leveraging something like https://github.com/bloomberg/pytest-memray or other built-in features of pytest if possible)
[ ] Use custom markers on these renewed tests to skip on local development and be used for periodic GH Actions-based scheduled testing
[ ] Modify GH Actions workflows to run these tests weekly.

cytomining / CytoTable

Error on processing `moto`-based files within scheduled, un-locked dependency tests #198