chanzuckerberg / cryoet-data-portal

CryoET Data Portal
MIT License

Identify framework for validation #932

Closed. manasaV3 closed this issue 1 month ago

daniel-ji commented 2 months ago

AWS S3 Data Validation Proposal

TL;DR: I don't think there's a compelling case to switch away from Pytest + Allure right now; it's probably better to keep things simple and avoid code rewrites. Definitely open to any suggestions and opinions though!


Needs
Ease of integration: We can run the validation on an EC2 instance against the staging bucket, or locally, and then store the test results. This is already largely handled by our current ingestion_tools.common.fs script.

Flexibility: The data/file types and structures vary significantly across files for different objects (tomograms, tiltseries, annotations, etc.). It doesn't seem feasible or necessary to create a full-blown schema to model all the data; instead, it seems more logical to work case by case for different kinds of files: parse the data into an object (possibly an instance of a class) and then use it in various tests. The framework should let us pull in whatever other libraries we need to read files and extract the relevant data to validate.

Extensibility: We want to be able to easily add new, custom validation rules as requirements change.

Reporting: We should be able to report the errors in a concise, organized manner.
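
The needs above can be sketched with plain Pytest. This is a hypothetical example, not the portal's real data model: the file/field names, the `TiltSeriesInfo` class, `parse_tiltseries`, and the plausibility thresholds are all illustrative.

```python
# Hypothetical sketch: per-file-type parsing plus case-by-case checks, run under pytest.
# Field names and thresholds are illustrative, not the portal's real schema.
from dataclasses import dataclass


@dataclass
class TiltSeriesInfo:
    """Minimal container for the fields a tiltseries check might need."""
    num_frames: int
    pixel_spacing: float  # angstroms per pixel


def parse_tiltseries(raw: dict) -> TiltSeriesInfo:
    """Parse raw metadata (e.g. read from S3) into a typed object."""
    return TiltSeriesInfo(
        num_frames=int(raw["frame_count"]),
        pixel_spacing=float(raw["pixel_spacing"]),
    )


def test_tiltseries_is_plausible():
    """A pytest-style check: plain asserts, no schema framework required."""
    raw = {"frame_count": 41, "pixel_spacing": 2.7}  # stand-in for S3 data
    ts = parse_tiltseries(raw)
    assert ts.num_frames > 0
    assert 0.5 <= ts.pixel_spacing <= 100.0
```

Adding a new validation rule is then just adding another `test_*` function, which keeps the extensibility requirement cheap.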


Overall, I don't think any Python framework cleanly handles the variety of file types we might receive for different objects. We have to do the file reading/parsing ourselves regardless, so we should write it from scratch to keep it as customizable and lightweight as possible, and then integrate it with a lightweight, well-supported framework so there's no unnecessary boilerplate or overhead.

Schema-based validation libraries like Pydantic, Marshmallow, and Cerberus may be worth considering if we're planning heavy inter-file checking across different objects' files (e.g. verifying that run data lines up with tiltseries, alignment files, and tomograms, which are all interrelated). As of now, though, I'm leaning towards a KISS approach, where Pytest alone is likely sufficient for all the checking. Then again, I'm new to all of this, so maybe we'll only find out which was the better choice once we try building it. Right now it feels unnecessary to transform S3 file data into a schema structure: there would really only be one or two levels of classes, one being the container of all the possible objects and the other being the objects themselves.
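
For comparison, here is a rough sketch of what the schema-based alternative could look like, assuming Pydantic v2. The models and field names are hypothetical, chosen only to illustrate an inter-file check between a tiltseries and a tomogram.

```python
# Hypothetical sketch of schema-based inter-file validation with Pydantic v2.
# Model and field names are illustrative, not the portal's real data model.
from pydantic import BaseModel, ValidationError, model_validator


class TiltSeries(BaseModel):
    frame_count: int


class Tomogram(BaseModel):
    source_frame_count: int  # frames the reconstruction claims to use


class Run(BaseModel):
    tiltseries: TiltSeries
    tomogram: Tomogram

    @model_validator(mode="after")
    def frames_line_up(self):
        # Inter-file check: the tomogram must agree with its source tiltseries.
        if self.tomogram.source_frame_count != self.tiltseries.frame_count:
            raise ValueError("tomogram and tiltseries frame counts disagree")
        return self


# A mismatched pair is rejected at construction time:
try:
    Run(
        tiltseries=TiltSeries(frame_count=41),
        tomogram=Tomogram(source_frame_count=40),
    )
except ValidationError as e:
    print("caught", e.error_count(), "validation error(s)")
```

The upside is that cross-object consistency lives in one declarative place; the downside is exactly the modeling overhead described above.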

In regard to reporting, Allure strikes a good balance: manageable complexity, a not-too-steep learning curve, hierarchical report views, and room for future customization. Utz has already done some work to set it up, and it matches the medium-sized scope of our S3 data validation task well, so I think it's a good choice to continue with moving forward.
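
For reference, a typical Allure + Pytest invocation looks roughly like the following, assuming the allure-pytest plugin and the Allure CLI are installed (paths are illustrative):

```shell
pytest tests/ --alluredir=allure-results   # allure-pytest writes raw result files here
allure serve allure-results                # Allure CLI renders and opens the HTML report
```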


Testing Framework Comparison

Pytest https://docs.pytest.org/en/stable/

Pros:

Cons:

Pydantic https://docs.pydantic.dev/latest/

Pros:

Cons:

Great Expectations https://docs.greatexpectations.io/docs/oss/about/

Pros:

Cons:

Cerberus https://docs.python-cerberus.org/index.html

Pretty similar to Pydantic (schema-based), but with a smaller ecosystem and simpler validation rules. It operates only on dictionaries, so we may have to do extra conversion work, since different files give us varying data structures. It also doesn't support modeling the way Pydantic does. With Pydantic already being used for dataset config validation, it makes sense to just pick Pydantic over this.

Marshmallow https://marshmallow.readthedocs.io/en/stable/
Essentially Pydantic, but with a heavier emphasis on data serialization. We would probably just use Pydantic.

Pandera https://pandera.readthedocs.io/en/stable/dtype_validation.html

Schema-based data validation, but built on Pandas DataFrames. Many of our files are not trivially converted to DataFrames, so this does not suit our use case. The closest alternative worth using is probably Pydantic.


Reporting Framework Comparison

Allure https://allurereport.org/docs/pytest/

Pros:

Cons:

pytest-html https://github.com/pytest-dev/pytest-html

Pros:

Cons:

built-in Pytest JUnitXML feature

Pros:

Cons:
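
For reference, JUnit XML output is built into Pytest itself, with no plugin required; the report path below is illustrative:

```shell
pytest tests/ --junitxml=report.xml   # emits a JUnit-style XML report for CI tooling
```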

ReportPortal https://reportportal.io/docs/log-data-in-reportportal/test-framework-integration/Python/pytest/

Pros:

Cons:

BrowserStack https://www.browserstack.com/docs/test-observability/quick-start/pytest

Pros:

Cons:

Calliope Pro https://docs.calliope.pro/supported-tools/pytest/

Pros:

Cons:

daniel-ji commented 1 month ago

As of now, we have decided to go with Allure + Pytest. Will re-open if the decision changes or discussion arises again.