AWS S3 Data Validation Proposal
TL;DR: I don’t think there’s a convincing case to switch away from Pytest + Allure right now; it’s probably better to keep things simple and minimize code rewrites. Definitely open to any suggestions and opinions, though!
Needs
Ease of integration: We can run the validation on an EC2 instance using the staging bucket, or locally, and then store the test results. This is pretty much already handled with our current ingestion_tools.common.fs script.
Flexibility: The data/file types and structures vary significantly across files for different objects (tomograms, tiltseries, annotations, etc.). It doesn’t seem feasible or necessary to create a full-blown schema to model the data; instead, it seems more logical to work case-by-case for different kinds of files, parse the data into an object (possibly an instance of a class), and then use it for various tests. The framework should let us pull in whatever other libraries we need to easily read files and extract the relevant data to validate.
Extensibility: We want to be able to easily add new, custom validation rules as requirements change.
Reporting: We should be able to report the errors in a concise, organized manner.
Hierarchical view for test results (failed & succeeded)
Historical data (nice to have)
Filter (nice to have)
Overall, I don’t think any Python framework cleanly handles the various file types we might receive for different objects. Regardless, we have to do the file reading / parsing ourselves, so we should write that layer from scratch so it’s as customizable and lightweight as possible, and then integrate it with a lightweight, well-supported framework so there’s no unnecessary boilerplate / overhead.
Schema-based validation libraries like Pydantic, Marshmallow, and Cerberus may be worth considering if we plan on heavy inter-file checking across different objects’ files (e.g. verifying that run data lines up with the tiltseries, alignment files, and tomograms, which are all intertwined). As of now, though, I’m leaning towards a KISS approach, where Pytest is likely sufficient to do all the checking. Then again, I’m new to all of this, so we may only find out which choice was better once we try building it. Right now, it feels unnecessary to transform s3 file data into a schema structure: there would really only be one or two levels of classes, one being the container of all the possible objects and the other being the objects themselves.
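To make the KISS approach concrete, here’s a rough sketch of what the case-by-case pattern could look like (the class, loader, and values below are hypothetical placeholders, not existing code): each kind of file gets a small parser that returns a plain object, and ordinary pytest tests assert on it.

    from dataclasses import dataclass


    @dataclass
    class TiltSeriesInfo:
        """Minimal container for the fields we want to validate."""
        voxel_spacing: float
        num_sections: int


    def load_tiltseries_info(path: str) -> TiltSeriesInfo:
        # In practice this would read the file via ingestion_tools.common.fs and a
        # format-specific library (mrcfile, zarr, etc.); hard-coded here to stay self-contained.
        return TiltSeriesInfo(voxel_spacing=13.48, num_sections=41)


    def test_voxel_spacing_is_positive():
        info = load_tiltseries_info("s3://staging-bucket/.../tiltseries.mrc")
        assert info.voxel_spacing > 0


    def test_tiltseries_has_sections():
        info = load_tiltseries_info("s3://staging-bucket/.../tiltseries.mrc")
        assert info.num_sections > 0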
In regard to reporting, Allure gives a good balance: a not-too-steep learning curve, comprehensive reports with a hierarchical view, and room for future customization. Utz has already done some work to set it up, and it matches the medium-sized scope of our s3 data validation task quite well, so I think it’s a good choice to continue with moving forward.
Testing Framework Comparison
Pytest https://docs.pytest.org/en/stable/
Pros:
Widely used for testing in Python, with strong community support and extensive documentation.
Supports fixtures and parametrized tests, making it easy to set up reusable test configurations and reuse fetched file data across tests (see the fixture sketch at the end of this section).
Flexible and allows integration with various libraries to handle different file types.
Lightweight and easy to set up, making it ideal for projects where you want to avoid unnecessary boilerplate and overhead.
Extensive plugin ecosystem for extending functionality, including reporting plugins (pytest-html, pytest-json-report, allure-pytest, etc.).
Currently what Utz uses for the existing s3 data validation.
Cons:
Primarily designed for testing code, not specifically for file validation.
May require significant customization to validate complex file formats and metadata.
Not inherently focused on data validation, so additional effort is needed to tailor it for this purpose.
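As a hedged illustration of the fixture / parametrization point in the Pros above (the fixture body and values are placeholders): the fixture fetches and parses a file once per session, and every test reuses the parsed data.

    import pytest


    @pytest.fixture(scope="session")
    def tiltseries_metadata():
        # Would fetch + parse the file once (e.g. via ingestion_tools.common.fs)
        # and share the parsed values across all tests in the session.
        return {"voxel_spacing": 13.48, "tilt_range": (-60, 60)}


    @pytest.mark.parametrize("min_angle,max_angle", [(-90, 90)])
    def test_tilt_range_within_bounds(tiltseries_metadata, min_angle, max_angle):
        low, high = tiltseries_metadata["tilt_range"]
        assert min_angle <= low <= high <= max_angle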
Pydantic https://docs.pydantic.dev/latest/
Pros:
Lightweight and integrates easily with existing Python code.
Can easily be used alongside other libraries for file reading and parsing.
Data validation and settings management using Python type annotations, providing clear and concise validation rules.
Supports complex nested data structures and custom validators (see the sketch at the end of this section).
Integrates well with popular data manipulation libraries like Pandas.
Can be customized to generate standardized reports with detailed error messages.
Cons:
Schema-based, which may not be ideal for our use case, considering that a schema might be too much overhead (it may just be one model for one type of file, and then a larger superclass model that does cross-file checking).
Smaller ecosystem and fewer built-in features compared to more comprehensive validation frameworks.
Also prioritizes serialization / deserialization, which is not that relevant for raw data validation.
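For contrast, a minimal sketch of the type-annotation-driven validation described in the Pros above (assuming the Pydantic v2 API; the model and fields are made up for illustration):

    from pydantic import BaseModel, ValidationError, field_validator


    class TomogramMetadata(BaseModel):
        voxel_spacing: float
        size_x: int
        size_y: int
        size_z: int

        @field_validator("voxel_spacing")
        @classmethod
        def spacing_must_be_positive(cls, value: float) -> float:
            if value <= 0:
                raise ValueError("voxel_spacing must be positive")
            return value


    try:
        TomogramMetadata(voxel_spacing=-1.0, size_x=512, size_y=512, size_z=256)
    except ValidationError as exc:
        print(exc)  # all field errors are collected into a single report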
Great Expectations https://docs.greatexpectations.io/docs/oss/about/
Pros:
Integrates with popular data orchestration tools like Airflow and Prefect. Validation results can be stored in multiple formats (HTML, JSON, zipped archives), facilitating easy storage and sharing.
Supports case-by-case validation and custom validation logic encapsulated within classes or functions.
Built-in support for rendering data quality insights and summaries helps in quickly identifying and addressing validation issues.
Cons:
Steep learning curve; initial setup and configuration can be complex, particularly for intricate validation scenarios. Likely too bulky for our use case, though it might be a good option in the future if the validation needs to become more robust.
Geared towards structured / tabular data, which is not our case.
While highly flexible, creating and maintaining custom expectations and validations can be time-consuming and require a deep understanding of the framework's internals.
Given the complexity of our validation rules and the need to convert raw data files into something Great Expectations integrates well with (e.g. Pandas), there would likely be additional scripting overhead to handle the various integration scenarios.
Cerberus https://docs.python-cerberus.org/index.html
Pretty similar to Pydantic (schema-based), but with a smaller ecosystem and simpler validation rules. It operates only on dictionaries, so we would have to do extra conversion work given the varying data structures we get from different files. It also doesn’t support modeling the way Pydantic does. With Pydantic already being used for dataset config validation, it makes sense to just pick Pydantic over this.
Marshmallow https://marshmallow.readthedocs.io/en/stable/
Pydantic, but with a heavier emphasis on data serialization. Probably just use Pydantic.
Pandera https://pandera.readthedocs.io/en/stable/dtype_validation.html
Schema-based data validation, but built around Pandas DataFrames. Many of our files are not trivially converted to DataFrames, so it does not suit our use case. The closest alternative that would be worth using is probably Pydantic.
Reporting Framework Comparison
Allure https://allurereport.org/docs/pytest/
Pros:
Comprehensive reports with a hierarchical view of test results (failed & succeeded).
Supports historical data and filtering, covering the nice-to-have reporting needs listed above.
Not-too-steep learning curve, with room for future customization.
Utz has already done some work to set it up for the existing s3 data validation.
Cons:
Generating / viewing the report relies on the separate Allure commandline tool (or a CI plugin), so setup is slightly heavier than pytest-html’s standalone HTML output.
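A small sketch of what the Allure integration adds on top of plain pytest tests (the test content here is made up; the decorators and attach call come from the allure-pytest plugin):

    import allure


    @allure.feature("Tiltseries validation")
    def test_tilt_angles_increase_monotonically():
        angles = [-60.0, -57.0, -54.0]  # would come from the parsed file in practice
        with allure.step("Check tilt angles increase monotonically"):
            assert all(a < b for a, b in zip(angles, angles[1:]))
        allure.attach(str(angles), name="tilt_angles",
                      attachment_type=allure.attachment_type.TEXT)


    # Run with: pytest --alluredir=allure-results, then: allure serve allure-results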
pytest-html https://github.com/pytest-dev/pytest-html
Pros:
Simple and quick to set up, with no complex configurations needed.
Generates standalone HTML reports that are easy to share and view in a web browser.
Supports customization of the report’s appearance and content through hooks and plugins.
Allows embedding of screenshots, logs, and other attachments.
Basic filtering capabilities to focus on specific test results.
Cons:
While customizable, the default reports are more basic compared to Allure’s comprehensive reports.
Limited support for historical data and advanced filtering.
Can become less manageable if our test suite grows.
built-in Pytest JUnitXML feature
Pros:
Generates XML reports that are easy to integrate and parse. Straightforward setup with minimal configuration required (just pass --junitxml=<path> to pytest).
Well-supported by many CI/CD tools, including Jenkins, GitLab, and CircleCI. Standard format for test results, making it easy to integrate with various tools and platforms.
Minimal performance overhead, suitable for us if our test suite grows larger.
Cons:
Provides less detailed and visually appealing reports compared to frameworks like Allure. The XML format is tedious to read on its own, so we would very likely have to integrate with something like Jenkins to view it, which is likely too much effort for now (possibly a long-term goal, though).
No hierarchical view or advanced filtering capabilities within the reports themselves.
ReportPortal https://reportportal.io/docs/log-data-in-reportportal/test-framework-integration/Python/pytest/
Pros:
Cons:
BrowserStack https://www.browserstack.com/docs/test-observability/quick-start/pytest
Pros:
Cons:
Calliope Pro https://docs.calliope.pro/supported-tools/pytest/
Pros:
Cons: