AWS S3 Data Validation Proposal
TL;DR: I don’t think there’s a convincing case to switch away from Pytest + Allure right now; it’s probably better to keep things simple and minimize code rewrites. Definitely open to any suggestions and opinions, though!
Needs
Ease of integration: We can run the validation on an EC2 instance using the staging bucket, or locally, and then store the test results. This is pretty much already handled with our current ingestion_tools.common.fs script.
Flexibility: The data/file types and structures vary significantly across files for different objects (tomograms, tiltseries, annotations, etc.). It doesn’t seem feasible or necessary to create a full-blown schema to model the data; instead, it seems more logical to work case-by-case for different kinds of files, parse the data into an object (possibly an instance of a class), and then use it for various tests. The framework should let us pull in whatever other libraries we need to easily read files and extract the relevant data to validate.
Extensibility: We want to be able to easily add new, custom validation rules as requirements change.
Reporting: We should be able to report the errors in a concise, organized manner.
Hierarchical view for test results (failed & succeeded)
Historical data (nice to have)
Filter (nice to have)
Overall, I don’t think any Python framework cleanly handles the various file types we might receive for different objects. Regardless, we have to do the file reading / parsing ourselves, so we should write that layer from scratch so it’s as customizable and lightweight as possible, and then integrate it with a lightweight, well-supported framework so there’s no unnecessary boilerplate / overhead.
Schema-based validation libraries like Pydantic, Marshmallow, and Cerberus may be worth considering if we plan on heavy inter-file checking across different objects’ files (e.g. verifying that run data lines up with the tiltseries, alignment files, and tomograms, which are all intertwined). As of now, though, I’m leaning towards a KISS approach, where Pytest is likely sufficient to do all the checking. Then again, I’m new to all of this, so we may only find out which choice was better once we try building it. Right now, it feels unnecessary to transform s3 file data into a schema structure: there would really only be one or two levels of classes, one being the container of all the possible objects and the other being the objects themselves.
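To make the KISS approach concrete, here’s a rough sketch of what the case-by-case pattern could look like (the class, loader, and values below are hypothetical placeholders, not existing code): each kind of file gets a small parser that returns a plain object, and ordinary pytest tests assert on it.

    from dataclasses import dataclass


    @dataclass
    class TiltSeriesInfo:
        """Minimal container for the fields we want to validate."""
        voxel_spacing: float
        num_sections: int


    def load_tiltseries_info(path: str) -> TiltSeriesInfo:
        # In practice this would read the file via ingestion_tools.common.fs and a
        # format-specific library (mrcfile, zarr, etc.); hard-coded here to stay self-contained.
        return TiltSeriesInfo(voxel_spacing=13.48, num_sections=41)


    def test_voxel_spacing_is_positive():
        info = load_tiltseries_info("s3://staging-bucket/.../tiltseries.mrc")
        assert info.voxel_spacing > 0


    def test_tiltseries_has_sections():
        info = load_tiltseries_info("s3://staging-bucket/.../tiltseries.mrc")
        assert info.num_sections > 0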
In regard to reporting, Allure gives a good balance: a not-too-steep learning curve, comprehensive reports with a hierarchical view, and room for future customization. Utz has already done some work to set it up, and it matches the medium-sized scope of our s3 data validation task quite well, so I think it’s a good choice to continue with moving forward.
Testing Framework Comparison
Pytest https://docs.pytest.org/en/stable/
Pros:
Widely used for testing in Python, with strong community support and extensive documentation.
Supports fixtures and parametrized tests, making it easy to set up reusable test configurations and reuse fetched file data across tests (see the fixture sketch at the end of this section).
Flexible and allows integration with various libraries to handle different file types.
Lightweight and easy to set up, making it ideal for projects where you want to avoid unnecessary boilerplate and overhead.
Extensive plugin ecosystem for extending functionality, including reporting plugins (pytest-html, pytest-json-report, allure-pytest, etc.).
Currently what Utz uses for the existing s3 data validation.
Cons:
Primarily designed for testing code, not specifically for file validation.
May require significant customization to validate complex file formats and metadata.
Not inherently focused on data validation, so additional effort is needed to tailor it for this purpose.
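As a hedged illustration of the fixture / parametrization point in the Pros above (the fixture body and values are placeholders): the fixture fetches and parses a file once per session, and every test reuses the parsed data.

    import pytest


    @pytest.fixture(scope="session")
    def tiltseries_metadata():
        # Would fetch + parse the file once (e.g. via ingestion_tools.common.fs)
        # and share the parsed values across all tests in the session.
        return {"voxel_spacing": 13.48, "tilt_range": (-60, 60)}


    @pytest.mark.parametrize("min_angle,max_angle", [(-90, 90)])
    def test_tilt_range_within_bounds(tiltseries_metadata, min_angle, max_angle):
        low, high = tiltseries_metadata["tilt_range"]
        assert min_angle <= low <= high <= max_angle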
Pydantic https://docs.pydantic.dev/latest/
Pros:
Lightweight and integrates easily with existing Python code.
Can easily be used alongside other libraries for file reading and parsing.
Data validation and settings management using Python type annotations, providing clear and concise validation rules.
Supports complex nested data structures and custom validators (see the sketch at the end of this section).
Integrates well with popular data manipulation libraries like Pandas.
Can be customized to generate standardized reports with detailed error messages.
Cons:
Schema-based, which may not be ideal for our use case, considering that a schema might be too much overhead (it may just be one model for one type of file, and then a larger superclass model that does cross-file checking).
Smaller ecosystem and fewer built-in features compared to more comprehensive validation frameworks.
Also prioritizes serialization / deserialization, which is not that relevant for raw data validation.
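For contrast, a minimal sketch of the type-annotation-driven validation described in the Pros above (assuming the Pydantic v2 API; the model and fields are made up for illustration):

    from pydantic import BaseModel, ValidationError, field_validator


    class TomogramMetadata(BaseModel):
        voxel_spacing: float
        size_x: int
        size_y: int
        size_z: int

        @field_validator("voxel_spacing")
        @classmethod
        def spacing_must_be_positive(cls, value: float) -> float:
            if value <= 0:
                raise ValueError("voxel_spacing must be positive")
            return value


    try:
        TomogramMetadata(voxel_spacing=-1.0, size_x=512, size_y=512, size_z=256)
    except ValidationError as exc:
        print(exc)  # all field errors are collected into a single report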
Great Expectations https://docs.greatexpectations.io/docs/oss/about/
Pros:
Integrates with popular data orchestration tools like Airflow and Prefect. Validation results can be stored in multiple formats (HTML, JSON, zipped archives), facilitating easy storage and sharing.
Supports case-by-case validation and custom validation logic encapsulated within classes or functions.
Built-in support for rendering data quality insights and summaries helps in quickly identifying and addressing validation issues.
Cons:
Steep learning curve; initial setup and configuration can be complex, particularly for intricate validation scenarios. Likely too bulky for our use case, though it might be a good option in the future if the validation needs to become more robust.
Geared towards structured / tabular data, which is not our case.
While highly flexible, creating and maintaining custom expectations and validations can be time-consuming and require a deep understanding of the framework's internals.
Given the complexity of our validation rules and the need to convert raw data files into something Great Expectations integrates well with (e.g. Pandas), there would likely be additional scripting overhead to handle the various integration scenarios.
Cerberus https://docs.python-cerberus.org/index.html
Pretty similar to Pydantic (schema-based), but with a smaller ecosystem and simpler validation rules. It operates only on dictionaries, so we would have to do extra conversion work given the varying data structures we get from different files. It also doesn’t support modeling the way Pydantic does. With Pydantic already being used for dataset config validation, it makes sense to just pick Pydantic over this.
Marshmallow https://marshmallow.readthedocs.io/en/stable/
Pydantic, but with a heavier emphasis on data serialization. Probably just use Pydantic.
Pandera https://pandera.readthedocs.io/en/stable/dtype_validation.html
Schema-based data validation, but built around Pandas DataFrames. Many of our files are not trivially converted to DataFrames, so it does not suit our use case. The closest alternative that would be worth using is probably Pydantic.
Reporting Framework Comparison
Allure https://allurereport.org/docs/pytest/
Pros:
Comprehensive reports with a hierarchical view of test results (failed & succeeded).
Supports historical data and filtering, covering the nice-to-have reporting needs listed above.
Not-too-steep learning curve, with room for future customization.
Utz has already done some work to set it up for the existing s3 data validation.
Cons:
Generating / viewing the report relies on the separate Allure commandline tool (or a CI plugin), so setup is slightly heavier than pytest-html’s standalone HTML output.
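A small sketch of what the Allure integration adds on top of plain pytest tests (the test content here is made up; the decorators and attach call come from the allure-pytest plugin):

    import allure


    @allure.feature("Tiltseries validation")
    def test_tilt_angles_increase_monotonically():
        angles = [-60.0, -57.0, -54.0]  # would come from the parsed file in practice
        with allure.step("Check tilt angles increase monotonically"):
            assert all(a < b for a, b in zip(angles, angles[1:]))
        allure.attach(str(angles), name="tilt_angles",
                      attachment_type=allure.attachment_type.TEXT)


    # Run with: pytest --alluredir=allure-results, then: allure serve allure-results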
pytest-html https://github.com/pytest-dev/pytest-html
Pros:
Simple and quick to set up, with no complex configurations needed.
Generates standalone HTML reports that are easy to share and view in a web browser.
Supports customization of the report’s appearance and content through hooks and plugins.
Allows embedding of screenshots, logs, and other attachments.
Basic filtering capabilities to focus on specific test results.
Cons:
While customizable, the default reports are more basic compared to Allure’s comprehensive reports.
Limited support for historical data and advanced filtering.
Can become less manageable if our test suite grows.
built-in Pytest JUnitXML feature
Pros:
Generates XML reports that are easy to integrate and parse. Straightforward setup with minimal configuration required (just pass --junitxml=<path> to pytest).
Well-supported by many CI/CD tools, including Jenkins, GitLab, and CircleCI. Standard format for test results, making it easy to integrate with various tools and platforms.
Minimal performance overhead, suitable for us if our test suite grows larger.
Cons:
Provides less detailed and visually appealing reports compared to frameworks like Allure. The XML format is tedious to read on its own, so we would very likely have to integrate with something like Jenkins to view it, which is likely too much effort for now (possibly a long-term goal, though).
No hierarchical view or advanced filtering capabilities within the reports themselves.
ReportPortal https://reportportal.io/docs/log-data-in-reportportal/test-framework-integration/Python/pytest/
Pros:
Cons:
BrowserStack https://www.browserstack.com/docs/test-observability/quick-start/pytest
Pros:
Cons:
Calliope Pro https://docs.calliope.pro/supported-tools/pytest/
Pros:
Cons: