Integration of Great Expectations with Airflow DAGs and Unit Testing [WIP]

AnthonyByansi commented 3 weeks ago

WHAT DOES THIS PR DO?

[X] This pull request integrates Great Expectations (gx) into the codebase to enhance data quality checks within Airflow DAGs.
[X] It includes the necessary structure and unit tests for the DAGs to ensure reliable workflows.
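
A minimal sketch of how a quality check could be wired into an Airflow task, assuming Airflow 2.x import paths. The function names, the DAG id, and the task id are hypothetical, and `check_required_columns` is only a stand-in: in this PR the task would delegate to Great Expectations instead. The column names are borrowed from the expectation suites described in this PR.

```python
from datetime import datetime


def check_required_columns(rows, required_columns):
    """Stand-in quality check: raise if a required column is absent or None.

    Hypothetical helper; the real task would run Great Expectations here.
    """
    for index, row in enumerate(rows):
        missing = [c for c in required_columns if row.get(c) is None]
        if missing:
            # Any uncaught exception marks the Airflow task as failed.
            raise ValueError(f"row {index} is missing values for {missing}")
    return len(rows)


def build_dag(rows):
    """Build a one-task DAG around the check.

    Airflow is imported lazily so the check above stays importable
    (and unit-testable) on machines without Airflow installed.
    """
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="air_quality_checks_example",  # hypothetical DAG id
        start_date=datetime(2024, 1, 1),
        schedule=None,  # Airflow 2.4+; older releases use schedule_interval
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="validate_measurements",  # hypothetical task id
            python_callable=check_required_columns,
            op_kwargs={
                "rows": rows,
                "required_columns": ["station_code", "timestamp"],
            },
        )
    return dag
```

Keeping the check as a plain function means the unit tests can call it directly, without scheduling a DAG run.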

WHAT ISSUES ARE RELATED TO THIS PR? AirQo's Data Quality Checks for data profiling process

Jira cards

OPS-224, OPS-225, OPS-226, OPS-227, OPS-228, OPS-229, OPS-230

Proposed file tree 👇

airqo-api/
│
├── dags/
│   ├── __init__.py
│   ├── 
│   └── 
│
├── tests/
│   ├── __init__.py
│   ├── test_dags.py
│   └── ... (other test files)
│
├── great_expectations/
│   ├── expectations/
│   │   ├── my_suite/
│   │   │   └── my_expectation_suite.json
│   │   └── ... (other expectation suites)
│   ├── great_expectations.yml
│   ├── checkpoints/
│   │   └── ... (checkpoints configuration)
│   ├── plugins/
│   │   └── ... (custom plugins, if any)
│   ├── uncommitted/
│   │   ├── config_variables.yml
│   │   └── ...
│   └── validations/
│       └── ... (our validation results)
│
├── data/
│   └── ... (sample or test data files collected from ThingSpeak)
│
├── requirements.txt
├── setup.py
├── README.md
└──
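
A minimal sketch of what `tests/test_dags.py` could start with, assuming the DAG definitions live as `.py` files directly under `dags/`. The helper name `find_broken_dag_files` is hypothetical, and the check is syntax-level only: it parses each file without importing Airflow. An import-level test built on Airflow's `DagBag` and its `import_errors` would be the stricter follow-up.

```python
import ast
from pathlib import Path


def find_broken_dag_files(dags_dir):
    """Return {file name: error message} for DAG files that fail to parse."""
    broken = {}
    for path in sorted(Path(dags_dir).glob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            broken[path.name] = f"line {exc.lineno}: {exc.msg}"
    return broken


def test_dag_files_parse():
    # pytest-style test; "dags" is the folder from the proposed tree.
    assert find_broken_dag_files("dags") == {}
```

Run with `pytest tests/` from the repository root so the relative `dags` path resolves.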

Summary by CodeRabbit

Documentation
- Added setup and usage instructions for integrating Great Expectations with the AirQo-api project in gx/README.md.
New Features
- Introduced expectation suites for air quality data to ensure completeness, referential integrity, schema validation, and uniqueness.
- Added configurations for Great Expectations, including datasources and validation stores.
Chores
- Added .gitignore to exclude unnecessary files from the repository.
- Included development dependencies in gx/dev-requirements.txt.

codecov[bot] commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 30.13%. Comparing base (214d559) to head (78cfd26). Report is 96 commits behind head on staging.

Additional details and impacted files

[![Impacted file tree graph](https://app.codecov.io/gh/airqo-platform/AirQo-api/pull/3231/graphs/tree.svg?width=650&height=150&src=pr&token=HHq3qS3cL6&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=airqo-platform)](https://app.codecov.io/gh/airqo-platform/AirQo-api/pull/3231?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=airqo-platform)

```diff
@@             Coverage Diff             @@
##           staging    #3231      +/-   ##
===========================================
- Coverage    30.30%   30.13%   -0.18%
===========================================
  Files          184      184
  Lines        24487    24634     +147
  Branches      3205     3227      +22
===========================================
+ Hits          7421     7423       +2
- Misses       16952    17096     +144
- Partials       114      115       +1
```

[see 3 files with indirect coverage changes](https://app.codecov.io/gh/airqo-platform/AirQo-api/pull/3231/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=airqo-platform)
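
As a sanity check on the report above, the percentages appear to follow coverage = hits / lines, with lines = hits + misses + partials. This relationship is inferred from the numbers in the report rather than taken from Codecov's documentation, and the helper name is hypothetical. A minimal sketch:

```python
def coverage_percent(hits, misses, partials):
    """Coverage as hits over all coverable lines, in percent."""
    lines = hits + misses + partials
    return 100.0 * hits / lines


# Figures copied from the Coverage Diff (base = staging, head = #3231).
base = coverage_percent(hits=7421, misses=16952, partials=114)  # 24487 lines
head = coverage_percent(hits=7423, misses=17096, partials=115)  # 24634 lines
print(f"base={base:.2f}% head={head:.2f}% delta={head - base:+.2f}%")
```

The computed values agree with the reported 30.30%, 30.13%, and -0.18% to within about 0.01 percentage points. The drop comes from 147 new lines of which only 2 are hit.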

coderabbitai[bot] commented 3 weeks ago

Walkthrough

The recent updates enhance the AirQo-api project with the integration of Great Expectations for data validation. These changes introduce configurations, expectation suites for air quality datasets, and a setup guide. Additionally, development dependencies are specified to ensure compatibility with Apache Airflow, Great Expectations, and other key tools.

## Changes

| File/Path | Change Summary |
|---|---|
| `gx/.gitignore` | Adds rules to ignore files and directories in the Git repository, including Python files, data, logs, Jupyter Notebooks, and OS-specific files. |
| `gx/README.md` | Provides setup and usage instructions for integrating Great Expectations with the AirQo-api project, including directory structure, getting started steps, creating expectation suites, running validations, viewing results, and an example with Airflow DAG. |
| `gx/dev-requirements.txt` | Introduces a set of dependencies for the project, including key tools like Apache Airflow, Great Expectations, pytest, SQLAlchemy, pandas-gbq, and Google Cloud BigQuery. |
| `gx/expectations/air_quality_completeness.json` | Defines an expectation suite ensuring specific columns like `station_code`, `timestamp`, `temperature`, `humidity`, and others do not contain null values. |
| `gx/expectations/air_quality_referential_integrity.json` | Defines an expectation suite for the "station_code" column to have values within a predefined set. |
| `gx/expectations/air_quality_schema_validation.json` | Sets expectations for columns in an air quality dataset, ensuring the presence and correct data types for specific columns. |
| `gx/expectations/air_quality_uniqueness_integrity.json` | Defines expectations related to air quality, checking for unique column values, compound column uniqueness, and column data types. |
| `gx/great_expectations.yml` | Introduces Great Expectations configuration settings, including datasources, stores, plugins directory, data docs sites, and anonymous usage statistics. |

## Poem

> In data's dance, precision reigns,
> Expectations set, no room for pains.
> Airflows guide the cleansing streams,
> Validations safeguard our dreams.
> Great is the hope that quality brings,
> Now AirQo's data truly sings.
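
To make the suite files in the Changes table more concrete, here is a minimal sketch of what a completeness suite could contain, plus a plain-Python stand-in for the check it describes. `expect_column_values_to_not_be_null` is a real Great Expectations expectation type. The JSON key layout follows the pre-1.0 suite format, which is an assumption about the version used here. `null_failures` and the sample rows are hypothetical and are not part of Great Expectations or of this PR.

```python
import json

# Assumed suite layout (pre-1.0 style keys); column names are taken from
# the change summary for air_quality_completeness.json.
completeness_suite = {
    "expectation_suite_name": "air_quality_completeness",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": column},
        }
        for column in ("station_code", "timestamp", "temperature", "humidity")
    ],
}


def null_failures(rows, suite):
    """List columns that hold a None in any row (illustration only, not GX)."""
    failed = []
    for expectation in suite["expectations"]:
        if expectation["expectation_type"] != "expect_column_values_to_not_be_null":
            continue
        column = expectation["kwargs"]["column"]
        if any(row.get(column) is None for row in rows):
            failed.append(column)
    return failed


# Made-up sample rows, purely for demonstration.
sample_rows = [
    {"station_code": "A1", "timestamp": "2024-05-01T00:00:00Z",
     "temperature": 24.1, "humidity": 61.0},
    {"station_code": "A1", "timestamp": "2024-05-01T01:00:00Z",
     "temperature": None, "humidity": 60.2},
]

print(json.dumps(completeness_suite, indent=2))
print(null_failures(sample_rows, completeness_suite))
```

With these rows only `temperature` is reported, because the second row leaves it unset.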


