apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Publish JSON schema for airflow.cfg #42850

Open ghjklw opened 1 week ago

ghjklw commented 1 week ago

Description

There is already a good structured YAML file providing metadata about all valid configuration options in airflow.cfg: airflow/config_templates/config.yml.

I think publishing the same data as a JSON schema, and eventually submitting it to https://www.schemastore.org/json/, could be very useful.
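As a rough illustration of the idea, here is a minimal sketch of turning per-option metadata into a JSON schema. It assumes a layout of section → `options` → option entries carrying `type`, `description` and `default`; the sample data, the `TYPE_MAP` mapping and the `build_schema` helper are hypothetical, and the real config.yml may need more handling than shown here.

```python
import json

# Hypothetical excerpt, shaped the way config.yml is assumed to look once loaded.
CONFIG_METADATA = {
    "core": {
        "description": "Core settings",
        "options": {
            "dags_folder": {
                "type": "string",
                "description": "Folder containing DAG files",
                "default": "/opt/airflow/dags",
            },
            "parallelism": {
                "type": "integer",
                "description": "Maximum number of concurrent task instances",
                "default": "32",
            },
        },
    },
}

# Assumed mapping from metadata type names to JSON schema type names.
TYPE_MAP = {
    "string": "string",
    "integer": "integer",
    "float": "number",
    "boolean": "boolean",
}


def build_schema(metadata):
    """Build a JSON schema with one object per section and one property per option."""
    sections = {}
    for section_name, section in metadata.items():
        properties = {}
        for option_name, option in section.get("options", {}).items():
            # Unknown or missing types fall back to "string".
            prop = {"type": TYPE_MAP.get(option.get("type"), "string")}
            if option.get("description"):
                prop["description"] = option["description"]
            properties[option_name] = prop
        sections[section_name] = {
            "type": "object",
            "description": section.get("description", ""),
            "properties": properties,
            # Reject misspelled option names.
            "additionalProperties": False,
        }
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "object",
        "properties": sections,
        # Reject misspelled section names.
        "additionalProperties": False,
    }


if __name__ == "__main__":
    print(json.dumps(build_schema(CONFIG_METADATA), indent=2))
```

Setting `additionalProperties` to `false` at both levels is what would let a validator flag typos in section and option names.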

Use case/motivation

Airflow won't complain if the configuration file contains a typo or a non-existent configuration key, which makes mistakes easy to miss. A published schema would also make it easier to catch invalid values earlier.

Related issues

No response

boring-cyborg[bot] commented 1 week ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

potiuk commented 1 week ago

The (small) problem is that the airflow.cfg file is not JSON. It's in "ini" format. I am not sure whether you can validate such a format easily. Do you know of any tools that can do it, and have you tested them with an Airflow .cfg file, @ghjklw?

Also be aware that we are planning (as part of https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-67+Multi-team+deployment+of+Airflow+components) to migrate the format from ".ini" to ".toml", which is now the de facto standard for configuration in many Python projects. Will that work with it? Are there any tools that can do it?

Maybe it should be done as part of that move, and maybe you would like to contribute to that effort and actually take part in the .toml conversion and in adding validation for the TOML file, @ghjklw?
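On the INI question above, one possible approach (a sketch, not an existing Airflow tool) is to parse the file with the standard-library `configparser` and check the resulting sections and keys against a list of known options. Because INI values are always strings, type checks have to go through a coercion step. The `KNOWN_OPTIONS` table, the sample file and the `check_ini` helper are hypothetical.

```python
import configparser

# Hypothetical allow-list: section -> option -> expected Python type.
KNOWN_OPTIONS = {
    "core": {"dags_folder": str, "parallelism": int, "load_examples": bool},
}

# Sample file with one bad value and one misspelled key.
SAMPLE_CFG = """
[core]
dags_folder = /opt/airflow/dags
parallelism = many
load_exampels = False
"""


def check_ini(text):
    """Return a list of problems found in an INI document (empty if none)."""
    # Interpolation is disabled so that literal "%" characters are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    # configparser's typed getters raise ValueError when coercion fails.
    getters = {int: parser.getint, float: parser.getfloat, bool: parser.getboolean}
    problems = []
    for section in parser.sections():
        known = KNOWN_OPTIONS.get(section)
        if known is None:
            problems.append(f"unknown section [{section}]")
            continue
        for key in parser[section]:
            if key not in known:
                problems.append(f"unknown option {section}.{key}")
                continue
            expected = known[key]
            try:
                getters.get(expected, parser.get)(section, key)
            except ValueError:
                problems.append(f"{section}.{key}: expected {expected.__name__}")
    return problems


if __name__ == "__main__":
    for problem in check_ini(SAMPLE_CFG):
        print(problem)
```

The same nested mapping could also be handed to a JSON schema validator after coercing the values, since such validators work on mappings rather than on a particular file format.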

potiuk commented 1 week ago

BTW, I know you mentioned "even better toml", but I am asking about CLI tools: something that can be used in our pre-commit hooks and can validate the schema in CI. The big problem with tooling that is IDE-only is that we cannot verify that such a schema is actually "correct"; validating config files generated automatically during testing would be a good test.

ghjklw commented 5 days ago

Hi @potiuk

My mistake for assuming airflow.cfg was toml and not ini 🙈

Regarding tooling for JSON schema with TOML, a fairly easy alternative that relies only on widely used, robust projects and the stdlib would be to read the TOML file as a dict using tomllib.load and then validate that dict using jsonschema.validate, which validates a mapping/dictionary/object rather than a string.

See also: https://python-jsonschema.readthedocs.io/en/stable/faq/#can-jsonschema-be-used-to-validate-yaml-toml-etc

An even more powerful solution, though one that might require more work depending on how the configuration is implemented today, would be to leverage pydantic-settings. We would define the configuration as Pydantic models, and creating the JSON schema would then be straightforward. Pydantic could itself handle the parsing of the TOML file through TomlConfigSettingsSource. An added benefit of that approach is that it would create an abstraction layer between the definition of the settings structure and the format they are stored in and how they are parsed, so it would then be quite easy to use YAML, JSON, and so on. pydantic-settings can also take care of settings defined through environment variables.

Last but not least, check-jsonschema has support for TOML. It can be used both as a CLI tool and as a pre-commit hook.
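For the CI use case, a thin wrapper around the CLI might look like the sketch below. It assumes `check-jsonschema` is installed and on `PATH`, and it uses only the tool's `--schemafile` option; the `build_command` and `run_check` helpers and any file names passed to them are hypothetical.

```python
import shutil
import subprocess
import sys


def build_command(schema_path, config_paths):
    """Build the check-jsonschema command line for the given files."""
    return ["check-jsonschema", "--schemafile", schema_path, *config_paths]


def run_check(schema_path, config_paths):
    """Run check-jsonschema and return its exit code (0 means all files passed)."""
    if shutil.which("check-jsonschema") is None:
        print("check-jsonschema is not installed", file=sys.stderr)
        return 2
    completed = subprocess.run(build_command(schema_path, config_paths))
    return completed.returncode
```

A test job could call `run_check` on config files generated during the test run and fail the build on a non-zero exit code, which would also exercise the schema itself.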

Unfortunately, I really have neither the bandwidth nor the experience with Airflow's development to offer my help with the implementation, but if anyone wants to work on it, I'd be happy to be a sparring partner and help with testing.

potiuk commented 4 days ago

Marked it as "good first issue". Hopefully someone will pick it up.