astronomer / airflow-provider-great-expectations

Great Expectations Airflow operator
http://greatexpectations.io
Apache License 2.0

Parallel GreatExpectationsOperator tasks corrupt great_expectations.yml #136

Open antelmoa opened 6 months ago

antelmoa commented 6 months ago

I have several GreatExpectationsOperator tasks running concurrently. I often get an error saying that the great_expectations.yml file could not be parsed. Inspecting the file, I noticed that it usually ends with a stray fragment such as ult: false, which matches the parsing error shown on the task's Airflow logs page.

I suspect a race condition occurs when the great_expectations.yml file is regenerated to update the datasource section for a job. Since I changed the GreatExpectationsOperator tasks to run one at a time, the great_expectations.yml file has not been corrupted again.
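Running the tasks one at a time amounts to serializing access to the file. A minimal sketch of the same idea using an advisory file lock, assuming a POSIX system and that all tasks run on the same host (as in a local environment). The helper name `exclusive_file_lock` and the lock path are hypothetical and are not part of the provider or of Great Expectations:

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_file_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    # Hypothetical helper: a separate lock file is used so the config itself is untouched.
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other process holds the lock
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


# Hypothetical usage: wrap whatever code loads the data context and runs validation.
# with exclusive_file_lock("/tmp/great_expectations.lock"):
#     run_validation()
```

This only helps if every task that touches great_expectations.yml takes the same lock, and it does not coordinate workers on different machines.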

aaguilarguero commented 6 months ago

Hi, are the maintainers aware of this issue? Is there a workaround for it, for example preventing great_expectations.yml from being overwritten when a GreatExpectationsOperator task runs in Airflow?

pankajastro commented 6 months ago

Hey @antelmoa I haven't encountered the issue you mentioned. Could you please share the error stack trace? Additionally, it would be great if you could provide some details on how you're using it. This information would be helpful for debugging. Thank you!

antelmoa commented 6 months ago

Hi, sorry for the wait on this.

> Could you please share the error stack trace? Additionally, it would be great if you could provide some details on how you're using it.

This is in my local environment. I have a DAG that runs three GreatExpectationsOperator tasks at the same time. Please see my attached image (Screenshot 2024-03-22 at 4 50 25 PM).

In the runs on the left, two runs were successful, but the latest run failed. The failed tasks had the following error:

[2024-03-22, 20:49:57 UTC] {{base.py:1716}} ERROR - Error while processing DataContextConfig: n_result
[2024-03-22, 20:49:57 UTC] {{base.py:145}} ERROR - Encountered errors during loading config.  See ValidationError for more details.
[2024-03-22, 20:49:57 UTC] {{taskinstance.py:2728}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/venv/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 444, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/venv/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/venv/lib/python3.9/site-packages/great_expectations_provider/operators/great_expectations.py", line 586, in execute
    self.data_context = ge.data_context.FileDataContext(
  File "/venv/lib/python3.9/site-packages/great_expectations/data_context/data_context/file_data_context.py", line 66, in __init__
    self._project_config = self._init_project_config(project_config)
  File "/venv/lib/python3.9/site-packages/great_expectations/data_context/data_context/file_data_context.py", line 111, in _init_project_config
    project_config = FileDataContext._load_file_backed_project_config(
  File "/venv/lib/python3.9/site-packages/great_expectations/data_context/data_context/file_data_context.py", line 213, in _load_file_backed_project_config
    return DataContextConfig.from_commented_map(
  File "/venv/lib/python3.9/site-packages/great_expectations/data_context/types/base.py", line 139, in from_commented_map
    config: Union[dict, BYC] = schema_instance.load(commented_map)
  File "/venv/lib/python3.9/site-packages/marshmallow/schema.py", line 722, in load
    return self._do_load(
  File "/venv/lib/python3.9/site-packages/marshmallow/schema.py", line 908, in _do_load
    self.handle_error(exc, data, many=many, partial=partial)
  File "/venv/lib/python3.9/site-packages/great_expectations/data_context/types/base.py", line 1717, in handle_error
    raise gx_exceptions.InvalidDataContextConfigError(
great_expectations.exceptions.exceptions.InvalidDataContextConfigError: Error while processing DataContextConfig: n_result

I went to the great_expectations.yml file and these were the contents at the end of the file:

anonymous_usage_statistics:
  data_context_id: 23027672-4b48-4dc5-9b58-59d10d75164a
  enabled: false
notebooks:
include_rendered_content:
  globally: false
  expectation_suite: false
  expectation_validation_result: false
n_result: false

Note the n_result: false at the end, which is what the stack trace reports as the problem. I suspect that, because this file gets updated dynamically, the concurrent tasks eventually corrupt the configuration.
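The stray fragment looks like the tail of expectation_validation_result: false, which is what would be left behind if a longer version of the file were partly overwritten by a shorter one. A small sketch of that mechanism, not taken from the provider's code: two writers open the same file, the longer content lands first, and the shorter content then overwrites only the beginning. The contents of `SHORT` and `LONG` are contrived so that the leftover matches the fragment above.

```python
import os
import tempfile

# Contrived stand-ins for two versions of the config. The first line of LONG is
# 16 bytes, so the leftover tail is the last 16 bytes of the file.
SHORT = (
    b"include_rendered_content:\n"
    b"  globally: false\n"
    b"  expectation_suite: false\n"
    b"  expectation_validation_result: false\n"
)
LONG = b"datasources: {}\n" + SHORT


def simulate_interleaved_writes(path):
    """Two writers open the same file for writing, then write in turn."""
    writer_a = open(path, "wb")  # truncates the file
    writer_b = open(path, "wb")  # truncates again; both writers start at offset 0
    try:
        writer_a.write(LONG)
        writer_a.flush()  # the longer content lands first
        writer_b.write(SHORT)
        writer_b.flush()  # overwrites only the first len(SHORT) bytes
    finally:
        writer_a.close()
        writer_b.close()
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        result = simulate_interleaved_writes(os.path.join(tmp, "great_expectations.yml"))
        print(result.decode())
```

The resulting file is the shorter content followed by a stray n_result: false line. Whether this is the exact sequence that happens inside the operator is an assumption; it only shows that unsynchronized writers can produce this kind of trailing fragment.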