bryzgaloff / airflow-clickhouse-plugin

The most popular ClickHouse plugin for Airflow. 🔝 Top-1% downloads on PyPI: https://pypi.org/project/airflow-clickhouse-plugin! Based on mymarilyn/clickhouse-driver.
MIT License
144 stars 36 forks source link

Adding query_id attr #67

Closed 1ng4lipt closed 1 year ago

1ng4lipt commented 1 year ago

To identify different queries with similar query_id or initial_query_id via system.query_log table you can have an ability to set query_id by yourself using query_id param in Airflow operators you using.

It is possible by clickhouse-driver execute() method - https://clickhouse-driver.readthedocs.io/en/latest/api.html?highlight=query_id#clickhouse_driver.Client.execute_iter

This commit allows set query_id param as whatever you want, f.e. query_id = DAG_ID + TASK_ID + EXECUTION_DATE and then simply identify a query to build lineage. This can be helpful in situations when you manage equally named tables and the only way to identify their queries is to use a modfied query_id.

ne1r0n commented 1 year ago

Hi @1ng4lipt, Just in case, I would recommend you to look at log_comment, unless of course you have done this yet. It looks like it does what you need, the question is how to specify it. One of the options is just to add this to the end of your query, like SELECT .. FROM .. SETTINGS log_comment = 'my comment'

bryzgaloff commented 1 year ago

Hi @ne1r0n thank you for a suggestion: definitely looks useable for ClickHouse users. Speaking about our plugin, its sole purpose is to provide a wrapper for clickhouse-driver in a form of a Airflow plugin. So, we will not introduce any additional logic into the operator's implementation: just pass the arguments to the driver. A user may specify SETTINGS log_comment = … themselves as a part of query argument.

I would suggest generalize this PR to include all the existing execute's arguments or limit only to a full support of query_id as suggested by the author, and implement support for the remaining attributes in a separate PR.

1ng4lipt commented 1 year ago

Hi @bryzgaloff, @ne1r0n and thank you for a suggestions and participation in reviewing my PR

@bryzgaloff answers on your questions from here

  1. Works for 0.2.4+
  2. As you can see in image below normalized_query_hash, query_id and initial_query_id are identical for two equal queries and this is a normal state for ClickHouse. I would recommend to all users in case of building a lineage to add Airflow execution_date to query_id param to identify queries you need. Another suggestion to identify queries with identical query_id is to use max(event_time) statemant in WHERE clause.

Знімок екрана з 2023-08-07 18-37-28

  1. I'll add params shortly
  2. Could you please explain more about this point? Should I modify test by adding a params from ClickHouseOperator.__init__? https://github.com/bryzgaloff/airflow-clickhouse-plugin/blob/368f69a96961e9ce9cea3a0a430b20d6e5628c04/tests/unit/test_operator.py#L20
bryzgaloff commented 1 year ago

Hi @1ng4lipt thank you for an update, I plan to review and merge this week. I am busy working on v1.0.0 (major release) of the plugin but to not delay query_id delivery, I will merge your contribution first.

1ng4lipt commented 1 year ago

Hi @bryzgaloff could you please clarify a date you plan to merge my PR? I badly need this functionality on my prod env Thank you)

bryzgaloff commented 1 year ago

Hi @1ng4lipt I have decided to complete my work on v1.0.0, it has query_id functionality: #68. As a quick workaround you may git pull it and install locally. The code is finished, I need to update README and release the new version. I plan to finish it this week.

bryzgaloff commented 1 year ago

Hi @1ng4lipt I have released https://github.com/bryzgaloff/airflow-clickhouse-plugin/releases/tag/v1.0.0 which introduces query_id among the other arguments of Client.execute. Thank you for an inspiration to do the huge refactoring and patience waiting for the release 🤝

You are in the contributors list now 😉