Closed 1ng4lipt closed 1 year ago
Hi @1ng4lipt,
Just in case, I would recommend you to look at log_comment, unless of course you have done this yet.
It looks like it does what you need, the question is how to specify it.
One of the options is just to add this to the end of your query, like SELECT .. FROM .. SETTINGS log_comment = 'my comment'
Hi @ne1r0n thank you for a suggestion: definitely looks useable for ClickHouse users. Speaking about our plugin, its sole purpose is to provide a wrapper for clickhouse-driver
in a form of a Airflow plugin. So, we will not introduce any additional logic into the operator's implementation: just pass the arguments to the driver. A user may specify SETTINGS log_comment = …
themselves as a part of query
argument.
I would suggest generalize this PR to include all the existing execute
's arguments or limit only to a full support of query_id
as suggested by the author, and implement support for the remaining attributes in a separate PR.
Hi @bryzgaloff, @ne1r0n and thank you for a suggestions and participation in reviewing my PR
@bryzgaloff answers on your questions from here
0.2.4+
normalized_query_hash
, query_id
and initial_query_id
are identical for two equal queries and this is a normal state for ClickHouse. I would recommend to all users in case of building a lineage to add Airflow execution_date
to query_id
param to identify queries you need. Another suggestion to identify queries with identical query_id
is to use max(event_time)
statemant in WHERE
clause.test
by adding a params from ClickHouseOperator.__init__
?
https://github.com/bryzgaloff/airflow-clickhouse-plugin/blob/368f69a96961e9ce9cea3a0a430b20d6e5628c04/tests/unit/test_operator.py#L20Hi @1ng4lipt thank you for an update, I plan to review and merge this week. I am busy working on v1.0.0 (major release) of the plugin but to not delay query_id
delivery, I will merge your contribution first.
Hi @bryzgaloff could you please clarify a date you plan to merge my PR? I badly need this functionality on my prod env Thank you)
Hi @1ng4lipt I have decided to complete my work on v1.0.0, it has query_id
functionality: #68. As a quick workaround you may git pull
it and install locally. The code is finished, I need to update README and release the new version. I plan to finish it this week.
Hi @1ng4lipt I have released https://github.com/bryzgaloff/airflow-clickhouse-plugin/releases/tag/v1.0.0 which introduces query_id
among the other arguments of Client.execute
. Thank you for an inspiration to do the huge refactoring and patience waiting for the release 🤝
You are in the contributors list now 😉
To identify different queries with similar query_id or initial_query_id via system.query_log table you can have an ability to set query_id by yourself using query_id param in Airflow operators you using.
It is possible by clickhouse-driver execute() method - https://clickhouse-driver.readthedocs.io/en/latest/api.html?highlight=query_id#clickhouse_driver.Client.execute_iter
This commit allows set query_id param as whatever you want, f.e. query_id = DAG_ID + TASK_ID + EXECUTION_DATE and then simply identify a query to build lineage. This can be helpful in situations when you manage equally named tables and the only way to identify their queries is to use a modfied query_id.