Async clickhouse migrations

Releases whose schema migrations mutate existing data, such as dropping a column or an index, frequently fail to migrate. A failed migration leaves a dirty version, which requires manual intervention. This approach does not scale when there are hundreds of tenants, and it is a poor experience for our OSS users who don't know how to address the issue. When the migrator fails, the collector does not get upgraded, and the following happens:

Collector insertions fail because the half-finished schema is not compatible with the old collector version.
The migrator continues to fail because of the dirty version.
When we drop the schema_migrations table to clear the dirty version, every past mutation is triggered again. For example, the 10th logs migration drops the original tokenbf index and creates a new ngram index. Suppose a migration that comes after the 10th fails. Today we resolve this by dropping the schema_migrations table and running all migrations again, so the 10th migration runs again: it drops the ngram index and recreates the same ngram index. The migrator can fail at this step alone if dropping the index takes longer than 180 seconds.
The triggered mutations block even DDL queries as simple as CREATE DATABASE.
We then intervene to kill the mutations, but they are not guaranteed to be killed immediately, so ingestion is certain to be affected in the meantime.

We hit these failures internally with the recent 0.49.1 release, and the same happened with the 0.47 traces migration. Community users have been affected as well; here are the past instances:

Upgrade to v0.47.0 https://signoz-community.slack.com/archives/C01HWQ1R0BC/p1717682395257909
Upgrade to v0.49.0 https://signoz-community.slack.com/archives/C01HWQ1R0BC/p1720054405725319

SigNoz / signoz

Async clickhouse migrations #5433