Any release whose schema migration mutates existing data, such as dropping a column or an index, frequently causes the migration to fail. The failure leaves the migration version marked as dirty, which requires manual intervention. This approach does not scale when there are hundreds of tenants, and it is hard on our OSS users, who may not know how to resolve the issue. When the migrator fails, the collector is not upgraded, which leads to the following problems:
- Collector insertions fail because the half-finished schema is not compatible with the old collector version.
- The migrator continues to fail because of the dirty version.
- When we drop the `schema_migrations` table to clear the dirty migration, every past mutation is triggered again. For example, the 10th logs migration drops the original `tokenbf` index and creates a new index with `ngram`. Now, say a migration that comes after the 10th one fails. The way we resolve this today is to drop the `schema_migrations` table and run all migrations again. The 10th migration then runs again, dropping the `ngram` index and recreating it with `ngram`. The migrator can fail at this very step if dropping the index takes longer than 180 seconds.
- The triggered mutations block even DDL queries as simple as `CREATE DATABASE`.
- We then intervene to kill the mutations, but they are not guaranteed to be killed immediately, so ingestion is certain to be affected.
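The replay problem described in the list above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: the `Migration` type and the `Pending` and `ReplayedMutations` helpers are hypothetical names introduced for illustration, and dropping the `schema_migrations` table is modelled simply as resetting the recorded version to 0. It does not reflect the actual migrator code.

```go
package main

import "fmt"

// Migration is a hypothetical, simplified stand-in for one versioned
// schema migration. It is not the migrator's real data structure.
type Migration struct {
	Version  int
	Name     string
	Mutating bool // true if it rewrites existing data (e.g. DROP INDEX, DROP COLUMN)
}

// Pending returns the migrations a runner would execute, given the last
// version recorded in the version table. A dropped version table is
// modelled as recorded == 0 (an assumption of this sketch).
func Pending(all []Migration, recorded int) []Migration {
	var out []Migration
	for _, m := range all {
		if m.Version > recorded {
			out = append(out, m)
		}
	}
	return out
}

// ReplayedMutations returns the mutating migrations that were already
// applied to the data (Version <= applied) but would be executed again.
func ReplayedMutations(all []Migration, applied, recorded int) []Migration {
	var out []Migration
	for _, m := range Pending(all, recorded) {
		if m.Mutating && m.Version <= applied {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	all := []Migration{
		{Version: 9, Name: "add column", Mutating: false},
		{Version: 10, Name: "replace tokenbf index with ngram index", Mutating: true},
		{Version: 11, Name: "later migration that fails", Mutating: true},
	}
	applied := 10 // versions up to 10 succeeded; 11 failed and left a dirty version

	// Version table intact: only the failed migration is retried.
	fmt.Println("replayed with version table intact:", len(ReplayedMutations(all, applied, applied)))

	// Version table dropped: migration 10 mutates the data a second time.
	for _, m := range ReplayedMutations(all, applied, 0) {
		fmt.Printf("replayed after dropping version table: v%d (%s)\n", m.Version, m.Name)
	}
}
```

In this model, keeping the recorded version means only the failed migration is retried, whereas resetting it causes the 10th migration's index drop and rebuild to run again on data that already went through it.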
Our internal failures are known from the recent 0.49.1 release, but the same happened with the 0.47 traces migration too, and here are past instances of community users being affected by this.