getsentry / snuba


Table columns structure in ZooKeeper is different from local table structure / Missing columns .. while processing query - Errors when running - snuba migrations add-node #6191

Closed chipzzz closed 1 month ago

chipzzz commented 2 months ago

When running add-node operations to create tables on additional ClickHouse nodes (shards), I get a local table vs. ZooKeeper table mismatch where ZooKeeper has additional fields, e.g. timeseries_id. In another case the schema has missing columns, e.g. _error_ids_hashed.

The question I have is: when running add-node, do I need to run it in local mode if I run Snuba in distributed mode? I figured I'd run both since both _local and _dist tables are present on the original node, but do I need local tables on additional nodes? How does table replication happen across additional nodes?

Ref: https://github.com/getsentry/snuba/blob/master/MIGRATIONS.md

for i in ...; do  # NOTE: outer loop header was truncated in the paste; $i indexes the shard-1 replicas
  for storage in cdc discover events events_ro metrics metrics_summaries migrations outcomes querylog sessions transactions profiles functions replays generic_metrics_sets generic_metrics_distributions search_issues generic_metrics_counters spans group_attributes generic_metrics_gauges profile_chunks; do
    snuba migrations add-node --type local --storage-set $storage --host-name clickhouse-shard1-$i.clickhouse-headless.sentry-dev.svc.cluster.local --port 9000 --database default
    snuba migrations add-node --type dist --storage-set $storage --host-name clickhouse-shard1-$i.clickhouse-headless.sentry-dev.svc.cluster.local --port 9000 --database default
  done
done
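
For reference, this is how I check what actually landed on a given node afterwards (assumes clickhouse-client is available in the pod and can reach the host; host/port are the ones from my cluster, adjust as needed):

    # List the tables present on a specific node.
    clickhouse-client \
      --host clickhouse-shard1-1.clickhouse-headless.sentry-dev.svc.cluster.local \
      --port 9000 \
      --query "SELECT name, engine FROM system.tables WHERE database = 'default' ORDER BY name"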

Error 1 - ClickHouse: ZooKeeper mismatch - happens for multiple tables

        {"module": "snuba.migrations.operations", "event": "Executing CREATE TABLE IF NOT EXISTS metrics_raw_v2_local (use_case_id LowCardinality(String), org_id UInt64, project_id UInt64, metric_id UInt64, timestamp DateTime, tags Nested(key UInt64, value UInt64), metric_type LowCardinality(String), set_values Array(UInt64), count_value Float64, distribution_values Array(Float64), materialization_version UInt8, retention_days UInt16, partition UInt16, offset UInt64) ENGINE ReplicatedMergeTree('/clickhouse/tables/metrics/{shard}/default/metrics_raw_v2_local', '{replica}') ORDER BY (use_case_id, metric_type, org_id, project_id, metric_id, timestamp) PARTITION BY (toStartOfInterval(timestamp, INTERVAL 3 day)) TTL timestamp + toIntervalDay(7);", "severity": "info", "timestamp": "2024-08-08T20:34:47.292434Z"}
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/clickhouse/native.py", line 200, in execute
    result_data = query_execute()
                  ^^^^^^^^^^^^^^^
  File "/usr/src/snuba/snuba/clickhouse/native.py", line 183, in query_execute
    return conn.execute(  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 373, in execute
    rv = self.process_ordinary_query(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 571, in process_ordinary_query
    return self.receive_result(with_column_types=with_column_types,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 204, in receive_result
    return result.get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/result.py", line 50, in get_result
    for packet in self.packet_generator:
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 220, in packet_generator
    packet = self.receive_packet()
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 237, in receive_packet
    raise packet.exception
clickhouse_driver.errors.ServerException: Code: 122.
DB::Exception: Table columns structure in ZooKeeper is different from local table structure. Local columns:
columns format version: 1
15 columns:
`use_case_id` LowCardinality(String)
`org_id` UInt64
`project_id` UInt64
`metric_id` UInt64
`timestamp` DateTime
`tags.key` Array(UInt64)
`tags.value` Array(UInt64)
`metric_type` LowCardinality(String)
`set_values` Array(UInt64)
`count_value` Float64
`distribution_values` Array(Float64)
`materialization_version` UInt8
`retention_days` UInt16
`partition` UInt16
`offset` UInt64

Zookeeper columns:
columns format version: 1
16 columns:
`use_case_id` LowCardinality(String)
`org_id` UInt64
`project_id` UInt64
`metric_id` UInt64
`timestamp` DateTime
`tags.key` Array(UInt64)
`tags.value` Array(UInt64)
`metric_type` LowCardinality(String)
`set_values` Array(UInt64)
`count_value` Float64
`distribution_values` Array(Float64)
`materialization_version` UInt8
`retention_days` UInt16
`partition` UInt16
`offset` UInt64
`timeseries_id` UInt32
. Stack trace:

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0xe3b83d5 in /opt/bitnami/clickhouse/bin/clickhouse
1. ? @ 0x8b3b13e in /opt/bitnami/clickhouse/bin/clickhouse
2. DB::StorageReplicatedMergeTree::checkTableStructure(String const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&) @ 0x1436490c in /opt/bitnami/clickhouse/bin/clickhouse
3. DB::StorageReplicatedMergeTree::StorageReplicatedMergeTree(String const&, String const&, bool, DB::StorageID const&, String const&, DB::StorageInMemoryMetadata const&, std::shared_ptr<DB::Context>, String const&, DB::MergeTreeData::MergingParams const&, std::unique_ptr<DB::MergeTreeSettings, std::default_delete<DB::MergeTreeSettings>>, bool, DB::RenamingRestrictions) @ 0x1435a8d5 in /opt/bitnami/clickhouse/bin/clickhouse
4. ? @ 0x14b44142 in /opt/bitnami/clickhouse/bin/clickhouse
5. DB::StorageFactory::get(DB::ASTCreateQuery const&, String const&, std::shared_ptr<DB::Context>, std::shared_ptr<DB::Context>, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, bool) const @ 0x1429205b in /opt/bitnami/clickhouse/bin/clickhouse
6. DB::InterpreterCreateQuery::doCreateTable(DB::ASTCreateQuery&, DB::InterpreterCreateQuery::TableProperties const&, std::unique_ptr<DB::DDLGuard, std::default_delete<DB::DDLGuard>>&) @ 0x139777d1 in /opt/bitnami/clickhouse/bin/clickhouse
7. DB::InterpreterCreateQuery::createTable(DB::ASTCreateQuery&) @ 0x13970542 in /opt/bitnami/clickhouse/bin/clickhouse
8. DB::InterpreterCreateQuery::execute() @ 0x1397ccb4 in /opt/bitnami/clickhouse/bin/clickhouse
9. ? @ 0x13efde87 in /opt/bitnami/clickhouse/bin/clickhouse
10. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x13efb40d in /opt/bitnami/clickhouse/bin/clickhouse
11. DB::TCPHandler::runImpl() @ 0x14cd1c11 in /opt/bitnami/clickhouse/bin/clickhouse
12. DB::TCPHandler::run() @ 0x14ce7719 in /opt/bitnami/clickhouse/bin/clickhouse
13. Poco::Net::TCPServerConnection::start() @ 0x17c4ef54 in /opt/bitnami/clickhouse/bin/clickhouse
14. Poco::Net::TCPServerDispatcher::run() @ 0x17c5017b in /opt/bitnami/clickhouse/bin/clickhouse
15. Poco::PooledThread::run() @ 0x17dcd527 in /opt/bitnami/clickhouse/bin/clickhouse
16. Poco::ThreadImpl::runnableEntry(void*) @ 0x17dcaf5d in /opt/bitnami/clickhouse/bin/clickhouse
17. start_thread @ 0x7ea7 in /lib/x86_64-linux-gnu/libpthread-2.31.so
18. __clone @ 0xfca2f in /lib/x86_64-linux-gnu/libc-2.31.so
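
To see exactly what ClickHouse has recorded in ZooKeeper for this table (the extra timeseries_id column comes from a later migration, 0035_metrics_raw_timeseries_id, linked below), the replicated table metadata can be read from the system.zookeeper table on a node that is already in the cluster. The path below is the one from the CREATE statement with {shard} filled in as 1; substitute your shard and an actual replica host:

    # Read the column list stored in ZooKeeper for metrics_raw_v2_local.
    # Path assumes shard "1"; substitute your node's {shard} macro value.
    clickhouse-client --host <existing-replica> --port 9000 --query "
      SELECT value FROM system.zookeeper
      WHERE path = '/clickhouse/tables/metrics/1/default/metrics_raw_v2_local'
        AND name = 'columns'"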

Error 2 - Missing required schema fields - happens for multiple tables

{"module": "snuba.migrations.operations", "event": "Executing CREATE TABLE IF NOT EXISTS replays_local (replay_id UUID, event_hash UUID, segment_id Nullable(UInt16), trace_ids Array(UUID), _trace_ids_hashed Array(UInt64) MATERIALIZED arrayMap(t -> cityHash64(t), trace_ids), title String, project_id UInt64, timestamp DateTime, platform LowCardinality(String), environment LowCardinality(Nullable(String)), release Nullable(String), dist Nullable(String), ip_address_v4 Nullable(IPv4), ip_address_v6 Nullable(IPv6), user String, user_id Nullable(String), user_name Nullable(String), user_email Nullable(String), sdk_name String, sdk_version String, tags Nested(key String, value String), retention_days UInt16, partition UInt16, offset UInt64) ENGINE ReplicatedReplacingMergeTree('/clickhouse/tables/replays/{shard}/default/replays_local', '{replica}') ORDER BY (project_id, toStartOfDay(timestamp), cityHash64(replay_id), event_hash) PARTITION BY (retention_days, toMonday(timestamp)) TTL timestamp + toIntervalDay(retention_days) SETTINGS index_granularity=8192;", "severity": "info", "timestamp": "2024-08-08T20:35:51.701174Z"}
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/clickhouse/native.py", line 200, in execute
    result_data = query_execute()
                  ^^^^^^^^^^^^^^^
  File "/usr/src/snuba/snuba/clickhouse/native.py", line 183, in query_execute
    return conn.execute(  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 373, in execute
    rv = self.process_ordinary_query(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 571, in process_ordinary_query
    return self.receive_result(with_column_types=with_column_types,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 204, in receive_result
    return result.get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/result.py", line 50, in get_result
    for packet in self.packet_generator:
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 220, in packet_generator
    packet = self.receive_packet()
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 237, in receive_packet
    raise packet.exception
clickhouse_driver.errors.ServerException: Code: 47.
DB::Exception: Missing columns: '_error_ids_hashed' while processing query: '_error_ids_hashed', required columns: '_error_ids_hashed' '_error_ids_hashed'. Stack trace:

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0xe3b83d5 in /opt/bitnami/clickhouse/bin/clickhouse
1. ? @ 0x8b5f18c in /opt/bitnami/clickhouse/bin/clickhouse
2. DB::TreeRewriterResult::collectUsedColumns(std::shared_ptr<DB::IAST> const&, bool, bool) @ 0x13e5eb79 in /opt/bitnami/clickhouse/bin/clickhouse
3. DB::TreeRewriter::analyze(std::shared_ptr<DB::IAST>&, DB::NamesAndTypesList const&, std::shared_ptr<DB::IStorage const>, std::shared_ptr<DB::StorageSnapshot> const&, bool, bool, bool, bool) const @ 0x13e67b98 in /opt/bitnami/clickhouse/bin/clickhouse
4. DB::IndexDescription::getIndexFromAST(std::shared_ptr<DB::IAST> const&, DB::ColumnsDescription const&, std::shared_ptr<DB::Context const>) @ 0x14234414 in /opt/bitnami/clickhouse/bin/clickhouse
5. DB::IndicesDescription::parse(String const&, DB::ColumnsDescription const&, std::shared_ptr<DB::Context const>) @ 0x142354c1 in /opt/bitnami/clickhouse/bin/clickhouse
6. DB::ReplicatedMergeTreeTableMetadata::checkEquals(DB::ReplicatedMergeTreeTableMetadata const&, DB::ColumnsDescription const&, std::shared_ptr<DB::Context const>) const @ 0x14b329ca in /opt/bitnami/clickhouse/bin/clickhouse
7. DB::StorageReplicatedMergeTree::checkTableStructure(String const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&) @ 0x143643a9 in /opt/bitnami/clickhouse/bin/clickhouse
8. DB::StorageReplicatedMergeTree::StorageReplicatedMergeTree(String const&, String const&, bool, DB::StorageID const&, String const&, DB::StorageInMemoryMetadata const&, std::shared_ptr<DB::Context>, String const&, DB::MergeTreeData::MergingParams const&, std::unique_ptr<DB::MergeTreeSettings, std::default_delete<DB::MergeTreeSettings>>, bool, DB::RenamingRestrictions) @ 0x1435a8d5 in /opt/bitnami/clickhouse/bin/clickhouse
9. ? @ 0x14b44142 in /opt/bitnami/clickhouse/bin/clickhouse
10. DB::StorageFactory::get(DB::ASTCreateQuery const&, String const&, std::shared_ptr<DB::Context>, std::shared_ptr<DB::Context>, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, bool) const @ 0x1429205b in /opt/bitnami/clickhouse/bin/clickhouse
11. DB::InterpreterCreateQuery::doCreateTable(DB::ASTCreateQuery&, DB::InterpreterCreateQuery::TableProperties const&, std::unique_ptr<DB::DDLGuard, std::default_delete<DB::DDLGuard>>&) @ 0x139777d1 in /opt/bitnami/clickhouse/bin/clickhouse
12. DB::InterpreterCreateQuery::createTable(DB::ASTCreateQuery&) @ 0x13970542 in /opt/bitnami/clickhouse/bin/clickhouse
13. DB::InterpreterCreateQuery::execute() @ 0x1397ccb4 in /opt/bitnami/clickhouse/bin/clickhouse
14. ? @ 0x13efde87 in /opt/bitnami/clickhouse/bin/clickhouse
15. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x13efb40d in /opt/bitnami/clickhouse/bin/clickhouse
16. DB::TCPHandler::runImpl() @ 0x14cd1c11 in /opt/bitnami/clickhouse/bin/clickhouse
17. DB::TCPHandler::run() @ 0x14ce7719 in /opt/bitnami/clickhouse/bin/clickhouse
18. Poco::Net::TCPServerConnection::start() @ 0x17c4ef54 in /opt/bitnami/clickhouse/bin/clickhouse
19. Poco::Net::TCPServerDispatcher::run() @ 0x17c5017b in /opt/bitnami/clickhouse/bin/clickhouse
20. Poco::PooledThread::run() @ 0x17dcd527 in /opt/bitnami/clickhouse/bin/clickhouse
21. Poco::ThreadImpl::runnableEntry(void*) @ 0x17dcaf5d in /opt/bitnami/clickhouse/bin/clickhouse
22. start_thread @ 0x7ea7 in /lib/x86_64-linux-gnu/libpthread-2.31.so
23. __clone @ 0xfca2f in /lib/x86_64-linux-gnu/libc-2.31.so

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/snuba/snuba/cli/migrations.py", line 372, in add_node
    Runner.add_node(
  File "/usr/src/snuba/snuba/migrations/runner.py", line 548, in add_node
    op.execute_new_node(storage_sets, node_type, clickhouse)
  File "/usr/src/snuba/snuba/migrations/operations.py", line 595, in execute_new_node
    clickhouse.execute(sql)
  File "/usr/src/snuba/snuba/clickhouse/native.py", line 283, in execute
    raise ClickhouseError(e.message, code=e.code) from e
snuba.clickhouse.errors.ClickhouseError: DB::Exception: Missing columns: '_error_ids_hashed' while processing query: '_error_ids_hashed', required columns: '_error_ids_hashed' '_error_ids_hashed'. Stack trace:

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0xe3b83d5 in /opt/bitnami/clickhouse/bin/clickhouse
1. ? @ 0x8b5f18c in /opt/bitnami/clickhouse/bin/clickhouse
2. DB::TreeRewriterResult::collectUsedColumns(std::shared_ptr<DB::IAST> const&, bool, bool) @ 0x13e5eb79 in /opt/bitnami/clickhouse/bin/clickhouse
3. DB::TreeRewriter::analyze(std::shared_ptr<DB::IAST>&, DB::NamesAndTypesList const&, std::shared_ptr<DB::IStorage const>, std::shared_ptr<DB::StorageSnapshot> const&, bool, bool, bool, bool) const @ 0x13e67b98 in /opt/bitnami/clickhouse/bin/clickhouse
4. DB::IndexDescription::getIndexFromAST(std::shared_ptr<DB::IAST> const&, DB::ColumnsDescription const&, std::shared_ptr<DB::Context const>) @ 0x14234414 in /opt/bitnami/clickhouse/bin/clickhouse
5. DB::IndicesDescription::parse(String const&, DB::ColumnsDescription const&, std::shared_ptr<DB::Context const>) @ 0x142354c1 in /opt/bitnami/clickhouse/bin/clickhouse
6. DB::ReplicatedMergeTreeTableMetadata::checkEquals(DB::ReplicatedMergeTreeTableMetadata const&, DB::ColumnsDescription const&, std::shared_ptr<DB::Context const>) const @ 0x14b329ca in /opt/bitnami/clickhouse/bin/clickhouse
7. DB::StorageReplicatedMergeTree::checkTableStructure(String const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&) @ 0x143643a9 in /opt/bitnami/clickhouse/bin/clickhouse
8. DB::StorageReplicatedMergeTree::StorageReplicatedMergeTree(String const&, String const&, bool, DB::StorageID const&, String const&, DB::StorageInMemoryMetadata const&, std::shared_ptr<DB::Context>, String const&, DB::MergeTreeData::MergingParams const&, std::unique_ptr<DB::MergeTreeSettings, std::default_delete<DB::MergeTreeSettings>>, bool, DB::RenamingRestrictions) @ 0x1435a8d5 in /opt/bitnami/clickhouse/bin/clickhouse
9. ? @ 0x14b44142 in /opt/bitnami/clickhouse/bin/clickhouse
10. DB::StorageFactory::get(DB::ASTCreateQuery const&, String const&, std::shared_ptr<DB::Context>, std::shared_ptr<DB::Context>, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, bool) const @ 0x1429205b in /opt/bitnami/clickhouse/bin/clickhouse
11. DB::InterpreterCreateQuery::doCreateTable(DB::ASTCreateQuery&, DB::InterpreterCreateQuery::TableProperties const&, std::unique_ptr<DB::DDLGuard, std::default_delete<DB::DDLGuard>>&) @ 0x139777d1 in /opt/bitnami/clickhouse/bin/clickhouse
12. DB::InterpreterCreateQuery::createTable(DB::ASTCreateQuery&) @ 0x13970542 in /opt/bitnami/clickhouse/bin/clickhouse
13. DB::InterpreterCreateQuery::execute() @ 0x1397ccb4 in /opt/bitnami/clickhouse/bin/clickhouse
14. ? @ 0x13efde87 in /opt/bitnami/clickhouse/bin/clickhouse
15. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x13efb40d in /opt/bitnami/clickhouse/bin/clickhouse
16. DB::TCPHandler::runImpl() @ 0x14cd1c11 in /opt/bitnami/clickhouse/bin/clickhouse
17. DB::TCPHandler::run() @ 0x14ce7719 in /opt/bitnami/clickhouse/bin/clickhouse
18. Poco::Net::TCPServerConnection::start() @ 0x17c4ef54 in /opt/bitnami/clickhouse/bin/clickhouse
19. Poco::Net::TCPServerDispatcher::run() @ 0x17c5017b in /opt/bitnami/clickhouse/bin/clickhouse
20. Poco::PooledThread::run() @ 0x17dcd527 in /opt/bitnami/clickhouse/bin/clickhouse
21. Poco::ThreadImpl::runnableEntry(void*) @ 0x17dcaf5d in /opt/bitnami/clickhouse/bin/clickhouse
22. start_thread @ 0x7ea7 in /lib/x86_64-linux-gnu/libpthread-2.31.so
23. __clone @ 0xfca2f in /lib/x86_64-linux-gnu/libc-2.31.so

2024-08-08 20:35:53,327 Initializing Snuba...
2024-08-08 20:35:55,100 Snuba initialization took 1.7672806810587645s
{"module": "snuba.migrations.operations", "event": "Executing CREATE TABLE IF NOT EXISTS replays_dist (replay_id UUID, event_hash UUID, segment_id Nullable(UInt16), trace_ids Array(UUID), _trace_ids_hashed Array(UInt64) MATERIALIZED arrayMap(t -> cityHash64(t), trace_ids), title String, project_id UInt64, timestamp DateTime, platform LowCardinality(String), environment LowCardinality(Nullable(String)), release Nullable(String), dist Nullable(String), ip_address_v4 Nullable(IPv4), ip_address_v6 Nullable(IPv6), user String, user_id Nullable(String), user_name Nullable(String), user_email Nullable(String), sdk_name String, sdk_version String, tags Nested(key String, value String), retention_days UInt16, partition UInt16, offset UInt64) ENGINE Distributed(`default`, default, replays_local, cityHash64(replay_id));", "severity": "info", "timestamp": "2024-08-08T20:35:55.290359Z"}
{"module": "snuba.migrations.operations", "event": "Executing ALTER TABLE replays_dist ADD COLUMN IF NOT EXISTS url String AFTER title;", "severity": "info", "timestamp": "2024-08-08T20:35:55.310891Z"}
{"module": "snuba.migrations.operations", "event": "Executing ALTER TABLE replays_dist MODIFY COLUMN url Nullable(String);", "severity": "info", "timestamp": "2024-08-08T20:35:55.324555Z"}
{"module": "snuba.migrations.operations", "event": "Executing ALTER TABLE replays_dist ADD COLUMN IF NOT EXISTS error_ids Array(UUID) AFTER url;", "severity": "info", "timestamp": "2024-08-08T20:35:55.338164Z"}
{"module": "snuba.migrations.operations", "event": "Executing ALTER TABLE replays_dist ADD COLUMN IF NOT EXISTS _error_ids_hashed Array(UInt64) MATERIALIZED arrayMap(t -> cityHash64(t), error_ids) AFTER error_ids;", "severity": "info", "timestamp": "2024-08-08T20:35:55.350291Z"}
chipzzz commented 2 months ago

It sounds like after running add-node I should run the regular snuba migrations migrate --force on the additional nodes to complete the remaining migration steps. Hopefully it will let me, and not say that a migration has already been completed for that group.

https://github.com/getsentry/snuba/blob/master/snuba/snuba_migrations/metrics/0035_metrics_raw_timeseries_id.py#L11-L14
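
(To check which migrations the runner already considers completed per group, there is a status command - assuming it behaves the same in this version:)

    # Show per-group migration status as recorded by the migrations runner
    # (command per MIGRATIONS.md; verify against your Snuba version).
    snuba migrations list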

chipzzz commented 2 months ago

Without performing any additional actions, in both scenarios timeseries_id and _error_ids_hashed were later found in their respective tables on the new node. How does that happen?

At first I thought that, due to the exception, the tables were not created, but in fact they are! And checking their fields, they look normal.

chipzzz commented 2 months ago

Correction: some tables are in fact missing.

chipzzz commented 2 months ago

Rerunning add-node on one of the nodes for a missing table, in this case transactions_local, still complains about missing fields:

snuba@sentry-snuba-api-59554c5fbc-z9ks6:/usr/src/snuba$ snuba migrations add-node --type local --storage-set transactions --host-name clickhouse-shard1-1.clickhouse-headless.sentry-dev.svc.cluster.local --port 9000 --database default
2024-08-08 22:53:34,733 Initializing Snuba...
2024-08-08 22:53:36,318 Snuba initialization took 1.5789565416052938s
{"module": "snuba.migrations.operations", "event": "Executing CREATE TABLE IF NOT EXISTS transactions_local (project_id UInt64, event_id UUID, trace_id UUID, span_id UInt64, transaction_name LowCardinality(String), transaction_hash UInt64 MATERIALIZED cityHash64(transaction_name), transaction_op LowCardinality(String), transaction_status UInt8 DEFAULT 2, start_ts DateTime, start_ms UInt16, finish_ts DateTime, finish_ms UInt16, duration UInt32, platform LowCardinality(String), environment LowCardinality(Nullable(String)), release LowCardinality(Nullable(String)), dist LowCardinality(Nullable(String)), sdk_name LowCardinality(String) DEFAULT '', sdk_version LowCardinality(String) DEFAULT '', ip_address_v4 Nullable(IPv4), ip_address_v6 Nullable(IPv6), user String DEFAULT '', user_hash UInt64 MATERIALIZED cityHash64(user), user_id Nullable(String), user_name Nullable(String), user_email Nullable(String), tags Nested(key String, value String), _tags_flattened String, contexts Nested(key String, value String), _contexts_flattened String, partition UInt16, offset UInt64, message_timestamp DateTime, retention_days UInt16, deleted UInt8) ENGINE ReplicatedReplacingMergeTree('/clickhouse/tables/transactions/{shard}/default/transactions_local', '{replica}', deleted) ORDER BY (project_id, toStartOfDay(finish_ts), transaction_name, cityHash64(span_id)) PARTITION BY (retention_days, toMonday(finish_ts)) SAMPLE BY cityHash64(span_id) TTL finish_ts + toIntervalDay(retention_days) SETTINGS index_granularity=8192;", "severity": "info", "timestamp": "2024-08-08T22:53:36.506090Z"}
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/clickhouse/native.py", line 200, in execute
    result_data = query_execute()
                  ^^^^^^^^^^^^^^^
  File "/usr/src/snuba/snuba/clickhouse/native.py", line 183, in query_execute
    return conn.execute(  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 373, in execute
    rv = self.process_ordinary_query(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 571, in process_ordinary_query
    return self.receive_result(with_column_types=with_column_types,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 204, in receive_result
    return result.get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/result.py", line 50, in get_result
    for packet in self.packet_generator:
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 220, in packet_generator
    packet = self.receive_packet()
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clickhouse_driver/client.py", line 237, in receive_packet
    raise packet.exception
clickhouse_driver.errors.ServerException: Code: 47.
DB::Exception: Missing columns: 'timestamp' while processing query: 'timestamp', required columns: 'timestamp' 'timestamp'. Stack trace:

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0xe3b83d5 in /opt/bitnami/clickhouse/bin/clickhouse
1. ? @ 0x8b5f18c in /opt/bitnami/clickhouse/bin/clickhouse
2. DB::TreeRewriterResult::collectUsedColumns(std::shared_ptr<DB::IAST> const&, bool, bool) @ 0x13e5eb79 in /opt/bitnami/clickhouse/bin/clickhouse
3. DB::TreeRewriter::analyze(std::shared_ptr<DB::IAST>&, DB::NamesAndTypesList const&, std::shared_ptr<DB::IStorage const>, std::shared_ptr<DB::StorageSnapshot> const&, bool, bool, bool, bool) const @ 0x13e67b98 in /opt/bitnami/clickhouse/bin/clickhouse
4. DB::IndexDescription::getIndexFromAST(std::shared_ptr<DB::IAST> const&, DB::ColumnsDescription const&, std::shared_ptr<DB::Context const>) @ 0x14234414 in /opt/bitnami/clickhouse/bin/clickhouse
5. DB::IndicesDescription::parse(String const&, DB::ColumnsDescription const&, std::shared_ptr<DB::Context const>) @ 0x142354c1 in /opt/bitnami/clickhouse/bin/clickhouse
6. DB::ReplicatedMergeTreeTableMetadata::checkEquals(DB::ReplicatedMergeTreeTableMetadata const&, DB::ColumnsDescription const&, std::shared_ptr<DB::Context const>) const @ 0x14b329ca in /opt/bitnami/clickhouse/bin/clickhouse
7. DB::StorageReplicatedMergeTree::checkTableStructure(String const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&) @ 0x143643a9 in /opt/bitnami/clickhouse/bin/clickhouse
8. DB::StorageReplicatedMergeTree::StorageReplicatedMergeTree(String const&, String const&, bool, DB::StorageID const&, String const&, DB::StorageInMemoryMetadata const&, std::shared_ptr<DB::Context>, String const&, DB::MergeTreeData::MergingParams const&, std::unique_ptr<DB::MergeTreeSettings, std::default_delete<DB::MergeTreeSettings>>, bool, DB::RenamingRestrictions) @ 0x1435a8d5 in /opt/bitnami/clickhouse/bin/clickhouse
9. ? @ 0x14b44142 in /opt/bitnami/clickhouse/bin/clickhouse
10. DB::StorageFactory::get(DB::ASTCreateQuery const&, String const&, std::shared_ptr<DB::Context>, std::shared_ptr<DB::Context>, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, bool) const @ 0x1429205b in /opt/bitnami/clickhouse/bin/clickhouse
11. DB::InterpreterCreateQuery::doCreateTable(DB::ASTCreateQuery&, DB::InterpreterCreateQuery::TableProperties const&, std::unique_ptr<DB::DDLGuard, std::default_delete<DB::DDLGuard>>&) @ 0x139777d1 in /opt/bitnami/clickhouse/bin/clickhouse
12. DB::InterpreterCreateQuery::createTable(DB::ASTCreateQuery&) @ 0x13970542 in /opt/bitnami/clickhouse/bin/clickhouse
13. DB::InterpreterCreateQuery::execute() @ 0x1397ccb4 in /opt/bitnami/clickhouse/bin/clickhouse
14. ? @ 0x13efde87 in /opt/bitnami/clickhouse/bin/clickhouse
15. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x13efb40d in /opt/bitnami/clickhouse/bin/clickhouse
16. DB::TCPHandler::runImpl() @ 0x14cd1c11 in /opt/bitnami/clickhouse/bin/clickhouse
17. DB::TCPHandler::run() @ 0x14ce7719 in /opt/bitnami/clickhouse/bin/clickhouse
18. Poco::Net::TCPServerConnection::start() @ 0x17c4ef54 in /opt/bitnami/clickhouse/bin/clickhouse
19. Poco::Net::TCPServerDispatcher::run() @ 0x17c5017b in /opt/bitnami/clickhouse/bin/clickhouse
20. Poco::PooledThread::run() @ 0x17dcd527 in /opt/bitnami/clickhouse/bin/clickhouse
21. Poco::ThreadImpl::runnableEntry(void*) @ 0x17dcaf5d in /opt/bitnami/clickhouse/bin/clickhouse
22. start_thread @ 0x7ea7 in /lib/x86_64-linux-gnu/libpthread-2.31.so
23. __clone @ 0xfca2f in /lib/x86_64-linux-gnu/libc-2.31.so

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/snuba/snuba/cli/migrations.py", line 372, in add_node
    Runner.add_node(
  File "/usr/src/snuba/snuba/migrations/runner.py", line 548, in add_node
    op.execute_new_node(storage_sets, node_type, clickhouse)
  File "/usr/src/snuba/snuba/migrations/operations.py", line 595, in execute_new_node
    clickhouse.execute(sql)
  File "/usr/src/snuba/snuba/clickhouse/native.py", line 283, in execute
    raise ClickhouseError(e.message, code=e.code) from e
snuba.clickhouse.errors.ClickhouseError: DB::Exception: Missing columns: 'timestamp' while processing query: 'timestamp', required columns: 'timestamp' 'timestamp'. Stack trace:

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0xe3b83d5 in /opt/bitnami/clickhouse/bin/clickhouse
1. ? @ 0x8b5f18c in /opt/bitnami/clickhouse/bin/clickhouse
2. DB::TreeRewriterResult::collectUsedColumns(std::shared_ptr<DB::IAST> const&, bool, bool) @ 0x13e5eb79 in /opt/bitnami/clickhouse/bin/clickhouse
3. DB::TreeRewriter::analyze(std::shared_ptr<DB::IAST>&, DB::NamesAndTypesList const&, std::shared_ptr<DB::IStorage const>, std::shared_ptr<DB::StorageSnapshot> const&, bool, bool, bool, bool) const @ 0x13e67b98 in /opt/bitnami/clickhouse/bin/clickhouse
4. DB::IndexDescription::getIndexFromAST(std::shared_ptr<DB::IAST> const&, DB::ColumnsDescription const&, std::shared_ptr<DB::Context const>) @ 0x14234414 in /opt/bitnami/clickhouse/bin/clickhouse
5. DB::IndicesDescription::parse(String const&, DB::ColumnsDescription const&, std::shared_ptr<DB::Context const>) @ 0x142354c1 in /opt/bitnami/clickhouse/bin/clickhouse
6. DB::ReplicatedMergeTreeTableMetadata::checkEquals(DB::ReplicatedMergeTreeTableMetadata const&, DB::ColumnsDescription const&, std::shared_ptr<DB::Context const>) const @ 0x14b329ca in /opt/bitnami/clickhouse/bin/clickhouse
7. DB::StorageReplicatedMergeTree::checkTableStructure(String const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&) @ 0x143643a9 in /opt/bitnami/clickhouse/bin/clickhouse
8. DB::StorageReplicatedMergeTree::StorageReplicatedMergeTree(String const&, String const&, bool, DB::StorageID const&, String const&, DB::StorageInMemoryMetadata const&, std::shared_ptr<DB::Context>, String const&, DB::MergeTreeData::MergingParams const&, std::unique_ptr<DB::MergeTreeSettings, std::default_delete<DB::MergeTreeSettings>>, bool, DB::RenamingRestrictions) @ 0x1435a8d5 in /opt/bitnami/clickhouse/bin/clickhouse
9. ? @ 0x14b44142 in /opt/bitnami/clickhouse/bin/clickhouse
10. DB::StorageFactory::get(DB::ASTCreateQuery const&, String const&, std::shared_ptr<DB::Context>, std::shared_ptr<DB::Context>, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, bool) const @ 0x1429205b in /opt/bitnami/clickhouse/bin/clickhouse
11. DB::InterpreterCreateQuery::doCreateTable(DB::ASTCreateQuery&, DB::InterpreterCreateQuery::TableProperties const&, std::unique_ptr<DB::DDLGuard, std::default_delete<DB::DDLGuard>>&) @ 0x139777d1 in /opt/bitnami/clickhouse/bin/clickhouse
12. DB::InterpreterCreateQuery::createTable(DB::ASTCreateQuery&) @ 0x13970542 in /opt/bitnami/clickhouse/bin/clickhouse
13. DB::InterpreterCreateQuery::execute() @ 0x1397ccb4 in /opt/bitnami/clickhouse/bin/clickhouse
14. ? @ 0x13efde87 in /opt/bitnami/clickhouse/bin/clickhouse
15. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x13efb40d in /opt/bitnami/clickhouse/bin/clickhouse
16. DB::TCPHandler::runImpl() @ 0x14cd1c11 in /opt/bitnami/clickhouse/bin/clickhouse
17. DB::TCPHandler::run() @ 0x14ce7719 in /opt/bitnami/clickhouse/bin/clickhouse
18. Poco::Net::TCPServerConnection::start() @ 0x17c4ef54 in /opt/bitnami/clickhouse/bin/clickhouse
19. Poco::Net::TCPServerDispatcher::run() @ 0x17c5017b in /opt/bitnami/clickhouse/bin/clickhouse
20. Poco::PooledThread::run() @ 0x17dcd527 in /opt/bitnami/clickhouse/bin/clickhouse
21. Poco::ThreadImpl::runnableEntry(void*) @ 0x17dcaf5d in /opt/bitnami/clickhouse/bin/clickhouse
22. start_thread @ 0x7ea7 in /lib/x86_64-linux-gnu/libpthread-2.31.so
23. __clone @ 0xfca2f in /lib/x86_64-linux-gnu/libc-2.31.so
chipzzz commented 2 months ago

So, as per my comment here and per the help text from add-node:

All of the SQL operations for the provided storage sets will be run. Any non
  SQL (Python) operations will be skipped.

How do I rerun the "python migrations" that have additional steps that may have been missed by add-node? If I do something like

CLICKHOUSE_HOST=clickhouse-shard1-1.clickhouse-headless.sentry-dev.svc.cluster.local snuba migrations migrate -g transactions --force

I get

Finished running migrations

Do I have to reverse all migrations for the transactions group and rerun them all on the problematic node? Is there a way to reverse an entire group's migrations rather than a single migration step?

... Sure, I can just add the missing timestamp column to the table, but I want to be holistic and run all the missing steps in the "python migration operations".
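
For the record, what I'd want is something like re-running a single migration against just the problematic node - reverse and run take a single migration id if I read MIGRATIONS.md right (flags may differ by version; the migration id below is just the transactions group's first migration, as an example):

    # Point the CLI at the problematic node, then reverse and re-run one
    # specific migration (ids/flags per MIGRATIONS.md; verify on your version).
    export CLICKHOUSE_HOST=clickhouse-shard1-1.clickhouse-headless.sentry-dev.svc.cluster.local
    snuba migrations reverse --group transactions --migration-id 0001_transactions --force
    snuba migrations run --group transactions --migration-id 0001_transactions --force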

MeredithAnya commented 2 months ago

hi @chipzzz

Are you running the migration commands on the new node before adding it to the cluster? Because add-node basically runs all the migrations in order, the initial CREATE TABLE statement won't match the current state of the other tables (which already reflect every migration).

An alternative to creating tables on new nodes with add-node is to copy the CREATE TABLE statements from a replica and then run them on the new node. I don't think we have documentation for this, and it's been a while since I verified it, but we did add a copy-tables script https://github.com/getsentry/snuba/blob/master/scripts/copy_tables.py that was supposed to help with this. I don't believe this will work for adding a new shard, though, since the ZooKeeper path has to change for the new shard.
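
A minimal sketch of that copy-the-DDL approach, assuming the {shard}/{replica} macros resolve correctly on the target node (otherwise the ZooKeeper path in the statement needs editing by hand; hosts below are illustrative):

    # Copy a table definition from an existing replica to the new node.
    clickhouse-client --host <existing-replica> --port 9000 \
      --query "SHOW CREATE TABLE default.transactions_local" --format TSVRaw |
    clickhouse-client \
      --host clickhouse-shard1-1.clickhouse-headless.sentry-dev.svc.cluster.local \
      --port 9000 --multiquery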

Let me know if I missed something from your comments, and I'll take another look.

chipzzz commented 2 months ago

@MeredithAnya, that's one thing I didn't try: deploying the node without connecting it to the cluster and then running the commands. I will have to revisit this. However, I think I would run into more issues rebalancing ClickHouse https://clickhouse.com/docs/en/guides/sre/scaling-clusters. That's a whole other problem.

For now I've opted out of shards (they caused lots of problems during migrations) and am just using replicas. When I find the need to scale I'll revisit this. But I agree, replicating and sharding ClickHouse alongside Sentry is a mess and not well documented.

onkar commented 1 month ago

@chipzzz feel free to open a PR with the documentation change. Closing this issue now.