getsentry / snuba


snuba-transactions-consumer error after upgrade to 22.5.0 #2768

Closed: ruiengana closed this issue 2 years ago

ruiengana commented 2 years ago

Environment

How do you use Sentry? self-hosted/on-premise (22.5.0)

Steps to Reproduce

Upgrade from 22.3.0 to 22.5.0

Expected Result

snuba-transactions-consumer to process events without schema errors

Actual Result

snuba-transactions-consumer is unable to process events due to schema errors

2022-05-25 12:45:46,446 Consumer Starting
2022-05-25 12:45:46,448 librdkafka log level: 6
2022-05-25 12:45:46,501 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 23109}
2022-05-25 12:46:27,594 Caught {
  "__type__": "SerializableException",
  "__name__": "ClickhouseWriterError",
  "__message__": "No such column _tags_flattened in table default.transactions_local: Insertion status:",
  "__should_report__": true,
  "__extra_data__": {
    "code": 16,
    "row": null
  }
}, shutting down...
2022-05-25 12:46:27,594 Terminating <snuba.consumers.consumer.ProcessedMessageBatchWriter object at 0x7fa9abb29430>...
2022-05-25 12:46:27,594 Closing <snuba.utils.streams.kafka_consumer_with_commit_log.KafkaConsumerWithCommitLog object at 0x7fa9abaabaf0>...
2022-05-25 12:46:27,598 Partitions revoked: [Partition(topic=Topic(name='events'), index=0)]
2022-05-25 12:46:27,603 Processor terminated
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/consumer.py", line 188, in consumer
    consumer.run()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 121, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 156, in _run_once
    self.__processing_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/streaming/filter.py", line 26, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/streaming/transform.py", line 63, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/streaming/collect.py", line 135, in poll
    self.close_and_reset_batch()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/streaming/collect.py", line 199, in close_and_reset_batch
    self.future.result(timeout=5)
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/streaming/collect.py", line 210, in __finish_batch
    batch.close()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/streaming/collect.py", line 66, in close
    self.__step.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 284, in close
    self.__insert_batch_writer.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 100, in close
    self.__writer.write(
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 274, in write
    batch.join()
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 222, in join
    raise ClickhouseWriterError(message, code=code, row=row)
snuba.clickhouse.errors.ClickhouseWriterError: No such column _tags_flattened in table default.transactions_local: Insertion status:
ruiengana commented 2 years ago

See https://github.com/getsentry/snuba/pull/2506

onewland commented 2 years ago

Are you up to date on migrations? What is the output of snuba migrations migrate on a server of yours?

Also, could you share the output of running select * from migrations_local where group = 'transactions'; in a clickhouse console?

ruiengana commented 2 years ago

All migrations are up to date. I get the same result on all 3 nodes.

$ snuba migrations migrate
Finished running migrations
sentry-clickhouse :) select * from migrations_local where group = 'transactions';

SELECT *
FROM migrations_local
WHERE group = 'transactions'

┌─group────────┬─migration_id────────────────────────────────────────────┬───────────timestamp─┬─status────┬─version─┐
│ transactions │ 0001_transactions                                       │ 2022-04-22 01:14:37 │ completed │       2 │
│ transactions │ 0002_transactions_onpremise_fix_orderby_and_partitionby │ 2022-04-22 01:14:37 │ completed │       1 │
│ transactions │ 0003_transactions_onpremise_fix_columns                 │ 2022-04-22 01:14:40 │ completed │       2 │
│ transactions │ 0004_transactions_add_tags_hash_map                     │ 2022-04-22 01:14:40 │ completed │       2 │
│ transactions │ 0005_transactions_add_measurements                      │ 2022-04-22 01:14:40 │ completed │       2 │
│ transactions │ 0006_transactions_add_http_fields                       │ 2022-04-22 01:14:41 │ completed │       2 │
│ transactions │ 0007_transactions_add_discover_cols                     │ 2022-04-22 01:14:42 │ completed │       2 │
│ transactions │ 0008_transactions_add_timestamp_index                   │ 2022-04-22 01:14:42 │ completed │       2 │
│ transactions │ 0009_transactions_fix_title_and_message                 │ 2022-04-22 01:14:43 │ completed │       2 │
│ transactions │ 0010_transactions_nullable_trace_id                     │ 2022-04-22 01:14:43 │ completed │       2 │
│ transactions │ 0011_transactions_add_span_op_breakdowns                │ 2022-04-22 01:14:43 │ completed │       2 │
│ transactions │ 0012_transactions_add_spans                             │ 2022-04-22 01:14:44 │ completed │       2 │
│ transactions │ 0013_transactions_reduce_spans_exclusive_time           │ 2022-04-22 01:14:45 │ completed │       2 │
│ transactions │ 0014_transactions_remove_flattened_columns              │ 2022-05-23 14:59:30 │ completed │       2 │
└──────────────┴─────────────────────────────────────────────────────────┴─────────────────────┴───────────┴─────────┘

14 rows in set. Elapsed: 0.003 sec.

sentry-clickhouse :)

Any ideas?

onewland commented 2 years ago

Can you share the message at the current offset of the events topic for the snuba-transactions-consumer group (kcat should be a useful tool for this)? Feel free to redact any sensitive data.
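
For example, something like this should print the single message at a given partition/offset (a hypothetical invocation; substitute your broker address and the offset reported for the group):

$ kcat -C -b localhost:9092 -t events -p 0 -o <offset> -c 1 -e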

onewland commented 2 years ago

You should be able to get the partition and offset with a command like:

[appuser@07ae8e5f5df3 ~]$ kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group transactions_group

GROUP              TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
transactions_group events          0          59912           59912           0               rdkafka-9c734817-ccc9-48dc-aba7-c434db04f925 /172.18.0.8     rdkafka
ruiengana commented 2 years ago
$ kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group transactions_group

Consumer group 'transactions_group' has no active members.

GROUP              TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
transactions_group events          0          23109           25404           2295            -               -               -

I see a large lag on the events topic. I guess the problem is that there is data from the previous version pending to be processed which still depends on the _tags_flattened column, right?

Will try to reset the offset and see if the consumer recovers...
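
For reference, here is one way to do that with the stock Kafka tooling (a hedged sketch: this jumps the group to the log end, so the unprocessed backlog is dropped; group and topic names taken from the output above):

$ kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group transactions_group --topic events --reset-offsets --to-latest --execute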

ruiengana commented 2 years ago

The reset didn't work :(

Here is the output of the topic:

$ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events
[2,"insert",{"group_id":null,"event_id":"c9bcd858624b47c58eabf4585e65a819","organization_id":1,"project_id":1,"message":"sentry.tasks.send_ping","platform":"python","datetime":"2022-05-26T23:25:58.445421Z","data":{"event_id":"c9bcd858624b47c58eabf4585e65a819","level":"info","version":"7","type":"transaction","transaction":"sentry.tasks.send_ping","logger":"","platform":"python","timestamp":1653607558.445421,"start_timestamp":1653607558.432148,"received":1653607558.464062,"release":"backend@22.5.0+ffc60a7493ca649ceba3282770feecc563ec3713","environment":"production","contexts":{"runtime":{"name":"CPython","version":"3.8.12","build":"3.8.12 (default, Mar  2 2022, 05:04:35) \n[GCC 10.2.1 20210110]","type":"runtime"},"trace":{"trace_id":"71a77850b5e143ebaa5d0cd045cf960e","span_id":"8b027e5e08f367e4","parent_span_id":"b8a2f6b705d21081","op":"celery.task","status":"ok","type":"trace"}},"tags":[["celery_task_id","922daf44-ae43-473c-a7d6-261ee705e80e"],["status","ok"],["task_name","sentry.tasks.send_ping"],["transaction_id",null],["server_name","sentry-worker-745c8fc65f-jztj9"],["level","info"],["environment","production"],["transaction","sentry.tasks.send_ping"],["sentry:release","backend@22.5.0+ffc60a7493ca649ceba3282770feecc563ec3713"],["runtime","CPython 3.8.12"],["runtime.name","CPython"]],"extra":{"celery-job":{"args":[],"kwargs":{},"task_name":"sentry.tasks.send_ping"},"sys.argv":["/usr/local/bin/sentry","run","worker"]},"sdk":{"name":"sentry.python","version":"1.5.12","integrations":["argv","atexit","celery","dedupe","django","django_atomic","excepthook","logging","modules","redis","rust_info","stdlib","threading"],"packages":[{"name":"pypi:sentry-sdk","version":"1.5.12"}]},"key_id":"1","project":1,"grouping_config":{"enhancements":"eJybzDRxY05qemJypZWRgaGlroGxrqHRBABbKAcD","id":"legacy:2019-03-12"},"spans":[{"timestamp":1653607558.438662,"start_timestamp":1653607558.433545,"description":"connect","op":"db","span_id":"b559b0f5c5716e47","parent_span_id":"8b027e5e08f367e4","trace_id":"71a77850b5e143ebaa5d0cd045cf960e","same_process_as_parent":true,"tags":null,"data":null},{"timestamp":1653607558.441642,"start_timestamp":1653607558.438813,"description":"UPDATE \"sentry_option\" SET \"value\" = %s, \"last_updated\" = %s WHERE \"sentry_option\".\"key\" = %s","op":"db","span_id":"8ff6c2187e5a3967","parent_span_id":"8b027e5e08f367e4","trace_id":"71a77850b5e143ebaa5d0cd045cf960e","same_process_as_parent":true,"tags":null,"data":null},{"timestamp":1653607558.443394,"start_timestamp":1653607558.442432,"description":"UPDATE \"sentry_option\" SET \"value\" = %s, \"last_updated\" = %s WHERE \"sentry_option\".\"key\" = %s","op":"db","span_id":"b4f4288883b5bc00","parent_span_id":"8b027e5e08f367e4","trace_id":"71a77850b5e143ebaa5d0cd045cf960e","same_process_as_parent":true,"tags":null,"data":null}],"_metrics":{"bytes.ingested.event":2203,"bytes.stored.event":2847},"culprit":"sentry.tasks.send_ping","metadata":{"title":"sentry.tasks.send_ping","location":"sentry.tasks.send_ping"},"title":"sentry.tasks.send_ping","location":"sentry.tasks.send_ping","nodestore_insert":1653607559.544942},"primary_hash":"d41d8cd98f00b204e9800998ecf8427e","retention_days":90},{"is_new":false,"is_regression":false,"is_new_group_environment":false,"skip_consume":false}]
[2,"insert",{"group_id":null,"event_id":"ca0145028cbb41ceb0417e918ba94161","organization_id":1,"project_id":1,"message":"sentry.tasks.send_ping","platform":"python","datetime":"2022-05-26T23:26:58.466999Z","data":{"event_id":"ca0145028cbb41ceb0417e918ba94161","level":"info","version":"7","type":"transaction","transaction":"sentry.tasks.send_ping","logger":"","platform":"python","timestamp":1653607618.466999,"start_timestamp":1653607618.441819,"received":1653607618.542113,"release":"backend@22.5.0+ffc60a7493ca649ceba3282770feecc563ec3713","environment":"production","contexts":{"runtime":{"name":"CPython","version":"3.8.12","build":"3.8.12 (default, Mar  2 2022, 05:04:35) \n[GCC 10.2.1 20210110]","type":"runtime"},"trace":{"trace_id":"fd302db2273d463e81862dea846dcfb9","span_id":"bb33b11d7aaa0977","parent_span_id":"8fa4f603872909c7","op":"celery.task","status":"ok","type":"trace"}},"tags":[["celery_task_id","06652d84-55c7-453f-98c7-e9ebd4611283"],["status","ok"],["task_name","sentry.tasks.send_ping"],["transaction_id",null],["server_name","sentry-worker-745c8fc65f-pwr8q"],["level","info"],["environment","production"],["transaction","sentry.tasks.send_ping"],["sentry:release","backend@22.5.0+ffc60a7493ca649ceba3282770feecc563ec3713"],["runtime","CPython 3.8.12"],["runtime.name","CPython"]],"extra":{"celery-job":{"args":[],"kwargs":{},"task_name":"sentry.tasks.send_ping"},"sys.argv":["/usr/local/bin/sentry","run","worker"]},"sdk":{"name":"sentry.python","version":"1.5.12","integrations":["argv","atexit","celery","dedupe","django","django_atomic","excepthook","logging","modules","redis","rust_info","stdlib","threading"],"packages":[{"name":"pypi:sentry-sdk","version":"1.5.12"}]},"key_id":"1","project":1,"grouping_config":{"enhancements":"eJybzDRxY05qemJypZWRgaGlroGxrqHRBABbKAcD","id":"legacy:2019-03-12"},"spans":[{"timestamp":1653607618.459881,"start_timestamp":1653607618.443081,"description":"connect","op":"db","span_id":"92e22a00eb4fb313","parent_span_id":"bb33b11d7aaa0977","trace_id":"fd302db2273d463e81862dea846dcfb9","same_process_as_parent":true,"tags":null,"data":null},{"timestamp":1653607618.463977,"start_timestamp":1653607618.460231,"description":"UPDATE \"sentry_option\" SET \"value\" = %s, \"last_updated\" = %s WHERE \"sentry_option\".\"key\" = %s","op":"db","span_id":"a10105d8c3ddcba3","parent_span_id":"bb33b11d7aaa0977","trace_id":"fd302db2273d463e81862dea846dcfb9","same_process_as_parent":true,"tags":null,"data":null},{"timestamp":1653607618.466566,"start_timestamp":1653607618.465103,"description":"UPDATE \"sentry_option\" SET \"value\" = %s, \"last_updated\" = %s WHERE \"sentry_option\".\"key\" = %s","op":"db","span_id":"b3fa71b48bdd210d","parent_span_id":"bb33b11d7aaa0977","trace_id":"fd302db2273d463e81862dea846dcfb9","same_process_as_parent":true,"tags":null,"data":null}],"_metrics":{"bytes.ingested.event":2203,"bytes.stored.event":2847},"culprit":"sentry.tasks.send_ping","metadata":{"title":"sentry.tasks.send_ping","location":"sentry.tasks.send_ping"},"title":"sentry.tasks.send_ping","location":"sentry.tasks.send_ping","nodestore_insert":1653607619.623821},"primary_hash":"d41d8cd98f00b204e9800998ecf8427e","retention_days":90},{"is_new":false,"is_regression":false,"is_new_group_environment":false,"skip_consume":false}]
ruiengana commented 2 years ago

@onewland any ideas?

pehlert commented 2 years ago

We are observing the same issue. As a temporary workaround, I have disabled the relevant worker, which is not crucial for us right now.

I have also tried to DETACH/ATTACH the transactions table in Clickhouse, but to no avail. In our case the data there is luckily not crucial; could a remedy be to DROP and recreate it? I'd be happy to try, but I unfortunately lack the Clickhouse/Zookeeper experience to figure out how.

nikhars commented 2 years ago

Hi @ruiengana. Can you provide the output of the following command from the container running the snuba transactions consumer? I wonder if the environment is not running with the right version of the code.

snuba --version

ruiengana commented 2 years ago

Hi @ruiengana. Can you provide the output of the following command from the container running the snuba transactions consumer? I wonder if the environment is not running with the right version of the code.

snuba --version

NOTE: It was really hard to run this command because the pod is always restarting. You literally have a 1-second window to run it before the pod crashes.

$ kubectl exec -it sentry-snuba-transactions-consumer-77544c6757-fkk6s -- snuba --version
snuba, version 22.5.0
ruiengana commented 2 years ago

We are observing the same issue. As a temporary workaround, I have disabled the relevant worker, which is not crucial for us right now.

I have also tried to DETACH/ATTACH the transactions table in Clickhouse, but to no avail. In our case the data there is luckily not crucial; could a remedy be to DROP and recreate it? I'd be happy to try, but I unfortunately lack the Clickhouse/Zookeeper experience to figure out how.

Can you be more explicit regarding "disabled the relevant worker"?

nikhars commented 2 years ago

I am not quite sure why you are seeing this error. The only thing I can think of is that snuba consumers got upgraded before the migration got applied. I would recommend doing the following and seeing if it helps.

  1. Completely stop/drain the existing snuba-transactions pods/containers.
  2. Start fresh/new snuba-transactions pods/containers.
ruiengana commented 2 years ago

I am not quite sure why you are seeing this error. The only thing I can think of is that snuba consumers got upgraded before the migration got applied. I would recommend doing the following and seeing if it helps.

  1. Completely stop/drain the existing snuba-transactions pods/containers.
  2. Start fresh/new snuba-transactions pods/containers.

I scaled the deployment down to 0, watched the pods terminate, and then scaled it back to 1, but I still have the same problem.

Any other ideas?

ruiengana commented 2 years ago

I managed to resolve this the hard way: I deleted all Kafka and Clickhouse volumes (everything except PostgreSQL) and installed again to recreate them from scratch.

Would be great to have a command to delete all events.

xgengsjc2021 commented 2 years ago

@ruiengana Is it safe to delete the Clickhouse persistent volume (PV)? Will there be any data loss after deleting the PV? We upgraded from 14 to 15.1.0 and got the same issue. (I think deleting the Kafka PV should be fine.)

ruiengana commented 2 years ago

From what I know, Sentry storage works like this: Postgres holds system configuration, while event data lives in Kafka and Clickhouse.

If you can afford to lose your event data then you should be fine.

fpacifici commented 2 years ago

Postgres holds system configuration

Hi, this is not 100% accurate. In the default self-hosted installation, Postgres also contains event payloads. Though in practice, if you delete data from Clickhouse, Sentry will not have a way to find the event ids to load from Postgres, so you will be fine (your UI will not be broken, and the data will be gone).

Cleaning up the Clickhouse volume does the trick. As the DDL migration state is also contained in Clickhouse, Snuba will interpret that as "there are no tables" so it will recreate them upon upgrade.

xgengsjc2021 commented 2 years ago

@fpacifici Hi, deleting the PV is only a workaround. Can we figure out the root cause, so that we do not need to delete the PV?

https://github.com/getsentry/snuba/blob/master/snuba/migrations/snuba_migrations/transactions/0014_transactions_remove_flattened_columns.py

This migration deleted the "_tags_flattened" column during the upgrade, but why does the code still reference this column after the upgrade?

junowong0114 commented 2 years ago

@xgengsjc2021 I agree with you. I have the same issue after upgrading from 11.9.0 to 22.7.0. I temporarily solved the problem by running snuba migrations reverse --group transactions --migration-id 0014_transactions_remove_flattened_columns --force in the snuba-api container to reverse the deletion of the _tags_flattened column.

I am 100% sure I am running snuba 22.7.0, so why is the code trying to write to the _tags_flattened column after its deletion?

xgengsjc2021 commented 2 years ago

@fpacifici @onewland Do you have any plan to fix this one?

junowong0114 commented 2 years ago

@xgengsjc2021 I agree with you. I have the same issue after upgrading from 11.9.0 to 22.7.0. I temporarily solved the problem by running snuba migrations reverse --group transactions --migration-id 0014_transactions_remove_flattened_columns --force in the snuba-api container to reverse the deletion of the _tags_flattened column.

I am 100% sure I am running snuba 22.7.0, so why is the code trying to write to the _tags_flattened column after its deletion?

TLDR

I got rid of the error by: 1. Reversing the migration 0014_transactions_remove_flattened_columns. 2. After waiting for some time, re-running the migration. (Sketched as commands below.)
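
A sketch of the two steps as commands (the reverse invocation is verbatim from above; the re-run assumes snuba's migrations run subcommand, so double-check against snuba migrations --help):

$ snuba migrations reverse --group transactions --migration-id 0014_transactions_remove_flattened_columns --force
$ snuba migrations run --group transactions --migration-id 0014_transactions_remove_flattened_columns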

Details

After reversing the 0014_transactions_remove_flattened_columns snuba migration, the installation ran without errors. The next morning (~10 hours after the upgrade), in an effort to solve this issue, I re-ran the migration. I was surprised to see the error was gone.

I have a hypothesis as to what happened. First off, I did not block event ingestion during the upgrade. I suspect the request to write to the _tags_flattened column was generated before the upgrade. By the time the request reached the consumer, the _tags_flattened column had already been deleted, so the write failed. The request is somehow cached in one or more of the upstream components and is constantly retried, causing the consumer to fall into a crash loop.

Looking at the event ingestion pipeline in the Sentry documentation, the caching should be in Kafka, as only Kafka has a PVC that could be used to cache data. However, I am not an expert on the snuba/clickhouse/kafka architecture; maybe someone with more expertise can comment on my hypothesis?

xyfleet commented 1 year ago

@junowong0114 I got the same error here but could not fix it with your workaround. I tried to run the command but got an error in the sentry-snuba-api pod:

snuba@sentry-snuba-api-7d55b7cfdf-66gdk:/usr/src/snuba$ snuba migrations reverse --group transactions --migration-id 0014_transactions_remove_flattened_columns --force
Error: Subsequent migrations must be reversed first

About the second point, "re-run the migration": do you mean re-upgrading Sentry again?

@onewland @fpacifici I tried to upgrade Sentry to 15.3.0 directly to get past this snuba issue but still got the same error. (Sentry 15.3.0 and 16.0.0 both use the same version of Snuba, which is 22.9.0.) Is there any new release to fix this issue?

junowong0114 commented 1 year ago

@junowong0114 I got the same error here but could not fix it with your workaround. I tried to run the command but got an error in the sentry-snuba-api pod:

snuba@sentry-snuba-api-7d55b7cfdf-66gdk:/usr/src/snuba$ snuba migrations reverse --group transactions --migration-id 0014_transactions_remove_flattened_columns --force
Error: Subsequent migrations must be reversed first

About the second point, "re-run the migration": do you mean re-upgrading Sentry again?

@onewland @fpacifici I tried to upgrade Sentry to 15.3.0 directly to get past this snuba issue but still got the same error. (Sentry 15.3.0 and 16.0.0 both use the same version of Snuba, which is 22.9.0.) Is there any new release to fix this issue?

I found a better solution, but it requires you to upgrade again from the start. I hope you did not upgrade your production deployment directly, or it will be difficult to fix.

The new version of snuba runs migrations in distributed mode by default. See the official documentation on distributed mode. However, the old version of Clickhouse is configured to use local mode, hence the migration will fail miserably.

The best way to solve this problem is by modifying the chart to:

  1. Run the Snuba migration in local mode.
  2. After the migration succeeds, revert Snuba to distributed mode.

See this comment for details. Hope this helps.

xyfleet commented 1 year ago

@junowong0114 Thank you so much. I will test your solution in my DEV env. (I rolled back Sentry to version 14.)

xyfleet commented 1 year ago

@junowong0114 Before testing this on my site, I have several questions. 1:

Configure Snuba to run in single node mode.
Edit sentry/templates/configmap-sentry.yaml

Do you mean the configmap-snuba.yaml file (not sentry), right? And should I only keep "single_node": True,?

2:

Revert snuba-init job to single-node mode.
Edit sentry/templates/hooks/snuba-db-init.job.yaml
+ - name: CLICKHOUSE_SINGLE_NODE
+   value: "true"

Which file reads this variable to ensure snuba runs in local (single-node) mode? I searched the whole repo and did not find any place that uses this variable.

3:


The new version of snuba by default runs migration in distributed mode. See [official documentation on distributed mode](https://getsentry.github.io/snuba/migrations/modes.html). However, the old version of Clickhouse is configured to use local mode, hence the migration will fail miserably.

The best way to solve this problem is by modifying the chart to

Run Snuba migration in local mode
After migration success, revert Snuba to distributed mode.

Based on what you mentioned, is it possible for us to simply upgrade ClickHouse to the new version?

junowong0114 commented 1 year ago

@xyfleet

  1. Yes, correct, it is configmap-snuba.yaml, sorry for the typo.
  2. I don't remember precisely where I saw it; CLICKHOUSE_SINGLE_NODE should be an environment variable that configures the Snuba CLI to use single-node mode when talking to Clickhouse. I copied this from an older version of the same configmap.
  3. You can't just upgrade Clickhouse and call it a day. The Snuba migrations modify the table schemas to make them work in distributed mode. Thus you need to let all migrations finish in local mode; only then can you run Snuba in distributed mode. (See the sketch below.)
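
For anyone hunting for the setting, here is a rough sketch of what the relevant snuba cluster definition looks like (key names assumed from snuba's settings format; check the actual settings file or configmap in your deployment):

CLUSTERS = [
    {
        "host": "sentry-clickhouse",
        "port": 9000,
        "http_port": 8123,
        "single_node": True,  # local mode; set to False for distributed mode
        # Distributed mode additionally needs the ClickHouse cluster names,
        # e.g. "cluster_name" and "distributed_cluster_name".
    },
]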
fpacifici commented 1 year ago

Sorry for the late answer. I missed this as it was closed.

Looking at the event ingestion pipeline from Sentry documentation, the caching should be from Kafka, as only Kafka has a PVC that can be used to cache data. However, I am not an expert on snuba/clickhouse/kafka architecture, maybe someone with more expertise can comment on my hypothesis?

There is no caching involved in the way we write to Clickhouse. The consumer takes a message from the Kafka topic and builds a brand new dictionary per row. It does not blindly copy anything from Kafka; it starts from an empty dict and copies individual fields, roughly as sketched below. So the field cannot come from the Kafka topic. Also, _tags_flattened was never in Kafka to begin with, as it was synthesized by the Snuba consumer between Kafka and Clickhouse.
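
To illustrate the point, a simplified sketch (not the actual snuba processor code) of how a consumer row is built:

# Each row starts from an empty dict and only explicitly chosen fields are
# copied in, so a stray column cannot leak through from the Kafka payload.
def build_row(event: dict) -> dict:
    row = {}
    row["event_id"] = event["event_id"]
    row["project_id"] = event["project_id"]
    row["timestamp"] = event["datetime"]
    return row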

@onewland @fpacifici I tried to upgrade the sentry to 15.3.0 directly to pass this snuba issue but still got the same error. (The sentry 15.3.0 or 16.0.0 is using the same version of Snuba which is 22.9.0). Any new release to fix this issue?

The code that wrote the _tags_flattened column was removed a long time ago: https://github.com/getsentry/snuba/commit/a5c68d6af964dd0b070c9a01614738f73e680ecb That was in Snuba 20.7.0. Since then the consumer simply has not had the code to generate the field for Clickhouse, so it is really puzzling that you are getting that error on Snuba 22.9.0. The error at the beginning of this issue means: "The consumer sent a field named _tags_flattened to Clickhouse in one of the rows to write, but the column does not exist, as it was removed through the migration." No trace of that field has existed in the consumer code since 2020.

I think there is some misunderstanding about Sentry releases. Sentry never had a 15.3.0 version. We moved from 10.0.0 to semantic versioning: https://github.com/getsentry/sentry/releases?page=5. Sentry never reached version 15. Could you please provide more context on which release you are referring to?

Also, are you 100% sure the error is the same one we were discussing at the beginning of this issue? snuba.clickhouse.errors.ClickhouseWriterError: No such column _tags_flattened in table default.transactions_local: Insertion status:

Thanks

xyfleet commented 1 year ago

@fpacifici Thanks for the update. Let me clarify my case and answer your questions. 1: I installed Sentry through this helm chart: https://github.com/sentry-kubernetes/charts/tree/develop/sentry.

sentry chart: 14.1.0
appVersion: 22.5.0
so, the version of snuba in 14.1.0 is 22.5.0
the sentry image tag is: 22.5.0

sentry chart: 15.3.0
appVersion: 22.9.0
So, the version of snuba in 15.3.0 is 22.9.0
the sentry image tag is: 22.9.0

Based on my previous test, upgrading from chart 14.1.0 to chart 15.3.0 gives the same error: No such column _tags_flattened in table default.transactions_local: Insertion status:. At first the sentry-snuba-transactions-consumer-xxxx pod works fine, but after just a few seconds it fails. The log below shows more details; please help check. As a result, the sentry-snuba-transactions-consumer-xxxx pod is stuck in a restart cycle forever.

2022-11-14 19:15:53,285 Consumer Starting
2022-11-14 19:15:53,286 librdkafka log level: 6
2022-11-14 19:15:53,328 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 560436}
2022-11-14 19:16:29,512 Caught exception, shutting down...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 131, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 166, in _run_once
    self.__processing_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/streaming/filter.py", line 26, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/streaming/transform.py", line 63, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/streaming/collect.py", line 135, in poll
    self.close_and_reset_batch()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/streaming/collect.py", line 199, in close_and_reset_batch
    self.future.result(timeout=5)
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
... ... 
snuba.clickhouse.errors.ClickhouseWriterError: No such column _tags_flattened in table default.transactions_local: Insertion status:
2022-11-14 19:16:29,519 Terminating <snuba.consumers.consumer.ProcessedMessageBatchWriter object at 0x7f69c35d7b80>...
2022-11-14 19:16:29,519 Closing <snuba.utils.streams.kafka_consumer_with_commit_log.KafkaConsumerWithCommitLog object at 0x7f69c35bb820>...
2022-11-14 19:16:29,519 Partitions revoked: [Partition(topic=Topic(name='events'), index=0)]
2022-11-14 19:16:29,529 Processor terminated
onewland commented 1 year ago

sentry-kubernetes is a third-party repository -- those versions have no meaning for us

onewland commented 1 year ago

It sounds like you need to update to the latest snuba code

xyfleet commented 1 year ago

It sounds like you need to update to the latest snuba code

@onewland Thanks. Do you want me to install the latest snuba image (22.10.0) in my current deployment? Could you please explain a little more? As I mentioned, I tried to upgrade Sentry from chart 14.1.0 to 15.3.0 but failed. (That upgraded Snuba from 22.5.0 to 22.9.0.)

onewland commented 1 year ago

Note that we don't author the releases for those charts, we don't use them for our own deploys, and we don't explicitly support them. They are a third party who packages our software and we exercise no control over what they release.

For further diagnosis into the snuba error you're seeing, can you share the output of grep 'release' $SNUBA_ROOT/docs/source/conf.py in the snuba transactions consumer pod? I'm not sure what the correct value of $SNUBA_ROOT will be in this deployment as, again, it's not under our control.

xyfleet commented 1 year ago

@onewland Thanks for your update. I wish you guys would release an official helm chart we could install easily. The snuba transactions consumer pod keeps crashing, so it's hard to get in.

# I only found one file called conf.py. Not sure if it's what you are looking for.
/usr/local/lib/python3.8/site-packages/mypyc/doc/conf.py

In this file,

# The short X.Y version.
version = mypy_version.split('-')[0]
# The full version, including alpha/beta/rc tags.
release = mypy_version

btw, in this pod,

snuba --version
snuba, version 22.9.0
onewland commented 1 year ago

That looks up to date enough that you shouldn't be seeing that error message in your logs (snuba 22.9.0 will not try to write _tags_flattened to the transactions_local table).

Since you can run snuba --version in the pod, can you try running the following command and see if you get a different result or better diagnostic logs?

snuba consumer --storage=transactions\
    --auto-offset-reset=latest\
    --consumer-group={your consumer group}\
    --log-level=debug

{your consumer group} is probably set somewhere in the helm chart; look for a pod definition whose command line has the arguments snuba consumer --storage=transactions [...other arguments]

xyfleet commented 1 year ago

@onewland I finally got a chance to rebuild Sentry, so I just installed the latest Sentry to get past this Snuba issue.

hyunkj42 commented 1 year ago

@onewland Hi, I am using snuba 22.11.0, but the above error still occurs.

snuba.clickhouse.errors.ClickhouseWriterError: No such column _tags_flattened in table transactions_local: Insertion status:
maxboone commented 1 year ago

Update

So, the strangest thing just happened. I added some lines to see what was actually happening / what the writer was dying on, by doing the following in consumer.py:

x = itertools.chain.from_iterable(message.payload.rows for message in self.__messages)
print(list(x))
self.__writer.write(x)

This dumped a bunch of text into my terminal, and then it didn't throw. When I remove the print(list(x)), it dies after a couple of seconds complaining; when I leave it in, it works like a charm. Almost as if forcing the iterator to yield before feeding it to the clickhouse batch writer makes it work:

Full log when printing is enabled
snuba@njord-...-consumer-...:/usr/src/snuba$ snuba consumer --storage transactions --consumer-group transactions_group --log-level debug
2023-02-27 16:46:55,151 Consumer Starting
2023-02-27 16:46:55,151 librdkafka log level: 7
2023-02-27 16:46:55,153 Starting
2023-02-27 16:46:55,158 New partitions assigned: {Partition(topic=Topic(name='transactions'), index=0): 595}
2023-02-27 16:46:55,159 Initialized processing strategy: 
2023-02-27 16:46:57,163 Time limit reached, closing ...
[b'{"deleted":0,"retention_days":90,"event_id":"REDACTED GUID","project_id":1,...REDACTED FOR PRIVACY...}']
2023-02-27 16:46:57,164 Starting new HTTP connection (1): njord-sentry-clickhouse:8123
2023-02-27 16:46:57,165 Finished sending data from .
2023-02-27 16:46:57,166 http://njord-sentry-clickhouse:8123 "POST /?load_balancing=in_order&insert_distributed_sync=1&insert_allow_materialized_columns=1&input_format_skip_unknown_fields=1&query=INSERT+INTO+default.transactions_dist++FORMAT+JSONEachRow HTTP/1.1" 200 None
2023-02-27 16:46:57,167 Received response for .
2023-02-27 16:46:57,167 Waited 0.0034 seconds for 6 rows to be written to .
2023-02-27 16:46:57,167 Committing offsets: {Partition(topic=Topic(name='transactions'), index=0): Position(offset=601, timestamp=datetime.datetime(2023, 2, 27, 16, 46, 42, 926000))}
2023-02-27 16:46:57,171 Waited 0.0036 seconds for offsets to be committed to .
2023-02-27 16:46:57,171 Completed processing .
2023-02-27 16:47:42,653 Time limit reached, closing ...
[b'{"deleted":0,"retention_days":90,"event_id":"REDACTED GUID","project_id":1,...REDACTED FOR PRIVACY...}']
2023-02-27 16:47:42,654 Resetting dropped connection: njord-sentry-clickhouse
2023-02-27 16:47:42,655 Finished sending data from .
2023-02-27 16:47:42,656 http://njord-sentry-clickhouse:8123 "POST /?load_balancing=in_order&insert_distributed_sync=1&insert_allow_materialized_columns=1&input_format_skip_unknown_fields=1&query=INSERT+INTO+default.transactions_dist++FORMAT+JSONEachRow HTTP/1.1" 200 None
2023-02-27 16:47:42,657 Received response for .
2023-02-27 16:47:42,657 Waited 0.0036 seconds for 5 rows to be written to .
2023-02-27 16:47:42,657 Committing offsets: {Partition(topic=Topic(name='transactions'), index=0): Position(offset=606, timestamp=datetime.datetime(2023, 2, 27, 16, 47, 40, 649000))}
2023-02-27 16:47:42,662 Waited 0.0044 seconds for offsets to be committed to .
2023-02-27 16:47:42,662 Completed processing .
2023-02-27 16:47:44,927 Time limit reached, closing ...
[b'{"deleted":0,"retention_days":90,"event_id":"REDACTED GUID","project_id":1,...REDACTED FOR PRIVACY...}']
2023-02-27 16:47:44,928 Finished sending data from .
2023-02-27 16:47:44,929 http://njord-sentry-clickhouse:8123 "POST /?load_balancing=in_order&insert_distributed_sync=1&insert_allow_materialized_columns=1&input_format_skip_unknown_fields=1&query=INSERT+INTO+default.transactions_dist++FORMAT+JSONEachRow HTTP/1.1" 200 None
2023-02-27 16:47:44,930 Received response for .
2023-02-27 16:47:44,930 Waited 0.0025 seconds for 3 rows to be written to .
2023-02-27 16:47:44,930 Committing offsets: {Partition(topic=Topic(name='transactions'), index=0): Position(offset=609, timestamp=datetime.datetime(2023, 2, 27, 16, 47, 42, 926000))}
2023-02-27 16:47:44,931 Waited 0.0011 seconds for offsets to be committed to .
2023-02-27 16:47:44,931 Completed processing .
2023-02-27 16:48:44,938 Time limit reached, closing ...
[b'{"deleted":0,"retention_days":90,"event_id":"REDACTED GUID","project_id":1,...REDACTED FOR PRIVACY...}']
2023-02-27 16:48:44,939 Resetting dropped connection: njord-sentry-clickhouse
2023-02-27 16:48:44,940 Finished sending data from .
2023-02-27 16:48:44,941 http://njord-sentry-clickhouse:8123 "POST /?load_balancing=in_order&insert_distributed_sync=1&insert_allow_materialized_columns=1&input_format_skip_unknown_fields=1&query=INSERT+INTO+default.transactions_dist++FORMAT+JSONEachRow HTTP/1.1" 200 None
2023-02-27 16:48:44,942 Received response for .
2023-02-27 16:48:44,942 Waited 0.0032 seconds for 1 rows to be written to .
2023-02-27 16:48:44,942 Committing offsets: {Partition(topic=Topic(name='transactions'), index=0): Position(offset=610, timestamp=datetime.datetime(2023, 2, 27, 16, 48, 42, 936000))}
2023-02-27 16:48:44,943 Waited 0.0014 seconds for offsets to be committed to .
2023-02-27 16:48:44,943 Completed processing .
2023-02-27 16:49:11,405 Time limit reached, closing ...
[b'{"deleted":0,"retention_days":90,"event_id":"REDACTED GUID","project_id":1,...REDACTED FOR PRIVACY...}']
2023-02-27 16:49:11,406 Resetting dropped connection: njord-sentry-clickhouse
2023-02-27 16:49:11,407 Finished sending data from .
2023-02-27 16:49:11,408 http://njord-sentry-clickhouse:8123 "POST /?load_balancing=in_order&insert_distributed_sync=1&insert_allow_materialized_columns=1&input_format_skip_unknown_fields=1&query=INSERT+INTO+default.transactions_dist++FORMAT+JSONEachRow HTTP/1.1" 200 None
2023-02-27 16:49:11,408 Received response for .
2023-02-27 16:49:11,409 Waited 0.0028 seconds for 1 rows to be written to .
2023-02-27 16:49:11,409 Committing offsets: {Partition(topic=Topic(name='transactions'), index=0): Position(offset=611, timestamp=datetime.datetime(2023, 2, 27, 16, 49, 9, 398000))}
2023-02-27 16:49:11,410 Waited 0.0010 seconds for offsets to be committed to .
2023-02-27 16:49:11,410 Completed processing .
2023-02-27 16:49:44,944 Time limit reached, closing ...
[b'{"deleted":0,"retention_days":90,"event_id":"REDACTED GUID","project_id":1,...REDACTED FOR PRIVACY...}']
2023-02-27 16:49:44,945 Resetting dropped connection: njord-sentry-clickhouse
2023-02-27 16:49:44,946 Finished sending data from .
2023-02-27 16:49:44,947 http://njord-sentry-clickhouse:8123 "POST /?load_balancing=in_order&insert_distributed_sync=1&insert_allow_materialized_columns=1&input_format_skip_unknown_fields=1&query=INSERT+INTO+default.transactions_dist++FORMAT+JSONEachRow HTTP/1.1" 200 None
2023-02-27 16:49:44,948 Received response for .
2023-02-27 16:49:44,948 Waited 0.0031 seconds for 1 rows to be written to .
2023-02-27 16:49:44,948 Committing offsets: {Partition(topic=Topic(name='transactions'), index=0): Position(offset=612, timestamp=datetime.datetime(2023, 2, 27, 16, 49, 42, 943000))}
2023-02-27 16:49:44,949 Waited 0.0014 seconds for offsets to be committed to .
2023-02-27 16:49:44,950 Completed processing .
2023-02-27 16:50:44,920 Time limit reached, closing ...
[b'{"deleted":0,"retention_days":90,"event_id":"REDACTED GUID","project_id":1,...REDACTED FOR PRIVACY...}']
2023-02-27 16:50:44,921 Resetting dropped connection: njord-sentry-clickhouse
2023-02-27 16:50:44,921 Finished sending data from .
2023-02-27 16:50:44,922 http://njord-sentry-clickhouse:8123 "POST /?load_balancing=in_order&insert_distributed_sync=1&insert_allow_materialized_columns=1&input_format_skip_unknown_fields=1&query=INSERT+INTO+default.transactions_dist++FORMAT+JSONEachRow HTTP/1.1" 200 None
2023-02-27 16:50:44,922 Received response for .
2023-02-27 16:50:44,922 Waited 0.0020 seconds for 1 rows to be written to .
2023-02-27 16:50:44,922 Committing offsets: {Partition(topic=Topic(name='transactions'), index=0): Position(offset=613, timestamp=datetime.datetime(2023, 2, 27, 16, 50, 42, 918000))}
2023-02-27 16:50:44,924 Waited 0.0012 seconds for offsets to be committed to .
2023-02-27 16:50:44,924 Completed processing .
2023-02-27 16:51:44,954 Time limit reached, closing ...
[b'{"deleted":0,"retention_days":90,"event_id":"REDACTED GUID","project_id":1,...REDACTED FOR PRIVACY...}']
2023-02-27 16:51:44,955 Resetting dropped connection: njord-sentry-clickhouse
2023-02-27 16:51:44,956 Finished sending data from .
2023-02-27 16:51:44,958 http://njord-sentry-clickhouse:8123 "POST /?load_balancing=in_order&insert_distributed_sync=1&insert_allow_materialized_columns=1&input_format_skip_unknown_fields=1&query=INSERT+INTO+default.transactions_dist++FORMAT+JSONEachRow HTTP/1.1" 200 None
2023-02-27 16:51:44,958 Received response for .
2023-02-27 16:51:44,958 Waited 0.0035 seconds for 1 rows to be written to .
2023-02-27 16:51:44,958 Committing offsets: {Partition(topic=Topic(name='transactions'), index=0): Position(offset=614, timestamp=datetime.datetime(2023, 2, 27, 16, 51, 42, 953000))}
2023-02-27 16:51:44,960 Waited 0.0014 seconds for offsets to be committed to .
2023-02-27 16:51:44,960 Completed processing .
2023-02-27 16:52:11,367 Time limit reached, closing ...
[b'{"deleted":0,"retention_days":90,"event_id":"REDACTED GUID","project_id":1,...REDACTED FOR PRIVACY...}']
2023-02-27 16:52:11,368 Resetting dropped connection: njord-sentry-clickhouse
2023-02-27 16:52:11,369 Finished sending data from .
2023-02-27 16:52:11,370 http://njord-sentry-clickhouse:8123 "POST /?load_balancing=in_order&insert_distributed_sync=1&insert_allow_materialized_columns=1&input_format_skip_unknown_fields=1&query=INSERT+INTO+default.transactions_dist++FORMAT+JSONEachRow HTTP/1.1" 200 None
2023-02-27 16:52:11,370 Received response for .
2023-02-27 16:52:11,370 Waited 0.0030 seconds for 1 rows to be written to .
2023-02-27 16:52:11,370 Committing offsets: {Partition(topic=Topic(name='transactions'), index=0): Position(offset=615, timestamp=datetime.datetime(2023, 2, 27, 16, 52, 9, 365000))}
2023-02-27 16:52:11,372 Waited 0.0013 seconds for offsets to be committed to .
2023-02-27 16:52:11,372 Completed processing .

The same happens when I add a print(list(values)) to http.py before this line. I am baffled, because just doing list(values) doesn't work, but adding str(list(values)) without the print does. Just serializing the bytes to a string (e.g. in the append method) does not seem to yield the desired effect.


Running the command manually with the debug option (in snuba, version 22.10.0) results in the same trace:

Error that's also reported above
snuba@njord-...-consumer-...-...:/etc/snuba$ snuba consumer --storage transactions --consumer-group transactions_group --log-level debug
2023-02-27 16:17:57,638 Consumer Starting
2023-02-27 16:17:57,638 librdkafka log level: 7
2023-02-27 16:17:57,639 Starting
2023-02-27 16:17:57,645 New partitions assigned: {Partition(topic=Topic(name='transactions'), index=0): 0}
2023-02-27 16:17:57,645 Initialized processing strategy: 
2023-02-27 16:17:59,670 Time limit reached, closing ...
2023-02-27 16:17:59,671 Starting new HTTP connection (1): njord-sentry-clickhouse:8123
2023-02-27 16:17:59,673 Finished sending data from .
2023-02-27 16:17:59,692 http://njord-sentry-clickhouse:8123 "POST /?load_balancing=in_order&insert_distributed_sync=1&insert_allow_materialized_columns=1&input_format_skip_unknown_fields=1&query=INSERT+INTO+default.transactions_dist++FORMAT+JSONEachRow HTTP/1.1" 500 None
2023-02-27 16:17:59,692 Received response for .
2023-02-27 16:18:44,892 Time limit reached, closing ...
2023-02-27 16:18:44,892 Caught exception, shutting down...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 150, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 185, in _run_once
    self.__processing_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/transform.py", line 63, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/collect.py", line 135, in poll
    self.close_and_reset_batch()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/collect.py", line 201, in close_and_reset_batch
    self.future.result(timeout=self.wait_timeout)
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/collect.py", line 213, in __finish_batch
    batch.close()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/collect.py", line 66, in close
    self.__step.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 291, in close
    self.__insert_batch_writer.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 107, in close
    self.__writer.write(
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 274, in write
    batch.join()
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 222, in join
    raise ClickhouseWriterError(message, code=code, row=row)
snuba.clickhouse.errors.ClickhouseWriterError: Received from njord-sentry-clickhouse-2.njord-sentry-clickhouse-headless.njord-sentry.svc.k8s-ksrv-nl:9000. DB::Exception: No such column _contexts_flattened in table default.transactions_local. (version 20.8.19.4 (official build))
2023-02-27 16:18:44,895 Terminating ...
2023-02-27 16:18:44,895 Terminating ...
2023-02-27 16:18:44,896 Terminating ...
2023-02-27 16:18:44,896 Closing ...
2023-02-27 16:18:44,896 Partitions revoked: [Partition(topic=Topic(name='transactions'), index=0)]
2023-02-27 16:18:44,898 Processor terminated
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in 
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/consumer.py", line 190, in consumer
    consumer.run()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 150, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 185, in _run_once
    self.__processing_strategy.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/transform.py", line 63, in poll
    self.__next_step.poll()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/collect.py", line 135, in poll
    self.close_and_reset_batch()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/collect.py", line 201, in close_and_reset_batch
    self.future.result(timeout=self.wait_timeout)
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/collect.py", line 213, in __finish_batch
    batch.close()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/strategies/collect.py", line 66, in close
    self.__step.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 291, in close
    self.__insert_batch_writer.close()
  File "/usr/src/snuba/snuba/consumers/consumer.py", line 107, in close
    self.__writer.write(
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 274, in write
    batch.join()
  File "/usr/src/snuba/snuba/clickhouse/http.py", line 222, in join
    raise ClickhouseWriterError(message, code=code, row=row)
snuba.clickhouse.errors.ClickhouseWriterError: Received from njord-sentry-clickhouse-2.njord-sentry-clickhouse-headless.njord-sentry.svc.k8s-ksrv-nl:9000. DB::Exception: No such column _contexts_flattened in table default.transactions_local. (version 20.8.19.4 (official build))

Weirdly enough, it looks like it is not snuba doing this, so the attempt to insert that column must be coming from somewhere else:

snuba@njord-...-consumer-...-...:/usr/src/snuba$ grep -nrw . -e '_contexts_flattened'
./snuba/snuba_migrations/events/0011_rebuild_errors.py:111:    - No _tags_flattened and _contexts_flattened since these are already unused
./snuba/snuba_migrations/events/0003_errors.py:47:    Column("_contexts_flattened", String()),
./snuba/snuba_migrations/transactions/0001_transactions.py:52:    Column("_contexts_flattened", String()),
./snuba/snuba_migrations/transactions/0005_transactions_add_measurements.py:30:                after="_contexts_flattened",
./snuba/snuba_migrations/transactions/0005_transactions_add_measurements.py:55:                after="_contexts_flattened",
./snuba/snuba_migrations/transactions/0014_transactions_remove_flattened_columns.py:22:                column_name="_contexts_flattened",
./snuba/snuba_migrations/transactions/0014_transactions_remove_flattened_columns.py:31:                column=Column("_contexts_flattened", String()),
./snuba/snuba_migrations/transactions/0014_transactions_remove_flattened_columns.py:50:                column_name="_contexts_flattened",
./snuba/snuba_migrations/transactions/0014_transactions_remove_flattened_columns.py:59:                column=Column("_contexts_flattened", String()),
./snuba/snuba_migrations/transactions/0003_transactions_onpremise_fix_columns.py:62:                column=Column("_contexts_flattened", String()),
./tests/migrations/test_runner_individual.py:81:        `contexts.key` Array(String), `contexts.value` Array(String), `_contexts_flattened` String,
maxboone commented 1 year ago

So, I decided to take another look after dinner. When the iterator values are printed (and converted to a string) before going to the HTTPWriteBatch append function, they never get sent to Clickhouse. Hence why it doesn't crash.
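
This is consistent with plain Python iterator exhaustion; a minimal standalone demonstration:

import itertools

rows = [[b'{"a":1}'], [b'{"b":2}']]
x = itertools.chain.from_iterable(rows)
print(list(x))  # consumes the iterator: prints [b'{"a":1}', b'{"b":2}']
print(list(x))  # [] -- a subsequent writer.write(x) receives no rows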

Continuing the investigation: since I can't find any mention of a _local table in the HTTP request being sent to clickhouse, I looked at the logging on that side:

DB::Exception: No such column _contexts_flattened in table default.transactions_local
2023.02.27 20:50:45.361815 [ 2296 ] {40bb7f1b-fbb1-4a2f-9074-62c222b5c7cd}  DistributedBlockOutputStream: default.transactions_dist: column transaction_hash will be removed, because it is MATERIALIZED
2023.02.27 20:50:45.361834 [ 2296 ] {40bb7f1b-fbb1-4a2f-9074-62c222b5c7cd}  DistributedBlockOutputStream: default.transactions_dist: column user_hash will be removed, because it is MATERIALIZED
2023.02.27 20:50:45.361840 [ 2296 ] {40bb7f1b-fbb1-4a2f-9074-62c222b5c7cd}  DistributedBlockOutputStream: default.transactions_dist: column _tags_hash_map will be removed, because it is MATERIALIZED
2023.02.27 20:50:45.361845 [ 2296 ] {40bb7f1b-fbb1-4a2f-9074-62c222b5c7cd}  DistributedBlockOutputStream: default.transactions_dist: column type will be removed, because it is MATERIALIZED
2023.02.27 20:50:45.361850 [ 2296 ] {40bb7f1b-fbb1-4a2f-9074-62c222b5c7cd}  DistributedBlockOutputStream: default.transactions_dist: column message will be removed, because it is MATERIALIZED
2023.02.27 20:50:45.361855 [ 2296 ] {40bb7f1b-fbb1-4a2f-9074-62c222b5c7cd}  DistributedBlockOutputStream: default.transactions_dist: column title will be removed, because it is MATERIALIZED
2023.02.27 20:50:45.361861 [ 2296 ] {40bb7f1b-fbb1-4a2f-9074-62c222b5c7cd}  DistributedBlockOutputStream: default.transactions_dist: column timestamp will be removed, because it is MATERIALIZED
2023.02.27 20:50:45.362315 [ 125 ] {d4695e6e-9adb-48c5-a336-16f8631b6cec}  ContextAccess (default): Access granted: SELECT(org_id, project_id, key_id, timestamp, category, outcome, reason, quantity) ON default.outcomes_raw_local
2023.02.27 20:50:45.362374 [ 125 ] {d4695e6e-9adb-48c5-a336-16f8631b6cec}  InterpreterSelectQuery: FetchColumns -> Complete
2023.02.27 20:50:45.362843 [ 125 ] {d4695e6e-9adb-48c5-a336-16f8631b6cec}  AggregatingTransform: Aggregating
2023.02.27 20:50:45.362869 [ 125 ] {d4695e6e-9adb-48c5-a336-16f8631b6cec}  Aggregator: Aggregation method: serialized
2023.02.27 20:50:45.362894 [ 125 ] {d4695e6e-9adb-48c5-a336-16f8631b6cec}  AggregatingTransform: Aggregated. 0 to 0 rows (from 22.00 B) in 0.000215536 sec. (0.0 rows/sec., 99.68 KiB/sec.)
2023.02.27 20:50:45.362902 [ 125 ] {d4695e6e-9adb-48c5-a336-16f8631b6cec}  Aggregator: Merging aggregated data
2023.02.27 20:50:45.364159 [ 125 ] {d4695e6e-9adb-48c5-a336-16f8631b6cec}  MemoryTracker: Peak memory usage (for query): 0.00 B.
2023.02.27 20:50:45.364375 [ 125 ] {}  TCPHandler: Processed in 0.004229266 sec.
2023.02.27 20:50:45.364856 [ 2296 ] {40bb7f1b-fbb1-4a2f-9074-62c222b5c7cd}  executeQuery: Code: 16, e.displayText() = DB::Exception: No such column _contexts_flattened in table default.transactions_local: Insertion status:
Wrote 0 blocks and 0 rows on shard 0 replica 0, njord-sentry-clickhouse-0.njord-sentry-clickhouse-headless.njord-sentry.svc.k8s-ksrv-nl:9000 (average 0 ms per block)
Wrote 0 blocks and 0 rows on shard 1 replica 0, njord-sentry-clickhouse-1.njord-sentry-clickhouse-headless.njord-sentry.svc.k8s-ksrv-nl:9000 (average 2 ms per block)
Wrote 0 blocks and 0 rows on shard 2 replica 0, njord-sentry-clickhouse-2.njord-sentry-clickhouse-headless.njord-sentry.svc.k8s-ksrv-nl:9000 (average 1 ms per block)
 (version 20.8.19.4 (official build)) (from 10.42.0.21:44286) (in query: INSERT INTO default.transactions_dist FORMAT JSONEachRow ), Stack trace (when copying this message, always include the lines below):

0. Poco::Exception::Exception(std::__1::basic_string, std::__1::allocator > const&, int) @ 0x18cefaa0 in /usr/bin/clickhouse
1. DB::Exception::Exception(std::__1::basic_string, std::__1::allocator > const&, int) @ 0xe63b33d in /usr/bin/clickhouse
2. ? @ 0x15982fe8 in /usr/bin/clickhouse
3. DB::InterpreterInsertQuery::execute() @ 0x1597e78f in /usr/bin/clickhouse
4. ? @ 0x160a5df8 in /usr/bin/clickhouse
5. ThreadPoolImpl::worker(std::__1::__list_iterator) @ 0xe6696b7 in /usr/bin/clickhouse
6. ThreadFromGlobalPool::ThreadFromGlobalPool::scheduleImpl(std::__1::function, int, std::__1::optional)::'lambda1'()>(void&&, void ThreadPoolImpl::scheduleImpl(std::__1::function, int, std::__1::optional)::'lambda1'()&&...)::'lambda'()::operator()() const @ 0xe669e2a in /usr/bin/clickhouse
7. ThreadPoolImpl::worker(std::__1::__list_iterator) @ 0xe668bc7 in /usr/bin/clickhouse
8. ? @ 0xe667093 in /usr/bin/clickhouse
9. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
10. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so

2023.02.27 20:50:45.365418 [ 2296 ] {40bb7f1b-fbb1-4a2f-9074-62c222b5c7cd}  DynamicQueryHandler: Code: 16, e.displayText() = DB::Exception: No such column _contexts_flattened in table default.transactions_local: Insertion status:
Wrote 0 blocks and 0 rows on shard 0 replica 0, njord-sentry-clickhouse-0.njord-sentry-clickhouse-headless.njord-sentry.svc.k8s-ksrv-nl:9000 (average 0 ms per block)
Wrote 0 blocks and 0 rows on shard 1 replica 0, njord-sentry-clickhouse-1.njord-sentry-clickhouse-headless.njord-sentry.svc.k8s-ksrv-nl:9000 (average 2 ms per block)
Wrote 0 blocks and 0 rows on shard 2 replica 0, njord-sentry-clickhouse-2.njord-sentry-clickhouse-headless.njord-sentry.svc.k8s-ksrv-nl:9000 (average 1 ms per block)
, Stack trace (when copying this message, always include the lines below):

0. Poco::Exception::Exception(std::__1::basic_string, std::__1::allocator > const&, int) @ 0x18cefaa0 in /usr/bin/clickhouse
1. DB::Exception::Exception(std::__1::basic_string, std::__1::allocator > const&, int) @ 0xe63b33d in /usr/bin/clickhouse
2. ? @ 0x15982fe8 in /usr/bin/clickhouse
3. DB::InterpreterInsertQuery::execute() @ 0x1597e78f in /usr/bin/clickhouse
4. ? @ 0x160a5df8 in /usr/bin/clickhouse
5. ThreadPoolImpl::worker(std::__1::__list_iterator) @ 0xe6696b7 in /usr/bin/clickhouse
6. ThreadFromGlobalPool::ThreadFromGlobalPool::scheduleImpl(std::__1::function, int, std::__1::optional)::'lambda1'()>(void&&, void ThreadPoolImpl::scheduleImpl(std::__1::function, int, std::__1::optional)::'lambda1'()&&...)::'lambda'()::operator()() const @ 0xe669e2a in /usr/bin/clickhouse
7. ThreadPoolImpl::worker(std::__1::__list_iterator) @ 0xe668bc7 in /usr/bin/clickhouse
8. ? @ 0xe667093 in /usr/bin/clickhouse
9. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
10. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
 (version 20.8.19.4 (official build))
2023.02.27 20:50:45.365600 [ 2296 ] {40bb7f1b-fbb1-4a2f-9074-62c222b5c7cd}  MemoryTracker: Peak memory usage (for query): 5.72 KiB.
2023.02.27 20:50:45.365620 [ 2296 ] {}  MemoryTracker: Peak memory usage (for query): 5.72 KiB.
2023.02.27 20:50:46.495533 [ 71 ] {}  SystemLog (system.metric_log): Flushing system log, 8 entries to flush
2023.02.27 20:50:46.502740 [ 71 ] {}  DiskLocal: Reserving 1.00 MiB on disk `default`, having unreserved 25.57 GiB.
2023.02.27 20:50:46.503795 [ 71 ] {}  system.metric_log (ca72620d-4215-4e17-a9fd-5d28a8ad103f): Renaming temporary part tmp_insert_202302_1037797_1037797_0 to 202302_4885183_4885183_0.
2023.02.27 20:50:46.504049 [ 71 ] {}  SystemLog (system.metric_log): Flushed system log
2023.02.27 20:50:49.276921 [ 2266 ] {}  TCPHandlerFactory: TCP Request. Address: 65.108.41.148:55108
2023.02.27 20:50:49.276930 [ 2360 ] {}  TCPHandlerFactory: TCP Request. Address: 65.108.41.148:55106
2023.02.27 20:50:49.277024 [ 2266 ] {}  TCPHandler: Client has not sent any data.
2023.02.27 20:50:49.277051 [ 2360 ] {}  TCPHandler: Client has not sent any data.
2023.02.27 20:50:49.277079 [ 2266 ] {}  TCPHandler: Done processing connection.
2023.02.27 20:50:49.277103 [ 2360 ] {}  TCPHandler: Done processing connection.

In Dutch there is a saying, "al doende leert men" ("by doing, one learns"). I had no idea how distributed tables in ClickHouse work, but I do see the local table referenced in the ENGINE clause, so it looks like the issue here is that the distributed table's schema no longer matches the local table's?
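
One way to see the mismatch directly, instead of eyeballing two SHOW CREATE TABLE dumps, is to diff the schemas in system.columns. A minimal sketch, assuming the default database and the table names from the error above, and that both tables exist on the node you are connected to (on a multi-node cluster, repeat per node):

-- columns the Distributed table still carries that the local table no longer has,
-- e.g. the stale _tags_flattened / _contexts_flattened from the insert error
SELECT name, type
FROM system.columns
WHERE database = 'default'
  AND table = 'transactions_dist'
  AND name NOT IN
  (
      SELECT name
      FROM system.columns
      WHERE database = 'default'
        AND table = 'transactions_local'
  );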

ClickHouse: SHOW CREATE TABLE default.transactions_dist
njord-sentry-clickhouse :) SHOW CREATE TABLE default.transactions_dist

SHOW CREATE TABLE default.transactions_dist

┌─statement──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE default.transactions_dist
(
    `project_id` UInt64,
    `event_id` UUID,
    `trace_id` Nullable(UUID),
    `span_id` UInt64,
    `transaction_name` LowCardinality(String),
    `transaction_hash` UInt64 MATERIALIZED cityHash64(transaction_name),
    `transaction_op` LowCardinality(String),
    `transaction_status` UInt8 DEFAULT 2,
    `start_ts` DateTime,
    `start_ms` UInt16,
    `finish_ts` DateTime,
    `finish_ms` UInt16,
    `duration` UInt32,
    `platform` LowCardinality(String),
    `environment` LowCardinality(Nullable(String)),
    `release` LowCardinality(Nullable(String)),
    `dist` LowCardinality(Nullable(String)),
    `ip_address_v4` Nullable(IPv4),
    `ip_address_v6` Nullable(IPv6),
    `user` String DEFAULT '',
    `user_hash` UInt64 MATERIALIZED cityHash64(user),
    `user_id` Nullable(String),
    `user_name` Nullable(String),
    `user_email` Nullable(String),
    `sdk_name` LowCardinality(String) DEFAULT '',
    `sdk_version` LowCardinality(String) DEFAULT '',
    `http_method` LowCardinality(Nullable(String)),
    `http_referer` Nullable(String),
    `tags.key` Array(String),
    `tags.value` Array(String),
    `_tags_hash_map` Array(UInt64) MATERIALIZED arrayMap((k, v) -> cityHash64(concat(replaceRegexpAll(k, '(\\=|\\\\)', '\\\\\\1'), '=', v)), tags.key, tags.value),
    `contexts.key` Array(String),
    `contexts.value` Array(String),
    `measurements.key` Array(LowCardinality(String)),
    `measurements.value` Array(Float64),
    `span_op_breakdowns.key` Array(LowCardinality(String)),
    `span_op_breakdowns.value` Array(Float64),
    `partition` UInt16,
    `offset` UInt64,
    `message_timestamp` DateTime,
    `retention_days` UInt16,
    `deleted` UInt8,
    `type` LowCardinality(String) MATERIALIZED 'transaction',
    `message` String MATERIALIZED transaction_name,
    `title` String MATERIALIZED transaction_name,
    `timestamp` DateTime MATERIALIZED finish_ts,
    `_contexts_flattened` String,
    `_tags_flattened` String
)
ENGINE = Distributed('njord-sentry-clickhouse', 'default', 'transactions_local', rand()) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.002 sec. 
maxboone commented 1 year ago

So, after manually going over each column, the structure of transactions_local turned out to be fine, but the migrations had not been properly applied to the dist tables, probably because of how the Helm chart ran the upgrades. In my case, none of the dist tables had been upgraded past add_spans, so I had to run the following migrations manually (thanks a ton for the CI that generates the SQL statements in the PRs / MRs):

ALTER TABLE transactions_dist ADD COLUMN IF NOT EXISTS spans Nested(op LowCardinality(String), group UInt64, exclusive_time Float64) AFTER span_op_breakdowns.value;
ALTER TABLE transactions_dist ADD COLUMN IF NOT EXISTS spans.exclusive_time_32 Array(Float32) AFTER spans.group;
ALTER TABLE transactions_dist DROP COLUMN IF EXISTS _tags_flattened;
ALTER TABLE transactions_dist DROP COLUMN IF EXISTS _contexts_flattened;
ALTER TABLE transactions_dist ADD COLUMN IF NOT EXISTS transaction_source LowCardinality(String) AFTER title;
ALTER TABLE transactions_dist ADD COLUMN IF NOT EXISTS group_ids Array(UInt64) AFTER timestamp;
ALTER TABLE transactions_dist ADD COLUMN IF NOT EXISTS app_start_type LowCardinality(String) AFTER group_ids;
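
One caveat with the manual route, sketched here rather than taken from the thread: a plain ALTER only changes the table on the node you are connected to. In a sharded setup like the three-shard one in the logs above, every node that has its own copy of transactions_dist needs the same DDL, and ON CLUSTER can fan it out in one statement, assuming distributed DDL (ZooKeeper) is configured and 'njord-sentry-clickhouse' is the cluster name from the ENGINE clause shown earlier:

-- assumption: the cluster name matches the Distributed ENGINE definition above
ALTER TABLE default.transactions_dist ON CLUSTER 'njord-sentry-clickhouse'
    DROP COLUMN IF EXISTS _tags_flattened;
ALTER TABLE default.transactions_dist ON CLUSTER 'njord-sentry-clickhouse'
    DROP COLUMN IF EXISTS _contexts_flattened;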

After running those migrations, the consumer is showing healthy 200s again.
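
For anyone hitting the same thing: before reaching for manual ALTERs, it may be worth checking what Snuba itself thinks has been applied. Snuba records migration status in a ClickHouse table of its own (and `snuba migrations list` prints the same information from the CLI, if I remember correctly). A hedged sketch against that table, assuming the migrations_local name and columns recent Snuba versions use; DESCRIBE the table first if this errors:

-- assumption: Snuba's bookkeeping table is default.migrations_local with
-- group / migration_id / status / timestamp columns; `group` is quoted
-- because it collides with a keyword
SELECT `group`, migration_id, status, timestamp
FROM migrations_local
WHERE `group` = 'transactions'
ORDER BY timestamp DESC;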