SigNoz / signoz

SigNoz is an open-source observability platform native to OpenTelemetry with logs, traces and metrics in a single application. An open-source alternative to DataDog, NewRelic, etc. πŸ”₯ πŸ–₯. πŸ‘‰ Open source Application Performance Monitoring (APM) & Observability tool
https://signoz.io
Other
18.45k stars 1.17k forks source link

Significant number of Clickhouse exceptions #4931

Open byronwall opened 4 months ago

byronwall commented 4 months ago

Bug description

The signoz-clickhouse container is filled with tons of logs related to failed background tasks. Most errors appear to include the same language (example below). The main problem for me is that these logs or exceptions appear to take significant disk space. If it's possible to reduce their disk impact, I wouldn't care.

Important part of exceptions is probably: Code: 117. DB::Exception: Invalid flag for negative store: [...] in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 286)

Expected behavior

Exceptions should probably not happen as frequently. If they do, they should not fill the disk?

How to reproduce

  1. Nothing in particular, I am just running the docker containers.
  2. I started in version 0.35 or so and have been migrating as updates come out.

Version information

Additional context

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

I added line breaks so you don't have to scroll this:

2024.04.29 12:30:55.663013 [ 634 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task 
{2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_9019_9082_3}: Code: 117. DB::Exception: Invalid flag for negative store: 
(while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_9019_9057_2/ in 
table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 
with max_rows_to_read = 286): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying 
this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000c800f1b in /usr/bin/clickhouse
1. DB::Exception::Exception<>(int, FormatStringHelperImpl<>) @ 0x000000000721a243 in /usr/bin/clickhouse
2. DB::DDSketchDenseLogarithmic::deserialize(DB::ReadBuffer&) @ 0x000000000d5260e0 in /usr/bin/clickhouse
3. DB::SerializationAggregateFunction::deserializeBinaryBulk(DB::IColumn&, DB::ReadBuffer&, unsigned long, double) const
 @ 0x00000000108b58e9 in /usr/bin/clickhouse
4. DB::ISerialization::deserializeBinaryBulkWithMultipleStreams(COW<DB::IColumn>::immutable_ptr<DB::IColumn>&, unsigned long, DB::ISerialization::DeserializeBinaryBulkSettings&, std::shared_ptr<DB::ISerialization::DeserializeBinaryBulkState>&, std::unordered_map<String, COW<DB::IColumn>::immutable_ptr<DB::IColumn>, std::hash<String>, std::equal_to<String>, std::allocator<std::pair<String const, COW<DB::IColumn>::immutable_ptr<DB::IColumn>>>>*) const @ 0x00000000108b01d9 in /usr/bin/clickhouse
5. DB::MergeTreeReaderCompact::readRows(unsigned long, unsigned long, bool, unsigned long, std::vector<COW<DB::IColumn>::immutable_ptr<DB::IColumn>, std::allocator<COW<DB::IColumn>::immutable_ptr<DB::IColumn>>>&) @ 0x000000001250a6b9 in /usr/bin/clickhouse
6. DB::MergeTreeSequentialSource::generate() @ 0x000000001251ca4d in /usr/bin/clickhouse
7. DB::ISource::tryGenerate() @ 0x000000001297acf5 in /usr/bin/clickhouse
8. DB::ISource::work() @ 0x000000001297a743 in /usr/bin/clickhouse
9. DB::ExecutionThreadContext::executeTask() @ 0x000000001299371a in /usr/bin/clickhouse
10. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x000000001298a170 in /usr/bin/clickhouse
11. DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x0000000012989928 in /usr/bin/clickhouse
12. DB::PullingPipelineExecutor::pull(DB::Chunk&) @ 0x0000000012998017 in /usr/bin/clickhouse
13. DB::PullingPipelineExecutor::pull(DB::Block&) @ 0x00000000129981d3 in /usr/bin/clickhouse
14. DB::MergeTask::ExecuteAndFinalizeHorizontalPart::executeImpl() @ 0x000000001233b6f2 in /usr/bin/clickhouse
15. DB::MergeTask::ExecuteAndFinalizeHorizontalPart::execute() @ 0x000000001233b64b in /usr/bin/clickhouse
16. DB::MergeTask::execute() @ 0x0000000012340d99 in /usr/bin/clickhouse
17. DB::MergePlainMergeTreeTask::executeStep() @ 0x0000000012723517 in /usr/bin/clickhouse
18. DB::MergeTreeBackgroundExecutor<DB::DynamicRuntimeQueue>::threadFunction() @ 0x00000000123532c4 in /usr/bin/clickhouse
19. ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>) @ 0x000000000c8eb0c1 in /usr/bin/clickhouse
20. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000000c8ee8fa in /usr/bin/clickhouse
21. void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void ThreadPoolImpl<std::thread>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>>(void*) @ 0x000000000c8ed6fe in /usr/bin/clickhouse
22. ? @ 0x00007f6c29757609
23. ? @ 0x00007f6c2967c353

Thank you for your bug report – we love squashing them!

welcome[bot] commented 4 months ago

Thanks for opening this issue. A team member should give feedback soon. In the meantime, feel free to check out the contributing guidelines.

srikanthccv commented 4 months ago

This shouldn't be happening unless some data is corrupted. Can you share the number of records in the signoz_metrics.exp_hist table? Do you have negative latencies reported by your application for some reason?

byronwall commented 4 months ago

I ran some queries to hopefully help you out. I do not know if my application is returning negative latencies. The query below maybe says "no" if I understand the result. Having said that, I am not intentionally trying to return a negative latency. I am mostly using SigNoz to understand spans visually and review exceptions, so I don't spend any time looking at latency.

Let me know if you want me to run any more queries on the database. The ones below were guesses based on your question and ChatGPT.

docker exec -it signoz-clickhouse clickhouse-client --query="SELECT COUNT(*) FROM signoz_metrics.exp_hist"
97501

Command:

docker exec -it signoz-clickhouse clickhouse-client --query="SELECT COUNT(*) AS total_records, SUM(count) AS total_count, MIN(min) AS minimum_latency, MAX(max) AS maximum_latency, MIN(max) AS minimum_of_max, MAX(min) AS maximum_of_min FROM signoz_metrics.exp_hist WHERE metric_name = 'signoz_latency'"

Results:

total_records total_count minimum_latency maximum_latency minimum_of_max maximum_of_min
97501 7388390 0 54821048.4397 0 24613273.5689
srikanthccv commented 4 months ago

Thanks. I am not sure what could be the reason. This exception is thrown when the serialized data is incorrect and can't be deserialized. From the logs containing the part, does it show any day other than 20240322?

srikanthccv commented 4 months ago

Did start seeing them after after any specific update? What is your ClickHouse version?

srikanthccv commented 4 months ago

There is no change in the binary format. I suspect it more likely some data corruption issue or HW issue.

byronwall commented 4 months ago

With respect to dates, it appears that most dates are impacted:

docker logs signoz-clickhouse 2>&1 | grep 'negative store' | grep -oE '([0-9]{8})_' | awk -F "_" '{print $1}' | sort -u

Gives: 20240227, 20240228, 20240305, 20240306, 20240308, 20240311, 20240313, 20240314, 20240315, 20240318, 20240319, 20240320, 20240321, 20240322, 20240325, 20240327, 20240329, 20240404, 20240419, 20240422

My ClickHouse version:

docker exec signoz-clickhouse clickhouse-client --version
ClickHouse client version 24.1.2.5 (official build).

Regarding version where it may have happened, my guess it that it was sometime after upgrading to 0.38.2? That's just a guess based on the updates I went through. SigNoz was definitely OK for some duration and then these logs started filling up after those updates. I had to provisioning a bigger drive at some point and now that bigger one is filling up so I need to do something different.

image

For a sample of the full logs:

docker logs signoz-clickhouse 2>&1 | grep 'negative store' | head
2024.04.29 17:27:14.230807 [ 617 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240321_8209_8241_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240321_8209_8225_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 129): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.299663 [ 620 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_8715_8787_3}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_8715_8743_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 274): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.300471 [ 634 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_9019_9082_3}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_9019_9057_2/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 286): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.301010 [ 622 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240308_3932_4014_4}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240308_3932_3974_2/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 210): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.301457 [ 626 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240227_110_170_3}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240227_146_150_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 33): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.302400 [ 627 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240306_2855_2897_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240306_2855_2873_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 182): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.307170 [ 624 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_8885_8925_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_8885_8905_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 203): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.383201 [ 629 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240305_2341_2395_4}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240305_2358_2376_3/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 55): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:16.285638 [ 626 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240321_8209_8241_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240321_8209_8225_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 129): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:16.385360 [ 618 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240319_7342_7369_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240319_7342_7355_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 83): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
byronwall commented 4 months ago

Assuming some sort of corruption, is there a simple way to clear out the bad data or other ClickHouse issues? I am not concerned if I lose data. I am more concerned about the disk filling up again and losing the server.

If possible, it'd be nice to maintain the SigNoz settings, but I could rebuild those if that's needed.

Thanks for the quick replies and feedback. I really appreciate the support.

srikanthccv commented 4 months ago

It would help if we could find a way to reproduce this. If you are not using the signoz_latency metric, you can delete the data from this table and set this setting to false https://github.com/SigNoz/signoz/blob/506916661d4debe6ed98129803e0b6cf5d643fac/deploy/docker/clickhouse-setup/otel-collector-config.yaml#L104

uliss3s commented 3 months ago

This shouldn't be happening unless some data is corrupted. Can you share the number of records in the signoz_metrics.exp_hist table? Do you have negative latencies reported by your application for some reason?

I got this error while trying to query this table:

default> select *
         from signoz_metrics.exp_hist
[2024-05-22 07:52:35] Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/9f7/9f72bc6a-d77f-41a1-8fda-23319e4cba06/20240522_2871_2926_3/ in table signoz_metrics.exp_hist (9f72bc6a-d77f-41a1-8fda-23319e4cba06) located on disk default of type local, from mark 0 with max_rows_to_read = 14522): While executing MergeTreeSelect(pool: ReadPool, algorithm: Thread). (INCORRECT_DATA) (version 24.1.2.5 (official build))
[2024-05-22 07:52:35] , server ClickHouseNode [uri=http://10.10.4.32:8123/default, options={session_id=DataGrip_3a180a62-ec73-4294-a8c1-3fc1487ab43a}]@537063924

Tested querying every single column. The error ocurs only when querying the column sketch:

default> select sketch
         from signoz_metrics.exp_hist
[2024-05-22 07:57:11] Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/9f7/9f72bc6a-d77f-41a1-8fda-23319e4cba06/20240522_2355_2379_5/ in table signoz_metrics.exp_hist (9f72bc6a-d77f-41a1-8fda-23319e4cba06) located on disk default of type local, from mark 0 with max_rows_to_read = 4767): While executing MergeTreeSelect(pool: ReadPool, algorithm: Thread). (INCORRECT_DATA) (version 24.1.2.5 (official build))
[2024-05-22 07:57:11] , server ClickHouseNode [uri=http://10.10.4.32:8123/default, options={session_id=DataGrip_3a180a62-ec73-4294-a8c1-3fc1487ab43a}]@537063924

Signoz v0.45.0 (docker compose setup)

cat /etc/os-release

NAME="Oracle Linux Server"
VERSION="8.8"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Oracle Linux Server 8.8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:8:8:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://github.com/oracle/oracle-linux"

ORACLE_BUGZILLA_PRODUCT="Oracle Linux 8"
ORACLE_BUGZILLA_PRODUCT_VERSION=8.8
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=8.8
viktara commented 1 month ago

Same issue here, self hosted signoz docker clickhouse set up. After some hours the issue occurs and puts the server under high load.

signoz version 0.49.1

Am setting enable_exp_histogram: false to see if that stops the problem.

This seems like a bug?