Open byronwall opened 4 months ago
Thanks for opening this issue. A team member should give feedback soon. In the meantime, feel free to check out the contributing guidelines.
This shouldn't be happening unless some data is corrupted. Can you share the number of records in the `signoz_metrics.exp_hist` table? Do you have negative latencies reported by your application for some reason?
I ran some queries to hopefully help you out. I do not know whether my application is returning negative latencies; if I'm reading the result below correctly, the answer is "no". Either way, I am not intentionally returning negative latencies. I mostly use SigNoz to review spans visually and look at exceptions, so I don't spend any time looking at latency.
Let me know if you want me to run any more queries on the database. The ones below were guesses based on your question and ChatGPT.
```shell
docker exec -it signoz-clickhouse clickhouse-client --query="SELECT COUNT(*) FROM signoz_metrics.exp_hist"
```

Result: `97501`
Command:
```shell
docker exec -it signoz-clickhouse clickhouse-client --query="SELECT COUNT(*) AS total_records, SUM(count) AS total_count, MIN(min) AS minimum_latency, MAX(max) AS maximum_latency, MIN(max) AS minimum_of_max, MAX(min) AS maximum_of_min FROM signoz_metrics.exp_hist WHERE metric_name = 'signoz_latency'"
```
Results:
| total_records | total_count | minimum_latency | maximum_latency | minimum_of_max | maximum_of_min |
|---|---|---|---|---|---|
| 97501 | 7388390 | 0 | 54821048.4397 | 0 | 24613273.5689 |
Thanks. I am not sure what could be the reason. This exception is thrown when the serialized data is incorrect and can't be deserialized. From the logs containing the part, does it show any day other than `20240322`?
Did you start seeing them after any specific update? What is your ClickHouse version?
There has been no change in the binary format. I suspect it is more likely a data corruption or hardware issue.
With respect to dates, it appears that most dates are impacted:
```shell
docker logs signoz-clickhouse 2>&1 | grep 'negative store' | grep -oE '([0-9]{8})_' | awk -F "_" '{print $1}' | sort -u
```

Gives: `20240227`, `20240228`, `20240305`, `20240306`, `20240308`, `20240311`, `20240313`, `20240314`, `20240315`, `20240318`, `20240319`, `20240320`, `20240321`, `20240322`, `20240325`, `20240327`, `20240329`, `20240404`, `20240419`, `20240422`
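As a cross-check against the log-grepping above, the same per-day breakdown can be read from ClickHouse itself. This is a sketch, not something from the thread; the column names come from the standard `system.parts` table:

```sql
-- List the active partitions ClickHouse still holds for the table,
-- with part counts and on-disk size per day.
SELECT partition,
       count() AS active_parts,
       formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE database = 'signoz_metrics' AND table = 'exp_hist' AND active
GROUP BY partition
ORDER BY partition;
```

Partitions with an unusually high part count are a hint that merges for that day keep failing and being retried.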
My ClickHouse version:
```shell
docker exec signoz-clickhouse clickhouse-client --version
ClickHouse client version 24.1.2.5 (official build).
```
Regarding when it may have started, my guess is that it was sometime after upgrading to 0.38.2? That's just a guess based on the updates I went through. SigNoz was definitely fine for some duration, and then these logs started filling up after those updates. I had to provision a bigger drive at some point, and now that bigger one is filling up too, so I need to do something different.
For a sample of the full logs:
```shell
docker logs signoz-clickhouse 2>&1 | grep 'negative store' | head
```

```
2024.04.29 17:27:14.230807 [ 617 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240321_8209_8241_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240321_8209_8225_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 129): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.299663 [ 620 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_8715_8787_3}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_8715_8743_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 274): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.300471 [ 634 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_9019_9082_3}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_9019_9057_2/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 286): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.301010 [ 622 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240308_3932_4014_4}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240308_3932_3974_2/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 210): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.301457 [ 626 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240227_110_170_3}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240227_146_150_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 33): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.302400 [ 627 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240306_2855_2897_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240306_2855_2873_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 182): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.307170 [ 624 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_8885_8925_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_8885_8905_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 203): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.383201 [ 629 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240305_2341_2395_4}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240305_2358_2376_3/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 55): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:16.285638 [ 626 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240321_8209_8241_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240321_8209_8225_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 129): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:16.385360 [ 618 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240319_7342_7369_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240319_7342_7355_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 83): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
```
Assuming some sort of corruption, is there a simple way to clear out the bad data, or is this a deeper ClickHouse issue? I am not concerned about losing data. I am more concerned about the disk filling up again and taking down the server.
If possible, it'd be nice to maintain the SigNoz settings, but I could rebuild those if that's needed.
Thanks for the quick replies and feedback. I really appreciate the support.
It would help if we could find a way to reproduce this. If you are not using the `signoz_latency` metric, you can delete the data from this table and set this setting to `false`: https://github.com/SigNoz/signoz/blob/506916661d4debe6ed98129803e0b6cf5d643fac/deploy/docker/clickhouse-setup/otel-collector-config.yaml#L104
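A minimal sketch of the deletion step, run via `clickhouse-client`. This is not from the thread itself; it assumes the `exp_hist` data is disposable and that the table's partition IDs match the dates visible in the part names (e.g. `20240321_8209_8225_1`):

```sql
-- Drop everything in the table at once:
TRUNCATE TABLE signoz_metrics.exp_hist;

-- Or drop only the affected daily partitions, one at a time:
ALTER TABLE signoz_metrics.exp_hist DROP PARTITION '20240321';
```

Dropping data alone won't stop the errors from coming back; the collector setting linked above is what stops new exponential-histogram data from being written.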
> This shouldn't be happening unless some data is corrupted. Can you share the number of records in the `signoz_metrics.exp_hist` table? Do you have negative latencies reported by your application for some reason?
I got this error while trying to query this table:
```
default> select *
from signoz_metrics.exp_hist
[2024-05-22 07:52:35] Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/9f7/9f72bc6a-d77f-41a1-8fda-23319e4cba06/20240522_2871_2926_3/ in table signoz_metrics.exp_hist (9f72bc6a-d77f-41a1-8fda-23319e4cba06) located on disk default of type local, from mark 0 with max_rows_to_read = 14522): While executing MergeTreeSelect(pool: ReadPool, algorithm: Thread). (INCORRECT_DATA) (version 24.1.2.5 (official build))
[2024-05-22 07:52:35] , server ClickHouseNode [uri=http://10.10.4.32:8123/default, options={session_id=DataGrip_3a180a62-ec73-4294-a8c1-3fc1487ab43a}]@537063924
```
I tested querying every single column. The error occurs only when querying the `sketch` column:
```
default> select sketch
from signoz_metrics.exp_hist
[2024-05-22 07:57:11] Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/9f7/9f72bc6a-d77f-41a1-8fda-23319e4cba06/20240522_2355_2379_5/ in table signoz_metrics.exp_hist (9f72bc6a-d77f-41a1-8fda-23319e4cba06) located on disk default of type local, from mark 0 with max_rows_to_read = 4767): While executing MergeTreeSelect(pool: ReadPool, algorithm: Thread). (INCORRECT_DATA) (version 24.1.2.5 (official build))
[2024-05-22 07:57:11] , server ClickHouseNode [uri=http://10.10.4.32:8123/default, options={session_id=DataGrip_3a180a62-ec73-4294-a8c1-3fc1487ab43a}]@537063924
```
Signoz v0.45.0 (docker compose setup)
```shell
cat /etc/os-release
```

```
NAME="Oracle Linux Server"
VERSION="8.8"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Oracle Linux Server 8.8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:8:8:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://github.com/oracle/oracle-linux"
ORACLE_BUGZILLA_PRODUCT="Oracle Linux 8"
ORACLE_BUGZILLA_PRODUCT_VERSION=8.8
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=8.8
```
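Since only the `sketch` column fails to deserialize, a hedged workaround for keeping the rest of the table queryable is to skip that column explicitly. This is a sketch using ClickHouse's `EXCEPT` column modifier, not something suggested in the thread:

```sql
-- Read every column except the corrupted sketch column.
SELECT * EXCEPT (sketch)
FROM signoz_metrics.exp_hist
LIMIT 10;
```

This confirms the corruption is confined to the serialized DDSketch payloads rather than the table's metadata columns.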
Same issue here with a self-hosted SigNoz Docker ClickHouse setup. After some hours the issue occurs and puts the server under high load.

SigNoz version 0.49.1

I am setting `enable_exp_histogram: false` to see if that stops the problem.
This seems like a bug?
Bug description
The `signoz-clickhouse` container is filled with tons of logs related to failed background tasks. Most errors appear to use the same language (example below). The main problem for me is that these logs and exceptions appear to take significant disk space; if their disk impact could be reduced, I wouldn't care. The important part of the exceptions is probably: `Code: 117. DB::Exception: Invalid flag for negative store: [...] in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 286)`
Expected behavior
Exceptions should probably not happen this frequently. If they do, they should not fill the disk.
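One way to contain the disk impact while the root cause is investigated is to cap the container's log size at the Docker level. This is a sketch for a Compose file, not part of the SigNoz defaults; the service name is illustrative and should match your actual `docker-compose.yaml`:

```yaml
# Cap container logs so repeated exceptions cannot fill the disk.
services:
  clickhouse:
    logging:
      driver: json-file
      options:
        max-size: "50m"   # rotate each log file at 50 MB
        max-file: "3"     # keep at most 3 rotated files
```

Note this only bounds the container's own log output; it does not address the corrupted parts accumulating inside ClickHouse's data directory, which also consume space when merges keep failing.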
How to reproduce
Version information
x86_64
Additional context
I added line breaks so you don't have to scroll this:
Thank you for your bug report – we love squashing them!