SigNoz / signoz

SigNoz is an open-source, OpenTelemetry-native observability platform with logs, traces, and metrics in a single application. An open-source Application Performance Monitoring (APM) & observability alternative to DataDog, New Relic, etc.
https://signoz.io

Otel-collector keeps crashing in docker-swarm setup with NFS data directory #2445

Open sati-max opened 1 year ago

sati-max commented 1 year ago

Bug description

The otel-collector keeps crash-looping and fails to start.

On a fresh SigNoz deployment on a fresh Ubuntu + Docker Swarm setup, the otel-collector container crash-loops and never starts working. The swarm has 4 nodes (2 managers + 2 workers; I also tested 1 manager + 1 worker and 3 managers + 1 worker, with the same effect on every layout), plus an additional host acting as an NFS server. All swarm nodes mount the same NFS export at the same directory, and the SigNoz directory is /data/signoz/[...]. Everything is on the same network, every host can ping/connect to the others, and the firewall is disabled on every host.
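For reference, the shared mount on each swarm node looks roughly like this. The NFS server hostname and export path are placeholders I made up; only the /data mount point and /data/signoz/[...] layout come from the description above.

```shell
# Hypothetical /etc/fstab entry present on every swarm node
# (nfs-server.example and /export/data are assumed names, not from the report):
#
#   nfs-server.example:/export/data  /data  nfs  defaults,_netdev  0 0

# Equivalent one-off mount for testing:
mount -t nfs nfs-server.example:/export/data /data

# The SigNoz checkout then lives on shared storage:
ls /data/signoz/deploy/docker-swarm/clickhouse-setup/
```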

After running `docker stack services signoz`:

```
NAME                            MODE        REPLICAS  IMAGE                                       PORTS
signoz_alertmanager             replicated  1/1       signoz/alertmanager:0.23.0-0.2
signoz_clickhouse               replicated  1/1       clickhouse/clickhouse-server:22.8.8-alpine
signoz_frontend                 replicated  1/1       signoz/frontend:0.16.2                      :3301->3301/tcp
signoz_otel-collector           global      0/0       signoz/signoz-otel-collector:0.66.5         :4317-4318->4317-4318/tcp, :54527->54527/tcp
signoz_otel-collector-metrics   replicated  0/1       signoz/signoz-otel-collector:0.66.5
signoz_query-service            replicated  1/1       signoz/query-service:0.16.2
signoz_zookeeper-1              replicated  1/1       bitnami/zookeeper:3.7.0                     :2181->2181/tcp, :2888->2888/tcp, :3888->3888/tcp
```

Docker logs:

```
application run finished with error: cannot build pipelines: failed to create "clickhouselogsexporter" exporter, in pipeline "logs": cannot configure clickhouse logs exporter: clickhouse Migrate failed to run, error: migration failed in line 0: RENAME TABLE IF EXISTS signoz_logs.logs_atrribute_keys TO signoz_logs.logs_attribute_keys on CLUSTER cluster;
(details: code: 57, message: There was an error on [clickhouse:9000]: Code: 57. DB::Exception: Table signoz_logs.logs_attribute_keys already exists. (TABLE_ALREADY_EXISTS) (version 22.8.8.3 (official build)))
```
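The error means the RENAME migration finds that both the old misspelled table (`logs_atrribute_keys`) and its renamed target (`logs_attribute_keys`) already exist, so it can never succeed. A way to inspect the state from the ClickHouse container is sketched below; dropping the stale misspelled table is an unofficial workaround I am assuming here (back up first), not a documented SigNoz fix.

```shell
# List which of the two table names exist in signoz_logs
# (the misspelled logs_atrribute_keys is the pre-migration name from the log):
docker exec -it "$(docker ps -qf name=signoz_clickhouse)" \
  clickhouse-client --query "SHOW TABLES FROM signoz_logs LIKE 'logs_at%'"

# If BOTH tables exist, one possible workaround (assumption, back up first)
# is to drop the stale misspelled table so the RENAME becomes a no-op:
docker exec -it "$(docker ps -qf name=signoz_clickhouse)" \
  clickhouse-client --query \
  "DROP TABLE IF EXISTS signoz_logs.logs_atrribute_keys ON CLUSTER cluster"
```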

Sometimes the otel-collector manages to start properly and works until I start/stop/restart the stack or cut the network connection to one of the nodes; then it crash-loops again.

This doesn't happen if I run SigNoz without the NFS share, or if the NFS share is mounted on only one manager node.


Expected behavior

SigNoz works with external NFS share in a docker swarm setup.

How to reproduce

  1. Installed Docker and docker-compose, initialized Docker Swarm on manager1 (without any additional flags), then joined the other nodes as manager2, worker1, and worker2.
  2. On manager1: cloned the SigNoz repo into the mounted /data NFS directory.
  3. Edited docker-compose.yaml (disabled the hotrod app, added a syslog port to the otel-collector service) and otel-collector-config.yaml (disabled Docker container log collection, enabled syslog), following the SigNoz documentation for hotrod, Docker container logs, and syslog.
  4. `docker stack deploy -c /data/signoz/deploy/docker-swarm/clickhouse-setup/docker-compose.yaml signoz`
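The steps above amount to roughly the following. Node names, the /data NFS mount, and the deploy command come from the report; the join-token flow and the clone URL are standard Docker/SigNoz usage I am filling in, and the config edits are only summarized as comments.

```shell
# On manager1 (no extra flags, per the report):
docker swarm init

# Join the remaining nodes using the tokens printed by:
#   docker swarm join-token manager   # run the printed command on manager2
#   docker swarm join-token worker    # run the printed command on worker1, worker2

# Clone SigNoz onto the shared NFS mount:
git clone https://github.com/SigNoz/signoz.git /data/signoz

# Before deploying, edit per the SigNoz docs (summarized from the report):
#   deploy/docker-swarm/clickhouse-setup/docker-compose.yaml
#     - disable the hotrod demo app; expose a syslog port on otel-collector
#   deploy/docker-swarm/clickhouse-setup/otel-collector-config.yaml
#     - disable Docker container log collection; enable the syslog receiver

docker stack deploy \
  -c /data/signoz/deploy/docker-swarm/clickhouse-setup/docker-compose.yaml signoz
```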

Version information

Additional context

I tried to reach out via Slack with no success. Between the Slack post and this GitHub issue I ran more tests, which is why the setup differs (3 managers + 1 worker vs. 2 managers + 2 workers).


Thank you for any suggestions on what I might be doing wrong in this setup...

welcome[bot] commented 1 year ago

Thanks for opening this issue. A team member should give feedback soon. In the meantime, feel free to check out the contributing guidelines.

srikanthccv commented 1 year ago

Using NFS for the data directory is not something we have tested with ClickHouse. There may be issues coming from ClickHouse itself or from our own end.

sati-max commented 1 year ago

> Using NFS for the data directory is not something we have tested with ClickHouse. There may be issues coming from ClickHouse itself or from our own end.

Hey, thanks for the information.

Can you say what kind of setup SigNoz was tested in, especially a cluster-type environment with multiple SigNoz nodes/hosts?