getsentry / self-hosted

Sentry, feature-complete and packaged up for low-volume deployments and proofs-of-concept
https://develop.sentry.dev/self-hosted/
Other
7.84k stars 1.76k forks source link

Infinite Kafka loop while upgrading from 10.0.0.dev0 #331

Closed maximal closed 4 years ago

maximal commented 4 years ago

I had a clean install of 10.0.0.dev0. Pulled a repo, run sudo ./install.sh.

The result is I have an infinite loop of

2020-01-09 06:22:06,239 Connection to Kafka failed (attempt 0)
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 49, in bootstrap
    client.list_topics(timeout=1)
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
2020-01-09 06:22:08,242 Connection to Kafka failed (attempt 1)
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 49, in bootstrap
    client.list_topics(timeout=1)
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
2020-01-09 06:22:10,245 Connection to Kafka failed (attempt 2)
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 49, in bootstrap
    client.list_topics(timeout=1)
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
2020-01-09 06:22:12,247 Connection to Kafka failed (attempt 3)
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 49, in bootstrap
    client.list_topics(timeout=1)
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
2020-01-09 06:22:14,252 Connection to Kafka failed (attempt 4)

Now web service doesn’t start, telling something like:

ERROR: for web  Cannot start service web: driver failed programming external connectivity on endpoint sentry_onpremise_web_1 (f0ddcb13407cc4050bee51a945fe6e15f5e536229e0b6f1c93501e966f7ccfc7): Bind for 0.0.0.0:9000 failed: port is already allocated

Restarting Docker won’t help, nor killing the process listening to 9000.

BYK commented 4 years ago

@maximal the seemingly infinite loop is expected and it will eventually settle (and you should wait until it settles) as mentioned in #323.

For the binding issue, I suspect you have some custom configuration or environment variables you have set, getting uwsgi confused about ports and protocols. Can you share your configuration and custom env variables?

maximal commented 4 years ago

How long might it take? A waited a couple of hours.

My .env:

COMPOSE_PROJECT_NAME=sentry_onpremise
SENTRY_EVENT_RETENTION_DAYS=90

The installation is very basic except that I use nginx as a proxy domain → localhost:9000.

jazerix commented 4 years ago

Can confirm, ran a fresh install. And it's looping over the Kafka error, so far for 15 minutes in. @BYK how long is it supposed to take 😃?

BYK commented 4 years ago

Looks like Kafka is not able to get up at all. How much memory do you have on this system? Maybe we should increase our memory requirements? Or maybe you're starved of CPU?

OrfeasZ commented 4 years ago

I had the same issue as well. It looks like the cause is that previous versions named the containers sentry-onprem_XXX while the latest names them sentry_onpremise_XXX and therefore the script is unable to properly clean them up, leading in conflicts when it tries setting them up again.

I fixed this by manually stopping and removing all the sentry-onprem_* containers and running the install.sh script again.

BYK commented 4 years ago

@OrfeasZ - ah, that may happen if you are coming from Sentry 10 BETA. That said there's some code in ./install.sh to compensate for that. Anyway, always a good idea to try this, yes.

splattael commented 4 years ago

You might also want to remove volumes. To cleanup the current installation I'll do:

docker-compose down --volumes
maximal commented 4 years ago

I don’t have any docker-onprem_* container:

% sudo docker container ls
CONTAINER ID        IMAGE                             COMMAND                  CREATED             STATUS                           PORTS                          NAMES
0a281dca8121        confluentinc/cp-kafka:5.1.2       "/etc/confluent/dock…"   28 minutes ago      Up 16 seconds                    9092/tcp                       sentry_onpremise_kafka_1
f32a35e86c98        confluentinc/cp-zookeeper:5.1.2   "/etc/confluent/dock…"   28 minutes ago      Up 28 minutes                    2181/tcp, 2888/tcp, 3888/tcp   sentry_onpremise_zookeeper_1
a48871313f83        yandex/clickhouse-server:19.4     "/entrypoint.sh"         28 minutes ago      Restarting (70) 51 seconds ago                                  sentry_onpremise_clickhouse_1
5ffd76a882a5        redis:5.0-alpine                  "docker-entrypoint.s…"   28 minutes ago      Up 28 minutes                    6379/tcp                       sentry_onpremise_redis_1
3e43baf87305        baf2bac6909e                      "/bin/sh -c 'exec $S…"   2 weeks ago         Up 2 weeks                       9000/tcp                       sentry-onpremise_worker_1
c55be6f95e3a        baf2bac6909e                      "/bin/sh -c 'exec $S…"   2 weeks ago         Up 2 weeks                       9000/tcp                       sentry-onpremise_cron_1
e65ba465a007        0a43f4d54413                      "/entrypoint.sh '0 0…"   2 weeks ago         Up 2 weeks                       9000/tcp                       sentry-onpremise_sentry-cleanup_1
8687418db44d        baf2bac6909e                      "/bin/sh -c 'exec $S…"   2 weeks ago         Up 2 weeks                       0.0.0.0:9000->9000/tcp         sentry-onpremise_web_1
f33e2a551fd0        52ff2f70ce97                      "/entrypoint.sh '55 …"   2 weeks ago         Up 2 weeks                       3021/tcp                       sentry-onpremise_symbolicator-cleanup_1
b7b8fa8e4569        f2d2da71650e                      "/bin/bash /docker-e…"   2 weeks ago         Up 2 weeks                       3021/tcp                       sentry-onpremise_symbolicator_1
a08e9db2d6d8        f5bb4800d474                      "/entrypoint.sh '*/5…"   2 weeks ago         Up 2 weeks                       1218/tcp                       sentry-onpremise_snuba-cleanup_1
a2956b2b9972        271fe968ae9c                      "./docker_entrypoint…"   2 weeks ago         Up 2 weeks                       1218/tcp                       sentry-onpremise_snuba-api_1
9d4b2fb9b026        271fe968ae9c                      "./docker_entrypoint…"   2 weeks ago         Up 46 hours                      1218/tcp                       sentry-onpremise_snuba-replacer_1
63d707e3e0b4        271fe968ae9c                      "./docker_entrypoint…"   2 weeks ago         Up 46 hours                      1218/tcp                       sentry-onpremise_snuba-consumer_1
cc2157f23291        confluentinc/cp-kafka:5.1.2       "/etc/confluent/dock…"   2 weeks ago         Up 2 weeks                       9092/tcp                       sentry-onpremise_kafka_1
bad751f33e63        a49ff3e0d85f                      "docker-entrypoint.s…"   2 weeks ago         Up 2 weeks                       6379/tcp                       sentry-onpremise_redis_1
a0c5d03b8ac8        memcached:1.5-alpine              "docker-entrypoint.s…"   2 weeks ago         Up 2 weeks                       11211/tcp                      sentry-onpremise_memcached_1
f146b77c59ed        66f224075d8f                      "docker-entrypoint.s…"   2 weeks ago         Up 2 weeks                       25/tcp                         sentry-onpremise_smtp_1
f4619775b261        confluentinc/cp-zookeeper:5.1.2   "/etc/confluent/dock…"   2 weeks ago         Up 2 weeks                       2181/tcp, 2888/tcp, 3888/tcp   sentry-onpremise_zookeeper_1
4ccdde74b4e5        yandex/clickhouse-server:19.4     "/entrypoint.sh"         2 weeks ago         Up 2 weeks                       8123/tcp, 9000/tcp, 9009/tcp   sentry-onpremise_clickhouse_1

Run sudo docker-compose down --volumes, got:

Stopping sentry_onpremise_kafka_1      ... done
Stopping sentry_onpremise_zookeeper_1  ... done
Stopping sentry_onpremise_clickhouse_1 ... done
Stopping sentry_onpremise_redis_1      ... done
Removing sentry_onpremise_snuba-api_run_28a5c1c68564 ... done
Removing sentry_onpremise_kafka_1                    ... done
Removing sentry_onpremise_zookeeper_1                ... done
Removing sentry_onpremise_clickhouse_1               ... done
Removing sentry_onpremise_redis_1                    ... done
Removing network sentry_onpremise_default
Volume sentry-data is external, skipping
Volume sentry-postgres is external, skipping
Volume sentry-redis is external, skipping
Volume sentry-zookeeper is external, skipping
Volume sentry-kafka is external, skipping
Volume sentry-clickhouse is external, skipping
Volume sentry-symbolicator is external, skipping
Removing volume sentry_onpremise_sentry-secrets
Removing volume sentry_onpremise_sentry-smtp
Removing volume sentry_onpremise_sentry-zookeeper-log
Removing volume sentry_onpremise_sentry-kafka-log
Removing volume sentry_onpremise_sentry-smtp-log

Running sudo ./install.sh after that is still falling into infinite loop of creating/connecting Kafka.

I have 8 GB of RAM and 4-core CPU.

Install logs: sentry_install_log-2020-01-10_14-29-38.txt

OrfeasZ commented 4 years ago

@maximal Sorry, there was a typo in my comment. I meant to write sentry-onprem. Try deleting all your sentry-onpremise_* containers (the ones with the dash, not the underscore) and possibly also your sentry-kafka volume (if it exists) and then run the install.sh script again.

maximal commented 4 years ago

Removed all sentry-onpremise_* containers:

sudo docker container rm sentry-onpremise_worker_1 sentry-onpremise_cron_1 sentry-onpremise_sentry-cleanup_1 sentry-onpremise_web_1 sentry-onpremise_symbolicator-cleanup_1 sentry-onpremise_symbolicator_1 sentry-onpremise_snuba-cleanup_1 sentry-onpremise_snuba-api_1 sentry-onpremise_snuba-replacer_1 sentry-onpremise_snuba-consumer_1 sentry-onpremise_kafka_1 sentry-onpremise_redis_1 sentry-onpremise_memcached_1 sentry-onpremise_smtp_1 sentry-onpremise_zookeeper_1 sentry-onpremise_clickhouse_1 --force

Removed sentry-kafka volume:

sudo docker volume rm sentry-kafka

Run sudo ./install.sh. Finally, it worked excellent! sudo docker-compose up -d and we’re done!

Done. Thank you all guys!


By the way, there was some different error at the end of install:

Creating sentry_onpremise_snuba-replacer_1 ... done
Creating sentry_onpremise_snuba-api_1      ... done
Creating sentry_onpremise_snuba-consumer_1 ... done
Creating sentry_onpremise_memcached_1      ... done
Creating sentry_onpremise_symbolicator_1   ... done
Creating sentry_onpremise_smtp_1           ... done
Creating sentry_onpremise_postgres_1       ... done
12:27:01 [WARNING] sentry.utils.geo: settings.GEOIP_PATH_MMDB not configured.
12:27:22 [INFO] sentry.plugins.github: apps-not-configured
!! Configuration error: sentry.search service failed to call validate()
'module' object has no attribute 'SnubaSearchBackend'
Cleaning up...
OrfeasZ commented 4 years ago

@maximal to fix that error go into your sentry/sentry.conf.py file and edit the SENTRY_SEARCH option to read as follows:

SENTRY_SEARCH = "sentry.search.snuba.EventsDatasetSnubaSearchBackend"

Then, re-run install.sh.

maximal commented 4 years ago

@OrfeasZ, thanks! Set the variable, install.sh yields:

Creating sentry_onpremise_snuba-replacer_1 ... done
Creating sentry_onpremise_snuba-consumer_1 ... done
Creating sentry_onpremise_snuba-api_1      ... done
12:40:17 [WARNING] sentry.utils.geo: settings.GEOIP_PATH_MMDB not configured.
12:40:37 [INFO] sentry.plugins.github: apps-not-configured
Traceback (most recent call last):
  File "/usr/local/bin/sentry", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/site-packages/sentry/runner/__init__.py", line 164, in main
    cli(prog_name=get_prog(), obj={}, max_content_width=100)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sentry/runner/decorators.py", line 30, in inner
    return ctx.invoke(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sentry/runner/commands/upgrade.py", line 168, in upgrade
    _upgrade(not noinput, traceback, verbosity, not no_repair)
  File "/usr/local/lib/python2.7/site-packages/sentry/runner/commands/upgrade.py", line 121, in _upgrade
    _migrate_from_south(verbosity)
  File "/usr/local/lib/python2.7/site-packages/sentry/runner/commands/upgrade.py", line 93, in _migrate_from_south
    if not _has_south_history(connection):
  File "/usr/local/lib/python2.7/site-packages/sentry/runner/commands/upgrade.py", line 78, in _has_south_history
    cursor = connection.cursor()
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 233, in cursor
    cursor = self.make_cursor(self._cursor())
  File "/usr/local/lib/python2.7/site-packages/sentry/db/postgres/decorators.py", line 44, in inner
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sentry/db/postgres/base.py", line 99, in _cursor
    cursor = super(DatabaseWrapper, self)._cursor()
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 204, in _cursor
    self.ensure_connection()
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 199, in ensure_connection
    self.connect()
  File "/usr/local/lib/python2.7/site-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 199, in ensure_connection
    self.connect()
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 171, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/postgresql/base.py", line 176, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/usr/local/lib/python2.7/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: FATAL:  the database system is starting up

Cleaning up...

Is it fine?

OrfeasZ commented 4 years ago

@maximal That doesn't look right, but I'm not sure.

maximal commented 4 years ago

Anyway, the issue is closed. Thank you for your time!

BYK commented 4 years ago

@maximal that doesn't look right. Probably your postgres container is taking its time spinning up. I'd try running the script again and see what happens.

maximal commented 4 years ago

Seems like Postgres volume got corrupted during the endless attempts to fight this issue.

Logs were:

postgres_1                | PostgreSQL Database directory appears to contain a database; Skipping initialization
postgres_1                | 
postgres_1                | LOG:  database system was interrupted while in recovery at 2020-01-10 12:43:56 UTC
postgres_1                | HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
postgres_1                | LOG:  database system was not properly shut down; automatic recovery in progress
postgres_1                | LOG:  redo starts at 0/645760B8
postgres_1                | WARNING:  specified item offset is too large
postgres_1                | CONTEXT:  xlog redo at 0/645A98D8 for Btree/INSERT_LEAF: off 70
postgres_1                | PANIC:  btree_insert_redo: failed to add item
postgres_1                | CONTEXT:  xlog redo at 0/645A98D8 for Btree/INSERT_LEAF: off 70
postgres_1                | LOG:  startup process (PID 25) was terminated by signal 6: Aborted
postgres_1                | LOG:  aborting startup due to startup process failure
postgres_1                | LOG:  database system is shut down

I made a clean fresh install already, unfortunately. But this Kafka issue easily solved by https://github.com/getsentry/onpremise/issues/331#issuecomment-573013630 and https://github.com/getsentry/onpremise/issues/331#issuecomment-573017908, thus closed.

kocoten1992 commented 4 years ago

Hi, I'm having the exact same problem,

+ exec gosu snuba snuba bootstrap --force
2020-05-05 18:49:59,366 Connection to Kafka failed (attempt 0)
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 56, in bootstrap
    client.list_topics(timeout=1)
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
...

after snuba bootstrap, Kafka never wake up and the script timeout.

maximal commented 4 years ago

Having this issue while upgrading 10.0.0.dev0 to the new fresh dev version again.

maximal commented 4 years ago

Removing sentry-kafka volume and sentry_onpremise_kafka_1 container won’t help:

docker-compose down
docker container rm sentry_onpremise_kafka_1
docker volume rm sentry-kafka
./install.sh

yields:


... ... ...

+ snuba bootstrap --help
+ set -- snuba bootstrap --force
+ set gosu snuba snuba bootstrap --force
+ exec gosu snuba snuba bootstrap --force
2020-05-05 23:19:38,870 Connection to Kafka failed (attempt 0)
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 56, in bootstrap
    client.list_topics(timeout=1)
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
2020-05-05 23:19:40,873 Connection to Kafka failed (attempt 1)
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 56, in bootstrap
    client.list_topics(timeout=1)
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}

... ... ...

2020-05-05 23:21:35,085 Connection to Kafka failed (attempt 58)
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 56, in bootstrap
    client.list_topics(timeout=1)
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
2020-05-05 23:21:37,088 Connection to Kafka failed (attempt 59)
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 56, in bootstrap
    client.list_topics(timeout=1)
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 11, in <module>
    load_entry_point('snuba', 'console_scripts', 'snuba')()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 56, in bootstrap
    client.list_topics(timeout=1)
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
Cleaning up...
maximal commented 4 years ago

“Reverting” #465 helps indeed. In docker-compose.yml:

Line 66: 'confluentinc/cp-zookeeper:5.5.0' → 'confluentinc/cp-zookeeper:5.1.2'
and
Line 80: 'confluentinc/cp-kafka:5.5.0' → 'confluentinc/cp-kafka:5.1.2'

After that, ./install.sh runs just fine, doing migrations at the end and exiting normally.

kocoten1992 commented 4 years ago

“Reverting” #465 helps indeed. In docker-compose.yml:

Line 66: 'confluentinc/cp-zookeeper:5.5.0' → 'confluentinc/cp-zookeeper:5.1.2'
and
Line 80: 'confluentinc/cp-kafka:5.5.0' → 'confluentinc/cp-kafka:5.1.2'

After that, ./install.sh runs just fine, doing migrations at the end and exiting normally.

Your methods work, thanks!

cc @BYK , cause you did that commit - without automation test.

BYK commented 4 years ago

@kocoten1992

cc @BYK , cause you did that commit - without automation test.

It is not easy to test every possible case. We do have some basic tests to ensure at least Sentry is up and running but having upgrade tests would be a bit too costly. We may invest in that in the near future. In the meantime if you have the time to help, we'd appreciate any contributions to make these safer.

maximal commented 4 years ago

@kocoten1992, see https://github.com/getsentry/onpremise/issues/472#issuecomment-624719355

BYK commented 4 years ago

Since what is being experienced here is from a new reason, closing this issue in favor of #472. We can keep the discussion there.

kocoten1992 commented 4 years ago

hey, sorry for my attitude, I accidentally do docker prune all, and I get it on you, srr

BYK commented 4 years ago

@kocoten1992 no worries. My offer for any contributions if you have the time is still valid 🙂