Logflare / logflare_logger_backend

Logger backend to send Elixir logs to Logflare.

Backend seems to freeze periodically #51

Closed · darksheik closed this issue 3 years ago

darksheik commented 3 years ago

Sorry if this is vague, but we had a strange situation on two separate days where our system seemed to be locking up completely and we couldn't pin it down. By process of elimination we removed LogflareLogger.HttpBackend from our logger config and everything was fine with only our file backend running.

In the situation below, a simple database call was hanging and ultimately died, and every other process that was trying to log something at the same time also hung until it crashed and restarted. Then we'd get a burst of activity for the better part of a minute and it would lock up again. This was happening on both of our production nodes.

Wed Jan 27 2021 16:08:23 Postgrex.Protocol (#PID<0.4548.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.20225.0> timed out because it queued and checked out the connection for longer than 60000ms

#PID<0.20225.0> was at location:

    (stdlib 3.13) gen.erl:208: :gen.do_call/4
    (stdlib 3.13) gen_event.erl:282: :gen_event.rpc/2
    (logger 1.10.4) lib/logger/handler.ex:102: Logger.Handler.log/2
    (kernel 7.0) logger_backend.erl:51: :logger_backend.call_handlers/3
    ...

Are there known issues with the backend that might cause this, or are we possibly attempting to log more than it can handle? Is there a setting we might be able to adjust?

This started occurring today at around 2 PM ET, a time when there is usually an increase in logging activity in our system. One of our developers thinks we may have exceeded the threshold described below, but that's just a guess:

:sync_threshold - if the Logger manager has more than :sync_threshold messages in its queue, Logger will change to sync mode, to apply backpressure to the clients. Logger will return to async mode once the number of messages in the queue is reduced to one below the sync_threshold. Defaults to 20 messages. :sync_threshold can be set to 0 to force sync mode.

gvaughn commented 3 years ago

We were definitely exceeding the sync_threshold. I captured this during troubleshooting. I assume the :httpc.handle_answer/3 call is waiting for an acknowledgement from Logflare.

[Screenshot: Phoenix LiveDashboard]

chasers commented 3 years ago

Thanks for this report. I don't think I've experienced this, and I've sent thousands of events per second from our production servers before when load testing.

Seems like I should be able to replicate it if I set the Logflare URL to something that just doesn't exist.
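Roughly, in config, that would look like the following (the option names here are approximate; the backend's README has the exact keys):

config :logflare_logger_backend,
  url: "https://does-not-exist.invalid",
  api_key: "your-api-key",
  source_id: "your-source-id"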

We'll figure it out and get a fix out. I'll let you know if I need any more info. Thanks again.

chasers commented 3 years ago

@gvaughn it looks like you're on Elixir 1.10? Can you upgrade to the latest Elixir? It seems like this could be part of the problem:

https://github.com/elixir-lang/elixir/issues/10420

This was confusing because our backend uses hackney, and hackney doesn't use :httpc.

Our backend causing sync mode is another issue. When accepting messages from Logger, all we do is cache them in batches, but it looks like flush! would block the handler's event loop if we're responding abnormally slowly, which was the case for a bit yesterday. We'll wrap that in a Task, which I think will fix this.
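Roughly the idea (a sketch only, not the backend's actual code; the state shape and the post_batch/1 helper are made up for illustration):

defmodule FlushSketch do
  # Sketch: run the HTTP POST in a fire-and-forget Task so a slow or failing
  # request cannot block the Logger handler's event loop.
  def flush!(state) do
    batch = state.batch

    Task.start(fn ->
      # post_batch/1 stands in for whatever actually ships the batch to Logflare.
      post_batch(batch)
    end)

    # Clear the cached batch; the handler returns immediately either way.
    %{state | batch: []}
  end

  defp post_batch(_batch), do: :ok
end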

I'll do some more load testing while forcing sync mode too, and make sure that our backend crashing will never have adverse effects on your app.

gvaughn commented 3 years ago

Yeah, we looked at that issue. I'm not convinced that's us because we don't have handle_otp_reports enabled and our stacktrace is not stuck in :gen.do_call/4. However, now that you tell me that logflare_backend uses hackney, I can't explain our stacktrace.

We're looking into upgrading Elixir plus probably customizing more Logger settings.

gvaughn commented 3 years ago

@chasers I haven't really dug into Logger much before. If I'm reading the docs right and we set the sync_threshold and discard_threshold to the same value, I think we'd avoid sync mode completely. Is that your understanding? That might be a decent stopgap/insurance for us.

chasers commented 3 years ago

@gvaughn yeah, I think so. I would maybe set them at 1000. Can you email me your Logflare source id? Or the email you used to sign up with? Email support@logflare.app.
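For reference, that's just the two standard Logger options set to the same value in config:

config :logger,
  sync_threshold: 1_000,
  discard_threshold: 1_000

With discard_threshold no higher than sync_threshold, Logger should go straight to discard mode instead of sync mode once the queue backs up, so callers are never blocked waiting on the backend.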

I am QAing some updates currently btw.

chasers commented 3 years ago

@gvaughn @darksheik I've got rc.4 published on Hex and running on a test system and rolled out to our production cluster just now. All looks good so far.

darksheik commented 3 years ago

@chasers Thanks so much! We can try it out in a bit. We noticed that you have moved to Mint + Finch, but hackney is still a dependency. Is that an oversight, or do you still need it?

gvaughn commented 3 years ago

One other note: we got this warning:

==> logflare_api_client
warning: the dependency :logflare_api_client requires Elixir "~> 1.11" but you are running on v1.10.4

Is that really dependent on 1.11 features? We've deployed it to our staging environment with no issues so far, but wanted to double-check.

chasers commented 3 years ago

Good call on both ... we will clean up the deps shortly.

chasers commented 3 years ago

@gvaughn @darksheik rc.6 is published for your testing pleasure.

darksheik commented 3 years ago

We've already got rc.4 running in prod and it has been working great through our busiest time of the day.

chasers commented 3 years ago

Great! I'm going to let this run on prod until tomorrow and then bump the official version.

darksheik commented 3 years ago

@chasers Everything seems to be working in prod; however, I just started a remote console on my server and I'm seeing this message repeating every second or so.

21:43:30.877 [error] GenServer #PID<0.140.0> terminating
** (stop) {:EXIT, {%RuntimeError{message: "could not lookup Ecto repo LogflareLogger.Repo because it was not started or it does not exist"}, [{Ecto.Repo.Registry, :lookup, 1, [file: 'lib/ecto/repo/registry.ex', line: 19]}, {Ecto.Repo.Queryable, :execute, 4, [file: 'lib/ecto/repo/queryable.ex', line: 210]}, {Ecto.Repo.Queryable, :all, 3, [file: 'lib/ecto/repo/queryable.ex', line: 17]}, {LogflareLogger.BatchCache, :flush, 1, [file: 'lib/logflare_logger/batch_cache.ex', line: 44]}, {LogflareLogger.HttpBackend, :flush!, 1, [file: 'lib/logflare_logger/http_backend.ex', line: 113]}, {:gen_event, :server_update, 4, [file: 'gen_event.erl', line: 620]}, {:gen_event, :server_notify, 4, [file: 'gen_event.erl', line: 602]}, {:gen_event, :handle_msg, 6, [file: 'gen_event.erl', line: 390]}]}}
Last message: {:gen_event_EXIT, LogflareLogger.HttpBackend, {:EXIT, {%RuntimeError{message: "could not lookup Ecto repo LogflareLogger.Repo because it was not started or it does not exist"}, [{Ecto.Repo.Registry, :lookup, 1, [file: 'lib/ecto/repo/registry.ex', line: 19]}, {Ecto.Repo.Queryable, :execute, 4, [file: 'lib/ecto/repo/queryable.ex', line: 210]}, {Ecto.Repo.Queryable, :all, 3, [file: 'lib/ecto/repo/queryable.ex', line: 17]}, {LogflareLogger.BatchCache, :flush, 1, [file: 'lib/logflare_logger/batch_cache.ex', line: 44]}, {LogflareLogger.HttpBackend, :flush!, 1, [file: 'lib/logflare_logger/http_backend.ex', line: 113]}, {:gen_event, :server_update, 4, [file: 'gen_event.erl', line: 620]}, {:gen_event, :server_notify, 4, [file: 'gen_event.erl', line: 602]}, {:gen_event, :handle_msg, 6, [file: 'gen_event.erl', line: 390]}]}}}
21:43:31.879 [error] :gen_event handler LogflareLogger.HttpBackend installed in Logger terminating
** (RuntimeError) could not lookup Ecto repo LogflareLogger.Repo because it was not started or it does not exist
    (ecto 3.5.6) lib/ecto/repo/registry.ex:19: Ecto.Repo.Registry.lookup/1
    (ecto 3.5.6) lib/ecto/repo/queryable.ex:210: Ecto.Repo.Queryable.execute/4
    (ecto 3.5.6) lib/ecto/repo/queryable.ex:17: Ecto.Repo.Queryable.all/3
    (logflare_logger_backend 0.8.0-rc.4) lib/logflare_logger/batch_cache.ex:44: LogflareLogger.BatchCache.flush/1
    (logflare_logger_backend 0.8.0-rc.4) lib/logflare_logger/http_backend.ex:113: LogflareLogger.HttpBackend.flush!/1
    (stdlib 3.13) gen_event.erl:620: :gen_event.server_update/4
    (stdlib 3.13) gen_event.erl:602: :gen_event.server_notify/4
    (stdlib 3.13) gen_event.erl:390: :gen_event.handle_msg/6
Last message: :flushed: 

Should I open a separate issue for this?

chasers commented 3 years ago

Yeah that's no good. Odd. No need for another issue. We'll fix.

gvaughn commented 3 years ago

If you think rc.6 would fix it, we can give that a shot.

chasers commented 3 years ago

No, this is something else. Perhaps the Elixir version. I just checked the console on our prod cluster and it's not there.

gvaughn commented 3 years ago

The BatchCache.flush is querying an Ecto Repo?

chasers commented 3 years ago

The BatchCache is a Repo backed by Etso (which makes an Ecto Repo backed by ETS). It's curious that it's even working if it's not started.
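For context, an Etso-backed repo is defined like any other Ecto repo, just with the Etso adapter, and it has to be running under a supervisor before it can be queried (the otp_app below is illustrative):

defmodule LogflareLogger.Repo do
  use Ecto.Repo,
    otp_app: :logflare_logger_backend,
    adapter: Etso.Adapter
end

# The repo must be started somewhere in a supervision tree; querying it before
# that raises the "could not lookup Ecto repo ... it was not started" error above.
{:ok, _pid} = Supervisor.start_link([LogflareLogger.Repo], strategy: :one_for_one)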

darksheik commented 3 years ago

@chasers I have some new information for you. I don't see the above error about the Repo in the server logs until I launch a remote console into our running release. Then I see it appear every second, so it seems directly related.

The same is true if I try to run migrations through the release commands.

So that's some good news and explains why we're not seeing any errors reported to our error tracking service.

chasers commented 3 years ago

Huh ... yeah, I confirmed I am not seeing those errors either. My test app runs on Gigalixir with its log drain going to Logflare, so I can see any console errors there. And on Google Cloud our instances log console errors to Cloud Logging, and there are no errors there either.

chasers commented 3 years ago

Where are you hosting?

darksheik commented 3 years ago

We're running Ubuntu on an AWS EC2 instance. Launching remote console with:

bin/<app> remote_console

chasers commented 3 years ago

We'll continue to look into this btw. Don't want that nonsense there when you're in remote console.

darksheik commented 3 years ago

I will see if I can duplicate by building a release locally on my dev machine.

darksheik commented 3 years ago

Was able to build a release on my dev machine and connect to it with a remote console, but I don't see the error. I'll keep poking around for what might be unique to our staging/production environments.

darksheik commented 3 years ago

My mistake. I had to remember to actually add LogflareLoggerBackend to my dev environment. Now I absolutely do see the error, and the behavior is similar. One instance of the error occurs on startup, then connecting with a remote console causes it to occur once per second.

We're building our releases with distillery, if that helps. Another consideration is that we're running an umbrella project, and LogflareLoggerBackend is set up in one of the umbrella apps that all the others depend on.
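For reference, the backend is configured in that shared umbrella app roughly like this (the keys under :logflare_logger_backend are approximate; see the README for the exact names):

config :logger,
  backends: [:console, LogflareLogger.HttpBackend]

config :logflare_logger_backend,
  api_key: "your-api-key",
  source_id: "your-source-id"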

darksheik commented 3 years ago

A couple more observations about this:

chasers commented 3 years ago

Thanks. Yeah, we tried all morning to reproduce this and can't do it. If you can provide a public repo that reproduces the error, that would help us a lot. I'll go ahead and turn this into a new issue as well.