Closed darksheik closed 3 years ago
We were definitely exceeding the `sync_threshold`. I captured this during troubleshooting. I assume the `:httpc.handle_answer/3` call is waiting for an acknowledgement from Logflare.
Thanks for this report. I don't think I've experienced this, and I've sent thousands of events per second from our production servers before when load testing.
Seems like I should be able to replicate it if I set the Logflare URL to something that just doesn't exist.
We'll figure it out and get a fix out. I'll let you know if I need any more info. Thanks again.
@gvaughn it looks like you're on Elixir 1.10? Can you upgrade to the latest Elixir? Seems like this could be part of the problem:
https://github.com/elixir-lang/elixir/issues/10420
This was confusing because our backend uses Hackney and hackney doesn't use httpc.
Our backend causing sync mode is another issue. On accepting messages from Logger, all we do is cache them in batches, but it looks like `flush!` would block the GenServer loop if our backend responds abnormally slowly, which was the case for a bit yesterday. We'll wrap that in a Task, which I think will fix this.
I'll do some more load testing forcing sync mode too. And make sure that our backend crashing will never have adverse effects on your app.
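For illustration, the "wrap it in a Task" idea above could look something like this sketch. The function and field names (`api_request/1`, `state.batch`) are placeholders, not the backend's actual API:

```elixir
# Hypothetical sketch: fire-and-forget the HTTP flush so the :gen_event
# handler loop that Logger runs backends in is never blocked by a slow
# or unresponsive Logflare backend.
def flush!(state) do
  batch = state.batch

  Task.start(fn ->
    # A slow response or crash here is isolated to the Task process
    # and cannot stall the Logger event loop.
    api_request(batch)
  end)

  # Clear the batch immediately instead of waiting on the request.
  %{state | batch: []}
end
```

The trade-off of fire-and-forget is that a failed request silently drops that batch, so a real implementation would likely want retries or a supervised task.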
Yeah, we looked at that issue. I'm not convinced that's us, because we don't have `handle_otp_reports` enabled and our stacktrace is not stuck in `:gen.do_call/4`. However, now that you tell me that logflare_backend uses hackney, I can't explain our stacktrace.
We're looking into upgrading Elixir, plus probably customizing more Logger settings.
@chasers I haven't really dug into Logger much before. If I'm reading the docs right and we set the `sync_threshold` and `discard_threshold` to the same value, I think we'd avoid sync mode completely. Is that your understanding? That might be a decent stopgap/insurance for us.
@gvaughn yeah, I think so. I would maybe set them at 1000. Can you email me your Logflare source id? Or the email you used to sign up with? Email support@logflare.app.
I am QAing some updates currently btw.
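As a sketch, setting both thresholds to the suggested 1000 would look like this in application config (option names per the Elixir Logger docs; the value is just the example from this thread):

```elixir
# config/config.exs — with sync_threshold == discard_threshold, Logger
# goes straight from async mode to discard mode under load, so callers
# are never switched into blocking sync mode.
config :logger,
  sync_threshold: 1_000,
  discard_threshold: 1_000
```

The cost of this stopgap is that log messages are dropped rather than applied as backpressure once the message queue exceeds the threshold.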
@gvaughn @darksheik I've got rc.4 published on Hex and running on a test system and rolled out to our production cluster just now. All looks good so far.
@chasers Thanks so much! We can try it out in a bit.
We noticed that you have moved to Mint + Finch, but `hackney` is still a dependency. Is that an oversight, or do you still need it?
One other note: we got this warning:

```
==> logflare_api_client
warning: the dependency :logflare_api_client requires Elixir "~> 1.11" but you are running on v1.10.4
```

Is that really dependent on 1.11 features? We've deployed it to our staging environment with no issues so far, but wanted to double-check.
Good call on both ... we will clean up the deps shortly.
@gvaughn @darksheik rc.6 is published for your testing pleasure.
We've already got rc.4 running in prod, and it has been working great through our busiest time of the day.
Great! I'm going to let this run on prod until tomorrow and then bump the official version.
@chasers Everything seems to be working in prod, however I just started a remote console to my server and I'm seeing this message repeating every second or so.
```
21:43:30.877 [error] GenServer #PID<0.140.0> terminating
** (stop) {:EXIT, {%RuntimeError{message: "could not lookup Ecto repo LogflareLogger.Repo because it was not started or it does not exist"}, [{Ecto.Repo.Registry, :lookup, 1, [file: 'lib/ecto/repo/registry.ex', line: 19]}, {Ecto.Repo.Queryable, :execute, 4, [file: 'lib/ecto/repo/queryable.ex', line: 210]}, {Ecto.Repo.Queryable, :all, 3, [file: 'lib/ecto/repo/queryable.ex', line: 17]}, {LogflareLogger.BatchCache, :flush, 1, [file: 'lib/logflare_logger/batch_cache.ex', line: 44]}, {LogflareLogger.HttpBackend, :flush!, 1, [file: 'lib/logflare_logger/http_backend.ex', line: 113]}, {:gen_event, :server_update, 4, [file: 'gen_event.erl', line: 620]}, {:gen_event, :server_notify, 4, [file: 'gen_event.erl', line: 602]}, {:gen_event, :handle_msg, 6, [file: 'gen_event.erl', line: 390]}]}}
Last message: {:gen_event_EXIT, LogflareLogger.HttpBackend, {:EXIT, {%RuntimeError{message: "could not lookup Ecto repo LogflareLogger.Repo because it was not started or it does not exist"}, [{Ecto.Repo.Registry, :lookup, 1, [file: 'lib/ecto/repo/registry.ex', line: 19]}, {Ecto.Repo.Queryable, :execute, 4, [file: 'lib/ecto/repo/queryable.ex', line: 210]}, {Ecto.Repo.Queryable, :all, 3, [file: 'lib/ecto/repo/queryable.ex', line: 17]}, {LogflareLogger.BatchCache, :flush, 1, [file: 'lib/logflare_logger/batch_cache.ex', line: 44]}, {LogflareLogger.HttpBackend, :flush!, 1, [file: 'lib/logflare_logger/http_backend.ex', line: 113]}, {:gen_event, :server_update, 4, [file: 'gen_event.erl', line: 620]}, {:gen_event, :server_notify, 4, [file: 'gen_event.erl', line: 602]}, {:gen_event, :handle_msg, 6, [file: 'gen_event.erl', line: 390]}]}}}

21:43:31.879 [error] :gen_event handler LogflareLogger.HttpBackend installed in Logger terminating
** (RuntimeError) could not lookup Ecto repo LogflareLogger.Repo because it was not started or it does not exist
    (ecto 3.5.6) lib/ecto/repo/registry.ex:19: Ecto.Repo.Registry.lookup/1
    (ecto 3.5.6) lib/ecto/repo/queryable.ex:210: Ecto.Repo.Queryable.execute/4
    (ecto 3.5.6) lib/ecto/repo/queryable.ex:17: Ecto.Repo.Queryable.all/3
    (logflare_logger_backend 0.8.0-rc.4) lib/logflare_logger/batch_cache.ex:44: LogflareLogger.BatchCache.flush/1
    (logflare_logger_backend 0.8.0-rc.4) lib/logflare_logger/http_backend.ex:113: LogflareLogger.HttpBackend.flush!/1
    (stdlib 3.13) gen_event.erl:620: :gen_event.server_update/4
    (stdlib 3.13) gen_event.erl:602: :gen_event.server_notify/4
    (stdlib 3.13) gen_event.erl:390: :gen_event.handle_msg/6
Last message: :flushed:
```
Should I open a separate issue for this?
Yeah that's no good. Odd. No need for another issue. We'll fix.
If you think rc.6 would fix it, we can give that a shot.
No, this is something else. Perhaps the Elixir version. I just checked the console on our prod cluster and it's not there.
`BatchCache.flush` is querying an Ecto Repo?
The BatchCache is a Repo backed by Etso (which makes an Ecto Repo backed by ETS). Curious that it's even working if it's not started.
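For context, an Etso-backed repo follows Etso's documented usage and looks like an ordinary Ecto repo with a different adapter. This is an illustrative sketch, not the backend's actual source:

```elixir
# Sketch of an ETS-backed Ecto repo via Etso; the otp_app name here
# is assumed for illustration.
defmodule LogflareLogger.Repo do
  use Ecto.Repo,
    otp_app: :logflare_logger_backend,
    adapter: Etso.Adapter
end
```

Like any Ecto repo, it has to be started under a supervision tree before queries will work, which is consistent with the "was not started or it does not exist" error above.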
@chasers I have some new information for you. I don't see the above error about the Repo in the server logs until I launch a remote console into our running release. Then I see it appear every second. So it seems directly related.
The same is true if I try to run migrations through the release commands.
So that's some good news and explains why we're not seeing any errors reported to our error tracking service.
Huh ... yeah, I confirmed I am not seeing those errors either. My test app is running on Gigalixir and the log drain goes to Logflare, so I can see any console errors there. And on Google Cloud our instances log console errors to Cloud Logging, and there are no errors there either.
Where are you hosting?
We're running Ubuntu on an AWS EC2 instance. Launching the remote console with:

```
bin/<app> remote_console
```
We'll continue to look into this, btw. We don't want that nonsense there when you're in a remote console.
I will see if I can duplicate by building a release locally on my dev machine.
Was able to build a release on my dev machine and connect to it with a remote console, but I don't see the error. I'll keep poking around for what might be unique to our staging/production environments.
My mistake. I had to remember to actually add `LogflareLoggerBackend` to my dev environment. Now I absolutely do see the error, and the behavior is similar: one instance of the error occurs on startup, then connecting with a remote console causes it to occur once per second.
We're building our releases with `distillery`, if that helps. Another consideration is that we're running an umbrella project; LogflareLoggerBackend is set up in one of the umbrella apps that all the others depend on.
A couple more observations about this:

```elixir
Logger.info("test")
```

Thanks, yeah we tried all morning to reproduce it and can't. If you can provide a public repo that reproduces this error, that would help us a lot. I'll go ahead and turn this into a new issue as well.
Sorry if this is vague, but we had a strange situation on two separate days where our system seemed to be locking up completely and we couldn't pin it down. By process of elimination, we removed `LogflareLogger.HttpBackend` from our logger config, and everything was fine with only our file backend running.

In the situation below, a simple database call was hanging and ultimately died, and every other process that was trying to log something at the same time also hung until it crashed and restarted. Then we'd get a burst of activity for the better portion of a minute and it would lock up again. This was happening on both of our production nodes.
Are there known issues with the backend that might cause this, or are we possibly attempting to log more than it can handle? Is there a setting we might be able to adjust?
This started occurring today at around 2p ET. There is usually an increase in logging activity in our system around this time. One of our developers thinks perhaps we exceeded this, but it's just a guess: