Collect Ecto stacktraces

gmile commented 1 year ago

It's nice to see slow queries coming from Ecto in the dashboard. However, oftentimes it's really hard to match the query to the code that emits it. For example, I'm looking at a vast codebase that spans over a 1000 of modules, and it's really challenging sometimes to locate Elixir/Ecto code that produces slow query.

For any given query reported in AppSignal, it would be great to be able to trace it back to the code that produced it.

Ecto reports stacktraces in telemetry events for a while now, see: https://github.com/elixir-ecto/ecto/pull/3798. It would be really useful to have that in AppSignal.

There's also a discussion about this topic on ElixirForum atm: https://elixirforum.com/t/tracking-down-slow-queries-in-ecto/58121

crbelaus commented 1 year ago

This would be a very welcome improvement. I am also in the process of fixing some slow queries in my application (also very big) and sometimes it is really hard to correlate the query with the code that triggers it.

jeffkreeftmeijer commented 1 year ago

Hey Ievgen and Cristian,

I do agree that it might be useful to know where queries are executed, but we're currently not set up for that. Right now, events don't support stack traces unless they have an error attached to them.

Also, I'm not sure what the performance impact of reporting stack traces for every query would be. Not specifically on AppSignal's side, as adding data to the span should be fairly quick, but calculating the stack trace might have an impact on high-throughput applications, so I'm not sure if I'd recommend this for production use.

I'll keep this issue open to allow for some discussion about this and to maybe pick this up in the future, but it's not something we'll be picking up in the coming period, I'm afraid.

crbelaus commented 1 year ago

While fixing slow queries I've found two different situations:

Slow queries that are directly executed in a Phoenix request or an Oban job are correctly linked to the code that runs them. If there is an error, it show in the "web" or "oban" namespace accordingly.
Other queries seem not to be linked to any part of our codebase so there is no way to find the offending code other than searching for the query manually. In this case errors are not linked to the code either and show in the "background" namespace.

I think that the second kind of query is the problematic one. When there are errors I always see the following in AppSignal's stacktrace:

lib/ecto/adapters/sql.ex:913 Ecto.Adapters.SQL.raise_sql_call_error/1
lib/ecto/adapters/sql.ex:828 Ecto.Adapters.SQL.execute/6
lib/ecto/repo/queryable.ex:229 Ecto.Repo.Queryable.execute/4
lib/ecto/repo/queryable.ex:19 Ecto.Repo.Queryable.all/3
lib/ecto/repo/preloader.ex:272 Ecto.Repo.Preloader.fetch_query/8
lib/task/supervised.ex:101 Task.Supervised.invoke_mfa/2
lib/task/supervised.ex:36 Task.Supervised.reply/4

The Ecto.Repo.Preloader is always present so this makes me think that those queries may be the result of a parallel preload, which is not currently being linked to the parent span, so AppSignal cannot correlate it with the proper action.

This has been reported on #603 and I think that it may be the same issue. In #603 there is a link to an ElixirForum post with a potential fix to link parallel preload queries to the parent span but I did not try it yet myself.

unflxw commented 1 year ago

@crbelaus @gmile We're considering supporting this feature by adding the last line of the backtrace (the one further from Ecto's internals) in the event. Here's a quick mockup with placeholder text:

You would be able to click on a "git" icon next to it and see the corresponding source code in GitHub. Would this be useful to you?

gmile commented 1 year ago

Hi, @unflxw! Happy to hear AppSignal is considering implementing this feature. Noted that you're considering to report a location line emitted right before Ecto's internals.

Regarding your screenshot, would lib/task/supervised.ex:36 be the location where the query originates from? Or is this just happens to be a non-ideal example?

My point is, the chosen line of backtrace should be pointing to a location in application code, and not from Ecto nor from Elixir internals. But on your screenshot it appears to be coming from Elixir internals, as suggested by lib/task/supervised.ex:36, which is less than ideal I think 🤔

unflxw commented 1 year ago

Hi @gmile! My apologies, it is placeholder text -- I quickly copy-pasted a line of an Elixir backtrace from your comment above.

Ideally we'd pick the line in the backtrace that is furthest from Ecto internals, as that line is the most likely one to be close to the developer's code, and as such meaningful to the developer. Going from the sample logs in https://github.com/elixir-ecto/ecto/pull/3798#issue-1078689866, that would be EctoBug.reproduce/1 :)

gmile commented 1 year ago

@unflxw noted that it's a sample glitch :) In this case the answer to your question is yes, this certainly would be very useful!

crbelaus commented 1 year ago

I also agree with @gmile, it would be extremely useful!

unflxw commented 11 months ago

Since this issue is seen as a workaround for #603, I have decided to first approach fixing #603 in isolation, then look into implementing this.

While it is true that having stacktrace information would work around the loss of context on the query caused by #603, the two are fundamentally different issues.

backlog-helper[bot] commented 2 months ago

This issue has not had any activity in 14 days. Please provide a status update if it is still relevant. Closed it if it is no longer relevant. Or move it to another column if it's blocked or requires another look at it. - (More info)

This is a message from the daily scheduled checks.

New issue guide | Backlog management | Rules | Feedback

appsignal / appsignal-elixir

Collect Ecto stacktraces #885