elixir-ecto / ecto_sql

SQL-based adapters for Ecto and database migrations
https://hexdocs.pm/ecto_sql
Apache License 2.0
578 stars 312 forks source link

Add option to generate generic explain plans for Postgres #604

Closed greg-rychlewski closed 7 months ago

greg-rychlewski commented 7 months ago

This was brought up in the mailing list. A lot of the times Postgres will end up choosing a generic plan for the same parameterized query if it's used more than 5 times. At which point our current explain logic can be misleading because it's based on the plan created specifically for the supplied parameters.

There is a way you can get Postgres to generate the generic explain plan detailed here: https://www.cybertec-postgresql.com/en/explain-that-parameterized-statement/. Basically you have to use the prepare command explicitly. Otherwise it treats as a custom plan.

Besides the integration tests, there's also another sanity test that can be done to convince ourselves it's working properly. First create an index on :title in the :posts table then run this test:

@tag :generic
  test "generic does not use index" do
    for _ <- 1..100000 do
      TestRepo.insert(%Post{title: "hi"})
    end

    TestRepo.query!("ANALYZE posts")
    q1 = from p in Post, where: p.title == ^"hi2"
    q2 = from p in Post, where: p.title == ^"hi"

    # q1 should use the index if it's a custom plan but not the generic plan
    # the custom plan will know that there are no values where title = "hi2", therefore an index scan will be fast
    # the generic plan mimics the behaviour for "hi", where a seq scan is most efficient because all rows have that value
    IO.inspect TestRepo.explain(:all, q1, generic_plan: true)
    IO.inspect TestRepo.explain(:all, q1)
    IO.inspect TestRepo.explain(:all, q2, generic_plan: true)
    IO.inspect TestRepo.explain(:all, q2)
  end
greg-rychlewski commented 7 months ago

I just realized I should probably call some things out to see what you guys think. In the mailing list discussion their proposal was a bit different.

I believe they are saying we should take the actual prepared statement name from Ecto's cache and put it in the :cache_statement option when we call Postgrex.query. Then we get the same behaviour that would really happen.

The issue I see with this is that a lot of the time people will be debugging bad queries locally or in a sandbox environment when the issue is happening in a production system. So it seems more valuable to be able to control what kind of plan you want to see.

The caveat here is that analyze is essentially useless. Since it runs the actual query we need to be able to inline the text representation of all the parameters but there isn't really a good way to do that. That's why we are using NULL placeholder, as recommended in the reference in my initial post. The analyze option is actually explicitly disallowed by Postgres as well in their official implementation (only available since postgres 16).

So then we have the option to raise when analyze is specified when generic_plan is true or to document the caveat. I would personally raise and then if sometime in the future Postgres allows it, someone will complain to us that we need to enable it.

josevalim commented 7 months ago

Maybe we can make :generic_plan a special key of :analyze? analyze: :generic_plan? Or that does not make sense? I am not familiar with the EXPLAIN API in Postgrex, so I am more than happy to defer to you. I also wonder which behaviour should be the default?

greg-rychlewski commented 7 months ago

Oh :analyze is orthogonal to whether it is generic or not. Analyze is whether the actual query is run and the real time it takes is reported. If analyze is not specified then only an estimated time is given. And generic plan is whether Postgres plans the query execution ignoring the specific parameter values.

It is useful to be able to see the explain for that because the default behaviour of Postgres is to use a generic plan after 5 executions of the same query if it's not much worse than the average of the first 5 runs (which are using plans specific to the values of the parameters). So being able to see both types plans give you the full picture while you're debugging.

I would personally default to false because it mirrors the default in Postgres. And it's less known/used and doesn't play nice with some other options.

josevalim commented 7 months ago

@greg-rychlewski alright, sounds good to me. One last question before you merge: should it be called generic_plan: true or should it be: plan: :auto | :generic | :custom? That would match PostgreSQL more closely and give us more options in the future. :)

greg-rychlewski commented 7 months ago

Yeah definitely that's a lot better. I think just :custom or :generic though because :auto is how it behaves over time (try custom plans 5 times in a row then compare the average of those to the generic plan on the 6th try and use the better one going forward).

And then there are just two other things I was hoping to your opinion on:

  1. Before Postgres 16, Postgres didn't have any built-in way to show you a generic explain plan with a single command. The way it is implemented in this PR is the workaround people used before then, as described in this article. Would you prefer to use the built-in way? I implemented the workaround so more people can use it.
  2. Postgres's built-in way does not allow :analyze to be used because it requires actual parameter values to run the query. The workaround does not prohibit analyze but the results can be misleading because we are not using the actual supplied parameter values we are using NULL, which can give misleading results if people are expecting it to use the actual parameter values. I can either document this or raise. Do you have a preference? My preference is to raise.
josevalim commented 7 months ago
  1. It is up to you. We can also make it three different options: :custom, :generic, :fallback_generic. And then we can deprecate fallback generic in the future once we can reliably expect most people to be running on PG16.

  2. :+1: for raising!

greg-rychlewski commented 7 months ago

Thanks for the good suggestions. When you have time would you be able to give it one more look over?

Also I tried to get Postgres 16 working on the CI for the :generic option but something weird is happening with it. If it's alright with you, I'll investigate it locally and fix it in another PR?

josevalim commented 7 months ago

If it's alright with you, I'll investigate it locally and fix it in another PR?

Yup, sounds good to me!