MrThearMan / graphene-django-query-optimizer

Automatically optimize SQL queries in Graphene-Django schemas.
https://pypi.org/project/graphene-django-query-optimizer/
MIT License

`QueryOptimizerStore` overfetches rows in `.prefetch_related` when used with Relay #61

rsomani95 closed this issue 8 months ago

rsomani95 commented 9 months ago

I have read the docs thoroughly before making this bug report.

I have read through other open issues, and my issue is not a duplicate.

What version of the library are you using?

0.1.9

Which python version are you using?

3.12

What operating system are you on?

Mac

Description

Hi. Firstly, this library looks incredible and I want to thank you for your work on it. In my testing so far, it already seems better than the most recent working fork I'm aware of, while also being easier to work with :)

I found one edge case where the prefetching seems suboptimal in a relay setting. I'm testing with this query from the test models:

{
  pagedHousingCompanies {
    edges {
      node {
        id
        name
        realEstates {
          edges {
            node {
              name
            }
          }
        }
      }
    }
  }
}

I've set the relay page size to 3, so in this case only 3 realEstates are returned (1 per housing company):

Returned JSON data with page size = 3:

```json
{
  "data": {
    "pagedHousingCompanies": {
      "edges": [
        {
          "node": {
            "id": "SG91c2luZ0NvbXBhbnlOb2RlOjk=",
            "name": "Avila, Singh and Harvey",
            "realEstates": {
              "edges": [{ "node": { "name": "fillsoonmeasure" } }]
            }
          }
        },
        {
          "node": {
            "id": "SG91c2luZ0NvbXBhbnlOb2RlOjE0",
            "name": "Burns-Ewing",
            "realEstates": {
              "edges": [{ "node": { "name": "trueoridea" } }]
            }
          }
        },
        {
          "node": {
            "id": "SG91c2luZ0NvbXBhbnlOb2RlOjEx",
            "name": "Cole and Sons",
            "realEstates": {
              "edges": [{ "node": { "name": "historyresourcecivil" } }]
            }
          }
        }
      ]
    }
  }
}
```

This is where I've applied this setting in my fork



When doing so, I'd expect the optimizer to prefetch only the 3 realEstates we need, and not all of them. However, it seems to be prefetching all of the realEstates (20 of them). I figured this out by placing some debugging code inside QueryOptimizerStore.optimize_queryset: https://github.com/rsomani95/graphene-django-query-optimizer/blob/cc6e42e7bd5f7b07806c703c75ac5666c3799cbe/query_optimizer/store.py#L94-L105

which prints the size of the prefetched queryset in my terminal:

2024-02-21 19:22:56.031 | DEBUG    | query_optimizer.store:optimize_queryset:101 - Compiling for model: <class 'tests.example.models.HousingCompanyProxy'>
2024-02-21 19:22:56.037 | DEBUG    | query_optimizer.store:optimize_queryset:105 - Prefetch model (class,count) : [(<class 'tests.example.models.RealEstate'>, 20)]

Thoughts?


PS: My fork has a more minimal setup than your test setup. However, I did reproduce the issue on your main branch as well. Let me know if/how I can help out. Thanks!

vade commented 9 months ago

Hi there! I'm a colleague of @rsomani95, and firstly I also want to thank you for the library. It's a breath of fresh air.

One thing that might be very helpful in some use cases for the prefetching optimizer is the ability to add business logic, facets, or filters.

You can imagine a scenario where you only want to prefetch items for a specific user, for example. There's no need to burden Python or the database with serializing extra objects or fetching more rows than necessary.

I’m not sure where the best place to add this sort of prefetch filtering would be, but I hope that you might consider it.

For our use case, it would make a phenomenal difference in performance.

Thanks again

MrThearMan commented 9 months ago

Hi! Thank you for your compliments!

First, the actual fetching and caching process does not happen at the point where you have placed your debugging. It happens here, which you would normally reach from here. However, in the case of relay connection fields, we cannot evaluate the queryset (fetch its data from the database) and cache the results yet, because we still need to take care of pagination, which the connection field itself is responsible for, not the object type the optimizer is on. That's what the optimizer.cache_result check is for (set here): it delays the queryset evaluation until here, after the base connection field has done a count query and added the pagination limits specified by settings and/or connection filter values. This is why we need the custom connection field.

BTW, there is a better way to check the actual executed database queries, which is by using an execute_wrapper. This is how I'm currently testing the library. In my testing, I've observed no overfetching so far (though I'm still a bit suspicious of nested connection fields, so let me know if you find anything).
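For reference, a minimal sketch of such a wrapper, using Django's documented connection.execute_wrapper (the HousingCompany import is an assumption based on the test app):

```python
from django.db import connection

from tests.example.models import HousingCompany  # assumed test-app model

captured: list[str] = []

def capture_sql(execute, sql, params, many, context):
    # Record every SQL statement, then pass it through unchanged.
    captured.append(sql)
    return execute(sql, params, many, context)

with connection.execute_wrapper(capture_sql):
    list(HousingCompany.objects.all()[:3])

for sql in captured:
    print(sql)  # raw SQL, including any LIMIT clauses
```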

As for adding business logic / additional filtering, you can define a filter_queryset(queryset, info) method on each object type you want to do additional filtering for. The optimizer will find this method, and since the queryset is not evaluated there, no additional queries are performed. This method is called for both single and multiple row access (nodes and connections), and it's the place you'd want to use to e.g. filter out rows the current user doesn't have access to, without blocking access entirely using permission checks. There is a relevant discussion in this repo as well.
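A minimal sketch of that hook, assuming it is defined as a classmethod on the optimizer's DjangoObjectType, and assuming a hypothetical owner field on the model (neither is part of the test schema):

```python
from query_optimizer import DjangoObjectType  # import path assumed

from tests.example.models import RealEstate

class RealEstateNode(DjangoObjectType):
    class Meta:
        model = RealEstate
        fields = "__all__"

    @classmethod
    def filter_queryset(cls, queryset, info):
        # Hypothetical business rule: only expose rows owned by the
        # requesting user. Filtering here doesn't evaluate the queryset,
        # so no additional queries are performed.
        return queryset.filter(owner=info.context.user)
```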

I see now that this is not documented yet; it's only in my graphene-django-extensions docs, so I'll put a reminder to update it here as well.

Side note: I'm currently introducing this library at work, so things might move around as I encounter more edge cases there; hopefully it doesn't cause too much trouble for you.

rsomani95 commented 9 months ago

@MrThearMan thanks so much for the detailed response. Based on your comments, I've updated how I was going about debugging this, and it seems like there actually isn't any overfetching happening! I didn't realise that I could also have checked this via the returned table in the debug panel's detailed view (I was only using that to see the number of queries, not the actual rows fetched).

My main source of confusion is actually Django-related: when we do a .prefetch_related(), are those items prefetched immediately or lazily? I think the latter, but it's hard to say exactly what's going on sometimes... I've run into scenarios with the other optimizer library where, query-wise, it was doing the right things, but behind the scenes it was prefetching all items of the table and discarding them. What's the best way to test a scenario like that? This is a bit beyond my scope of understanding, and I'd love to hear your thoughts here.

I'm going to test this more robustly in the coming days as I port our existing project to use this library as well. I'll leave the issue open in case there are any hiccups / edge cases I bump into.


BTW, there is a better way to check the actual executed database queries, which is by using an execute_wrapper. This is how I'm currently testing the library. In my testing, I've observed no overfetching so far (though I'm still a bit suspicious of nested connection fields, so let me know if you find anything).

I noticed that, but as I understand it, this tests the number of queries made, not whether each query fetches the exact number of rows required. In theory, you could fetch all rows from a table when you don't need them, but as long as you're only making the minimum required number of queries, these tests would still pass?


Re. filtering - that looks great. We'll try it out and report back if there are any issues. Thanks again!

MrThearMan commented 9 months ago

My main source of confusion is actually Django-related: when we do a .prefetch_related(), are those items prefetched immediately or lazily? I think the latter, but it's hard to say exactly what's going on sometimes... I've run into scenarios with the other optimizer library where, query-wise, it was doing the right things, but behind the scenes it was prefetching all items of the table and discarding them. What's the best way to test a scenario like that?

The items are fetched immediately after the parent model is fetched. The prefetch_related_objects function populates a _prefetched_objects_cache dictionary with the querysets for the prefetches, while populating their _result_cache with the fetched database rows. After this, the prefetched objects can be fetched with instance.my_related_objects.all(), which ends up in either here or here (depending on the relation), where the _prefetched_objects_cache is then used to return the data without additional database queries.
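To make that concrete, a small sketch against the test models (the apartments related name is taken from the Prefetch example further down this thread):

```python
from django.db.models import prefetch_related_objects

from tests.example.models import Building

buildings = list(Building.objects.all())           # query 1: the parent rows
prefetch_related_objects(buildings, "apartments")  # query 2: all related rows at once

first = buildings[0]
print(first._prefetched_objects_cache)  # {'apartments': <QuerySet [...]>}

apartments = first.apartments.all()  # served from the cache; no third query
```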

Now, one big gotcha here is that (almost) no changes can be made to the related queryset after the prefetch if you want to retain the cache. This is because most queryset operations create a clone of the queryset, which invalidates both the _prefetched_objects_cache (an attribute added dynamically only when a prefetch is made, and not carried over to the clone) and the _result_cache (also not carried over; it is set to None in the initializer). This is for good reasons, of course, since the queryset's data would change with additional filters, but for the optimizer, we need to make sure that there aren't any "hidden functions" that clone the queryset anywhere between fetching it and serializing it for output.
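In code, the difference looks like this (same assumed test models; every filter/order_by call clones the queryset and therefore re-queries):

```python
from tests.example.models import Building  # assumed test-app model

building = Building.objects.prefetch_related("apartments").first()

building.apartments.all()                 # cache hit: no new query
building.apartments.filter(pk__gte=1)     # clone: cache lost, new query runs
building.apartments.all().order_by("pk")  # clone again: another new query
```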

This is to the best of my knowledge, and still glosses over some details, so take it with a grain of salt.

So prefetches are never lazy if you actually use prefetch_related, but you might accidentally invalidate them and never receive the data. And since it's not possible to keep using the same queryset for the whole duration of the request, I've implemented a per-request cache where the queryset data is saved, and fetched from, when moving between different object types in the GraphQL request. See more info in the technical details.

As for testing this, you can try placing a debugger here, and check that when the GraphQL "resolver chain" ends up in a get_node resolver, it fetches the correct row from the cache. For get_queryset resolvers, we just use the queryset caches, and I've made sure with tests that they are not invalidated anywhere in the middle. So check the executed queries, compare them against what you would expect to happen, and you might find an edge case where it doesn't work.

I noticed that, but as I understand it, this tests the number of queries made, not whether each query fetches the exact number of rows required. In theory, you could fetch all rows from a table when you don't need them, but as long as you're only making the minimum required number of queries, these tests would still pass?

You're right, I should be checking that the SQL queries actually contain the LIMIT clauses to limit the number of rows. I have checked this for single-level connections, but not for nested ones. I should add those tests.
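A rough sketch of such a test, reusing the wrapper from above (the /graphql endpoint and the query are placeholders):

```python
from django.db import connection
from django.test import Client

captured: list[str] = []

def capture_sql(execute, sql, params, many, context):
    captured.append(sql)
    return execute(sql, params, many, context)

query = "{ pagedHousingCompanies { edges { node { id } } } }"

with connection.execute_wrapper(capture_sql):
    Client().post("/graphql", {"query": query}, content_type="application/json")

# Counting queries is not enough; also assert the row limit made it into the SQL.
selects = [sql for sql in captured if sql.lstrip().startswith("SELECT")]
assert any("LIMIT" in sql for sql in selects)
```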

MrThearMan commented 9 months ago

Ok, so I tested how LIMIT works with prefetching, and it seems it is not applied. From searching online, this appears to be a common Django issue with prefetches. Here are the queries executed in tests.test_optimizer.test_optimizer_relay_connection_nested_filtered:

---------------------------------------------------------------------------
>>> Queries (3):
1) ------------------------------------------------------------------------
SELECT COUNT(*) AS "__count"
FROM "example_building"
2) ------------------------------------------------------------------------
SELECT "example_building"."id"
FROM "example_building"
ORDER BY "example_building"."name" ASC
LIMIT 10
3) ------------------------------------------------------------------------
SELECT "example_apartment"."id",
       "example_apartment"."building_id"
FROM "example_apartment"
WHERE ("example_apartment"."rooms" IS NOT NULL
       AND "example_apartment"."building_id" IN (3,
                                                 20,
                                                 17,
                                                 6,
                                                 10,
                                                 18,
                                                 9,
                                                 5,
                                                 1,
                                                 2))
ORDER BY "example_apartment"."street_address" ASC,
         "example_apartment"."stair" ASC,
         "example_apartment"."apartment_number" DESC
---------------------------------------------------------------------------

Seems like I would need to add some sort of window function in order to support "limiting" nested connection fields properly:

# Imports for this snippet (models are from the test app):
from django.db import models
from django.db.models.functions import RowNumber

from tests.example.models import Apartment, Building

limit_parent = 2
limit_child = 2
buildings = (
    Building.objects.prefetch_related(
        models.Prefetch(
            "apartments",
            queryset=(
                Apartment.objects.alias(
                    _row_number=models.Window(
                        expression=RowNumber(),
                        partition_by=models.F("building_id"),
                    ),
                )
                .filter(_row_number__lte=limit_child)
                .only("pk", "building_id")
            ),
        )
    )
    .only("pk")
    .all()[:limit_parent]
)

The generated SQL is "not that great":

---------------------------------------------------------------------------
>>> Queries (2):
1) ------------------------------------------------------------------------
SELECT "example_building"."id"
FROM "example_building"
ORDER BY "example_building"."name" ASC
LIMIT 2
2) ------------------------------------------------------------------------
SELECT "col1",
       "col2"
FROM
  (SELECT *
   FROM
     (SELECT "example_apartment"."id" AS "col1",
             "example_apartment"."building_id" AS "col2",
             ROW_NUMBER() OVER (PARTITION BY "example_apartment"."building_id") AS "qual0",
                               "example_apartment"."street_address" AS "qual1",
                               "example_apartment"."stair" AS "qual2",
                               "example_apartment"."apartment_number" AS "qual3"
      FROM "example_apartment"
      WHERE "example_apartment"."building_id" IN (6,
                                                  17)
      ORDER BY "example_apartment"."street_address" ASC, "example_apartment"."stair" ASC, "example_apartment"."apartment_number" DESC) "qualify"
   WHERE "qual0" <= 2 ) "qualify_mask"
ORDER BY "qual1" ASC,
         "qual2" ASC,
         "qual3" DESC
---------------------------------------------------------------------------

But this seems like the correct way to do this, as I see it. Could be a little tricky to get this to generate for prefetches...

MrThearMan commented 9 months ago

Another question is what pagination for nested connection fields even means. I think it's not really something that should happen in the nested connection field, since to use it, you would need to fetch all of the parent entities again. Rather, if more rows are required, the client should start using a main-level endpoint for the object type, starting from where the first query left off.

What sort of use case are you looking for with nested connection fields? In all my uses so far, if I need nested entities from a main-level entity, I always want all of them, or I want to filter them, but not paginate them (see my DjangoFilterListField). Therefore, I've never needed nested connection fields.

vade commented 9 months ago

Hey there!

Thank you for all of the thoughtful comments.

Another question is what pagination for nested connection fields even means. I think it's not really something that should happen in the nested connection field, since to use it, you would need to fetch all of the parent entities again. Rather, if more rows are required, the client should start using a main-level endpoint for the object type, starting from where the first query left off.

I agree - we can't really do nested pagination in a meaningful way, since the secondary cursor would affect all downstream secondary items regardless of the count or state. In actuality, we do what you are describing and make fresh queries just for those sub-items (which is a more RESTful approach, and works).

I don't want to speak for @rsomani95, but what we were fishing for there is: is the nested pagination also optimal? It sounds like it is, and with your optimizer we have the capability to make it even more so with some additional business logic, which is freaking awesome.

For our use case, without boring you with the details: we have many hundreds to thousands of top-level items, with anywhere from 1 to thousands of nested items, each of which may also have 0 to 100s of sub-items (this is the worst of worst cases, but possible).

As we scale our data, we very clearly have potential bottlenecks around poor SQL queries.

So most of our concern is about how we can effectively write queries and avoid N+1 (+ M + O + P, haha), and how effectively the backend can be designed to do smart things.

Thanks again - we're really excited about this project!

rsomani95 commented 9 months ago

@MrThearMan appreciate the detailed response. I'm still digesting all that info, but wanted to point something out quickly.

But this seems like the correct way to do this, as I see it. Could be a little tricky to get this to generate for prefetches...

I think you're right. The author of strawberry-django is dealing with the same issue and seems to have come to exactly the same conclusion as you. See here: https://github.com/strawberry-graphql/strawberry-django/issues/337#issuecomment-1761511574

MrThearMan commented 8 months ago

what we were fishing for there is: is the nested pagination also optimal?

As this comment shows, nested connection fields are not "limited" correctly at the moment, so you would fetch all of your nested entities into memory even if only a handful of them were actually used. The output seems to work fine, so I'm guessing the rest are dropped somewhere, but I'm not sure where.

I think I could have a crack at implementing this. At first, I could just get it to limit rows based on RELAY_CONNECTION_MAX_LIMIT, and then on offset/before/after/first/last.
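A rough sketch of that direction, building on the window function above (nested_limit, the partition field default, and the first/after-to-row-number mapping are all illustrative, not the library's API):

```python
from django.db import models
from django.db.models.functions import RowNumber

def nested_limit(queryset, *, first=None, after=0, partition_field="building_id"):
    # Keep only rows with `after < row_number <= after + first` within each
    # parent partition. Requires Django 4.2+ (filtering against window functions).
    qs = queryset.alias(
        _row_number=models.Window(
            expression=RowNumber(),
            partition_by=models.F(partition_field),
        ),
    ).filter(_row_number__gt=after)
    if first is not None:
        qs = qs.filter(_row_number__lte=after + first)
    return qs
```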

MrThearMan commented 8 months ago

Small update on this. To support the window function limiting, I would need to drop support for Django 4.0 & 4.1, since filtering against window functions was only added in Django 4.2. Given that Django officially recommends dropping support for end-of-life versions, I think this shouldn't be too much of an issue.

MrThearMan commented 8 months ago

Ok, managed to get this working with the window functions! All the pagination parameters should work as expected, and order_by and model ordering are also accounted for when partitioning. Merged this to main, will make a release in a moment.

rsomani95 commented 8 months ago

@MrThearMan that's awesome! Can't wait to try it out

vade commented 8 months ago

Amazing! This is huge news @MrThearMan

MrThearMan commented 8 months ago

Released in 0.2.0