W1ldPo1nter / django-queryable-properties

Write Django model properties that can be used in database queries.
BSD 3-Clause "New" or "Revised" License
71 stars 1 forks source link

Access to normal model Field behaves like .defer() was applied to it #5

Closed ShaheedHaque closed 3 years ago

ShaheedHaque commented 3 years ago

I recently added usage of version 1.7.0 of this library to implement some features, with excellent results. However, I recently noticed that access to normal model fields (i.e. based on django.db.models.Field not QueryablePropertyDescriptor) was behaving as if each of the normal fields had had queryset.defer('field_name') applied to it. This causes an explosion in the number of round trips by roughly number-of-fields * queryset-instances, and thus terrible performance at even moderate scale.

Of course, I only noticed this when it got real data to play with, and long after merging, so any help would be most welcome :-).

Here are some details of a case that might be relevant:

Posting all the relevant code is rather impractical, but I am of course happy to provide any details (or debug anything) which might be helpful.

ShaheedHaque commented 3 years ago

I forgot to add...

I am using django-debug-toolbar to see what queries are used in the view. I can also use django-debug-toolbar's CLI to construct and run queries by hand. Using the CLI, I can see that the number and content of roundtrips does not seem to explode:

$ ./manage.py debugsqlshell
...
In [4]: qs = PayRun.objects.filter(company__legal_name='Advanced Mechanics').defer('snapshot')

In [5]: prefetch_queryable_properties(qs, 'last_run') 
SELECT t.oid,
       typarray
FROM pg_type t
JOIN pg_namespace ns ON typnamespace = ns.oid
WHERE typname = 'hstore' [0.83ms]
SELECT typarray
FROM pg_type
WHERE typname = 'citext' [0.16ms]
SELECT "viewflow_process"."id",
       "viewflow_process"."flow_class",
       "viewflow_process"."status",
       "viewflow_process"."created",
       "viewflow_process"."finished",
       "viewflow_process"."artifact_content_type_id",
       "viewflow_process"."artifact_object_id",
       "viewflow_process"."data",
       "paiyroll_payrun"."process_ptr_id",
       "paiyroll_payrun"."schedule_id",
       "paiyroll_payrun"."custom",
       "paiyroll_payrun"."on_demand",
       "paiyroll_payrun"."company_id",
       "paiyroll_payrun"."frequency",
       "paiyroll_payrun"."approved",
       "paiyroll_payrun"."target_t",
       "paiyroll_payrun"."actual_t",
       "paiyroll_payrun"."next_t",
       "paiyroll_payrun"."result",
       "paiyroll_payrun"."session_id"
FROM "paiyroll_payrun"
INNER JOIN "paiyroll_company" ON ("paiyroll_payrun"."company_id" = "paiyroll_company"."id")
INNER JOIN "viewflow_process" ON ("paiyroll_payrun"."process_ptr_id" = "viewflow_process"."id")
WHERE "paiyroll_company"."legal_name" = 'Advanced Mechanics'
ORDER BY "viewflow_process"."created" DESC [1.96ms]
SELECT "paiyroll_payrun"."process_ptr_id",
       COALESCE(to_char(
                          (SELECT U0."created"
                           FROM "viewflow_task" U0
                           INNER JOIN "viewflow_process" U1 ON (U0."process_id" = U1."id")
                           INNER JOIN "paiyroll_payrun" U2 ON (U1."id" = U2."process_ptr_id")
                           WHERE (U0."flow_task" = 'paiyroll/flows.payrun.PayRunFlow.run_payrun'
                                  AND U2."process_ptr_id" = "paiyroll_payrun"."process_ptr_id")
                           ORDER BY U0."created" DESC
                           LIMIT 1), 'YYYY-MM-DD HH24:MI:ss'), '') AS "last_run"
FROM "paiyroll_payrun"
INNER JOIN "viewflow_process" ON ("paiyroll_payrun"."process_ptr_id" = "viewflow_process"."id")
WHERE "paiyroll_payrun"."process_ptr_id" IN (129,
                                             132,
...
                                             383)
ORDER BY "viewflow_process"."created" DESC [4.84ms]

and the query then runs no extra round trips:

In [6]: for o in qs:
   ...:     print(o.frequency,o.last_run)           <<<<<<<<<<<<<<<< frequency is a normal CharField, and
                                                    <<<<<<<<<<<<<<<< last_run is a queryable_property
   ...: 
m1 2021-08-12 09:21:26
m1 2021-08-05 15:03:26
...
m1 2019-04-07 11:49:56

whereas in the GUI case, django-debug-toolbar shows about 25 extra queries for the frequency field, one for each instance of the pay run:

SELECT "viewflow_process"."id",
       "viewflow_process"."flow_class",
       "viewflow_process"."status",
       "viewflow_process"."created",
       "viewflow_process"."finished",
       "viewflow_process"."artifact_content_type_id",
       "viewflow_process"."artifact_object_id",
       "viewflow_process"."data",
       "paiyroll_payrun"."process_ptr_id",
       ("paiyroll_payrun"."snapshot" -> 'company') AS "snapshot_company",
       ("paiyroll_payrun"."snapshot" -> 'pay_run') AS "snapshot_pay_run",
       ("paiyroll_payrun"."snapshot" -> 'report_definition') AS "snapshot_report_definition",
       ("paiyroll_payrun"."snapshot" -> 'report_template') AS "snapshot_report_template",
       ("paiyroll_payrun"."snapshot" -> 'code_version') AS "snapshot_code_version"
  FROM "paiyroll_payrun"
 INNER JOIN "paiyroll_company"
    ON ("paiyroll_payrun"."company_id" = "paiyroll_company"."id")
 INNER JOIN "viewflow_process"
    ON ("paiyroll_payrun"."process_ptr_id" = "viewflow_process"."id")
 WHERE ("paiyroll_payrun"."company_id" = 173 AND "paiyroll_company"."client_id" = 52 AND "paiyroll_payrun"."actual_t" >= '2021-04-06'::date AND "paiyroll_payrun"."actual_t" <= '2022-04-05'::date)
 ORDER BY "paiyroll_payrun"."actual_t" DESC, "viewflow_process"."created" DESC
 LIMIT 25
...
... This real query has several more queryable_properties, and is generally more complex, details not shown for brevity 
...
SELECT "paiyroll_payrun"."process_ptr_id",
       "paiyroll_payrun"."frequency"
  FROM "paiyroll_payrun"
 WHERE "paiyroll_payrun"."process_ptr_id" = 24073
 LIMIT 21 25 similar queries.
...
SELECT "paiyroll_payrun"."process_ptr_id",
       "paiyroll_payrun"."frequency"
  FROM "paiyroll_payrun"
 WHERE "paiyroll_payrun"."process_ptr_id" = 24288
 LIMIT 21 25 similar queries.
W1ldPo1nter commented 3 years ago

Hey, thanks for the report.

First off: the code you've shown in the second post isn't actually that good of a use case for prefetch_queryable_properties. This utility is intended to be used whenever you are not already performing a query for the objects, but getting them from somewhere else, e.g. if you were loading your PayRun instances from a cache instead of querying them from the DB directly. That's why the debug SQL shows two queries: one for the objects themselves and one for the property values. You could do both things in a single query by simply using .select_properties(...) as described here. In your case, the queryset could be constructed like this:

PayRun.objects.filter(company__legal_name='Advanced Mechanics').defer('snapshot').select_properties('last_run')

Besides that, I'm somewhat confused due to everything seemingly working correctly in your debug shell, which makes me think that the problem may somehow be related to your view itself and unrelated to queryable properties, but I obviously can't say for certain. Are there other queryset operations involved beside the .filter(...) and the .defer(...) calls shown in the example? Are other queryable properties being queried, maybe even properties of a related model? I would like to try and reproduce the problem using the test setup.

Could you also tell me which Django version you are using?

ShaheedHaque commented 3 years ago

It's probably worth saying that I'm teetering on the edge of my comfort zone with the ORM generally and queryable_properties is somewhat dragging me over the edge, so it is quite possible I am being a noob.

I do however see your point abut the second query. Note also that I am only really using prefetch_queryable_properties() even in the case of the first query because I am retrofitting this to the existing stack of mixins and so on that generate the query. It let's me conveniently "post process" my abstract query representation to do the prefetch, rather than rework the stack with "native" support for .select_properties().

The Django version is 3.2.7.

The query is, as I hinted...complicated. I'm not sure what is relevant, but:

Roughly speaking, the query should be doing something like this:

    PayRun.objects. \
        filter(company__somestuff='xxx', a_date__gt='xxx', a_date__lte='xxx'). \ # A company, for a given date range.
        defer('snapshot'). \                                                     # Don't fetch huge JSON field.
        annotate(...a few of my own...). \                                       # Some extracts of the huge JSON field.
        annotate(...whatever magic queryable_properties does...)                 # 5 separate queryable_properties,
                                                                                 # one of which joins a Task model
                                                                                 # to the Process model.

P.S. I have been trying to debug things. I am currently in QueryablePropertiesIterable.__iter__() trying to see why the list of columns to be fetched seems to be being truncated to just the fields in Process, and dropping the extra fields of the PayRun.

ShaheedHaque commented 3 years ago

OK, the fields from PayRun were being dropped from the query because in my ignorance of the ORM's .defer() capability, one of my mixins which was dealing with the polymorphic aspect of using Viewflow's Process was explicitly killing the missing fields.

Apologies for wasting your time, and thanks for your quick reply in any event.

W1ldPo1nter commented 3 years ago

No worries, I didn't even invest that much time into it yet (wanted to play around with the inheritance + .defer() + .filter() combination later today myself). I'm glad that it's working out for you and that there's no bug here.