Recommended replacement for django-mptt's add_related_count() method?

jeremystretch commented 3 years ago

Hi, we're looking at migrating from django-mptt and I was wondering if there's a recommended strategy for replacing its add_related_count() TreeManager method. We rely on this to provide a cumulative count of related objects assigned to a nested model.

For example, suppose we have Region instances representing countries, states, and cities, and we have Site instances which can be assigned to any Region. We'd call something like this to annotate a cumulative count of sites for each region:

Region.objects.add_related_count(
    Region.objects.all(),
    Site,
    'region',
    'site_count',
    cumulative=True
)

Replicating MPTT's approach directly doesn't seem to be feasible, since its querying against concrete database fields: Django's Subquery class doesn't understand tree_path generated by CTE. (It yields a FieldError: Cannot resolve keyword 'tree_path' into field.) However I'll admit this is challenging my own skill with SQL.

Has anyone come up with a similar solution using CTE?

matthiask commented 3 years ago

Hi

It's definitely possible to query these values with SQL directly. But, the CTE starts from the roots of the tree and recursively fetches the children so adding the cumulative behavior (which sums totals from the leaves to the root) isn't straightforward.

I briefly looked at the netbox repository. It seems your trees are too large to load the whole tree into RAM and sum up related entries in Python code?

I think I'd attempt solving this with a LATERAL JOIN (if on Postgres) and continue from there. On Postgres it should be relatively easy and safe since it supports arrays out of the box. However, if you have hundreds of thousands of nodes in your tree building it up for each query may be too expensive. There's always the possibility to use a materialized view, see https://schinckel.net/2016/01/19/postgres-tree-shootout-part-3%3A-adjacency-list-using-views/ With tens of thousands of nodes it may still work well according to the blogpost linked above.

Or, you could have a look at https://github.com/django-treebeard/django-treebeard/ – maybe you wouldn't even have to change much since treebeard supports nested sets, a different name for mptt.

I hope this helps :)

matthiask commented 3 years ago

Oops, wrong button.

jeremystretch commented 3 years ago

Awesome info @matthiask, thank you! We're still very much exploring options so I'm going to dig into each of your suggestions some more.

It seems your trees are too large to load the whole tree into RAM and sum up related entries in Python code?

I think that's accurate, yes. In instances where there exist many thousands of nested objects it introduces a huge performance penalty, so we need to keep everything in the database.

arthanson commented 1 year ago

In case someone needs, here is a method to do add_related_count adapted from MPTT. It seems to work in testing but hasn't had real-world usage:

    def add_related_count(
        self,
        queryset,
        rel_model,
        rel_field,
        count_attr,
        cumulative=False,
        extra_filters={},
    ):
        """
        Adds a related item count to a given ``QuerySet`` using its
        ``extra`` method, for a ``Model`` class which has a relation to
        this ``Manager``'s ``Model`` class.

        Arguments:

        ``rel_model``
           A ``Model`` class which has a relation to this `Manager``'s
           ``Model`` class.

        ``rel_field``
           The name of the field in ``rel_model`` which holds the
           relation.

        ``count_attr``
           The name of an attribute which should be added to each item in
           this ``QuerySet``, containing a count of how many instances
           of ``rel_model`` are related to it through ``rel_field``.

        ``cumulative``
           If ``True``, the count will be for each item and all of its
           descendants, otherwise it will be for each item itself.

        ``extra_filters``
           Dict with aditional parameters filtering the related queryset.
        """
        if cumulative:
            SQL = """
            SELECT count(*)
              FROM (
                    SELECT (__tree.tree_path) AS "tree_path"
                    FROM "{table1}" T3 JOIN __tree ON T3."{rel_field}_id" = __tree.tree_pk
                    WHERE ("{table2}"."id" = ANY(tree_path))
                   ) _count
            """

            params = {
                "table1": rel_model._meta.db_table,
                "table2": rel_model._meta.get_field(rel_field).remote_field.model._meta.db_table,
                "rel_field": rel_field,
            }

            return queryset.with_tree_fields().annotate(**{count_attr: RawSQL(SQL.format(**params), {})})
        else:
            current_rel_model = rel_model
            for rel_field_part in rel_field.split("__"):
                current_tree_field = current_rel_model._meta.get_field(rel_field_part)
                current_rel_model = current_tree_field.related_model
            tree_field = current_tree_field

            if isinstance(tree_field, ManyToManyField):
                field_name = "pk"
            else:
                field_name = tree_field.remote_field.field_name

            subquery_filters = {
                rel_field: OuterRef(field_name),
            }
        subquery = rel_model.objects.filter(**subquery_filters, **extra_filters).values("pk")
        return queryset.with_tree_fields().annotate(**{count_attr: SQCount(subquery)})

feincms / django-tree-queries

Recommended replacement for django-mptt's add_related_count() method? #21