django-es / django-elasticsearch-dsl

This is a package that allows indexing of django models in elasticsearch with elasticsearch-dsl-py.
Other
1.02k stars 263 forks source link

Performance issue when populate millions documents #86

Open ahivert opened 6 years ago

ahivert commented 6 years ago
Problem

When we need to put a lot of documents in index, we need to use queryset_pagination meta option to paginate. Django pagination need a sorted queryset with order_by (cf doc) otherwise same pk can be present more than once and others missing (like #71).

Put order_by on queryset will make django paginator call order_by for each page. Call order_by on huge queryset (like 10 millions) will lead to a huge perfomance issue.

Temporary solution:

We can override _get_actions method (from django_elasticsearch_dsl.documents.DocType) to not use django paginator when a queryset is passed. More over because of the way a database index work, we should first fetch only pks, and then do sub request based on it.

from django.db.models.query import QuerySet

def _get_actions(self, object_list, action):
    if self._doc_type.queryset_pagination and isinstance(object_list, QuerySet):
        pks = object_list.order_by('pk').values_list('pk', flat=True)
        len_pks = len(pks)
        for start_pk_index in range(0, len_pks, self._doc_type.queryset_pagination + 1):
            end_pk_index = start_pk_index + self._doc_type.queryset_pagination
            if end_pk_index >= len_pks:
                end_pk_index = len_pks - 1
            ranged_qs = object_list.filter(pk__range=[
                pks[start_pk_index],
                pks[end_pk_index]
            ])
            for object_instance in ranged_qs:
                yield self._prepare_action(object_instance, action)
    else:
        yield from super()._get_actions(object_list, action)

Available to make the PR if needed.

peterldowns commented 6 years ago

I'd appreciate this as a PR! Even though the order_by is a performance hit by default, with a database index on the primary key it shouldn't be too bad, and I'd rather have a slow way of indexing all of my documents than no way (because otherwise I've been running out of memory.)

sabricot commented 6 years ago

@ahivert Thank you for the report and the fix. I understand the problem. I can't test myself the performance issue on really big tables, but if you have tested it, you are really welcome to make a PR :)

ahivert commented 6 years ago

PR submitted ! :) can talk about it

mjl commented 5 years ago

Note that "yield from" is py3.

I have implemented something similar for django-haystack, let me see if that can be dropped in here.