django-mptt / django-mptt

Utilities for implementing a modified pre-order traversal tree in django.
https://django-mptt.readthedocs.io/

Insert performance and working with large datasets (millions of rows)? #665

Open paramono opened 6 years ago

paramono commented 6 years ago

I've been using django-mptt for a while, and on a smaller dataset everything seems fine.

Recently I tried to use django-mptt for a dataset containing almost 12 million rows, and I noticed that creating/updating model instances is extremely slow. On my machine running PostgreSQL, I measured the average insert speed with and without django-mptt.

Model inheriting from MPTTModel: 1-2 creates/sec

Same model, without MPTTModel: 400-450 creates/sec

With this kind of performance, it will take days to fill the database with 1 million+ rows. I am well aware that MPTT is a lot more involved when it comes to inserts, so slower performance is expected here. But I would like to know:

Just for info, the dataset I am talking about is Geonames locations: http://download.geonames.org/export/dump/

From my experience, the hierarchy there never goes deeper than 9 levels, [0-8] inclusive range. I found some answers on StackOverflow regarding performance, for example this one: https://stackoverflow.com/questions/34496878/django-mptt-postgres-update-query-runs-slowly

But I wanted to know whether "it simply doesn't work" is the final answer here. If django-mptt users have comments on the topic, it would be helpful to update the documentation so that new users are better aware of when to use or not to use MPTT in their projects.

mikekeda commented 6 years ago

For a large dataset you could try a graph database, for example neo4j. Check this module for integration with Django: https://github.com/neo4j-contrib/django-neomodel

matthiask commented 6 years ago

You could disable mptt updates using the disable_mptt_updates context manager, and fill in the left/right values yourself, or run a Node.objects.rebuild() after inserting all those nodes. See http://django-mptt.readthedocs.io/en/latest/mptt.managers.html#mptt.managers.TreeManager.delay_mptt_updates and http://django-mptt.readthedocs.io/en/latest/mptt.managers.html#mptt.managers.TreeManager.disable_mptt_updates

You might also be interested in some work related to bulk loading of nodes: https://github.com/django-mptt/django-mptt/pull/575
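To illustrate the "fill in the left/right values yourself" idea, here is a minimal pure-Python sketch. It is not django-mptt's code: the function names `rebuild_nested_set` and `rows_touched_by_append` and the `(tree_id, lft, rght, level)` tuple layout are made up for this example. It assumes each row only knows its parent, computes all values in one traversal, and, for comparison, counts how many existing rows a single incremental append would have to renumber.

```python
from collections import defaultdict


def rebuild_nested_set(parent_of):
    """Return {node: (tree_id, lft, rght, level)} computed in one pass.

    parent_of maps each node to its parent (None for a root node).
    """
    children = defaultdict(list)
    roots = []
    for node, parent in parent_of.items():
        if parent is None:
            roots.append(node)
        else:
            children[parent].append(node)

    result = {}
    for tree_id, root in enumerate(roots, start=1):
        counter = 1          # lft/rght numbering restarts for every tree
        lft = {}
        # Iterative depth-first walk; the flag marks "children already visited".
        stack = [(root, 0, False)]
        while stack:
            node, level, children_done = stack.pop()
            if children_done:
                result[node] = (tree_id, lft[node], counter, level)
                counter += 1
                continue
            lft[node] = counter
            counter += 1
            stack.append((node, level, True))
            for child in reversed(children[node]):
                stack.append((child, level + 1, False))
    return result


def rows_touched_by_append(bounds, parent):
    """Count existing rows whose lft/rght change when a last child is added."""
    tree_id, _, parent_rght, _ = bounds[parent]
    return sum(
        1
        for (tid, _, rght, _) in bounds.values()
        if tid == tree_id and rght >= parent_rght
    )


locations = {
    "earth": None,
    "europe": "earth",
    "france": "europe",
    "spain": "europe",
    "asia": "earth",
}
bounds = rebuild_nested_set(locations)
```

In this toy tree, appending a child under `"france"` would renumber every existing row, while the one-pass rebuild visits each row once no matter how many rows were inserted beforehand. That is the trade-off behind inserting with updates disabled and rebuilding afterwards.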

eXamadeus commented 5 years ago

I know this isn't quite related, but I just put in a PR #677 that I believe might actually address your concerns.

The gist of the PR is to disable "root node ordering" which is an arbitrarily enforced constraint so that tree queries come back in a prescribed order. I understand the reasoning, but this constraint results in extreme performance degradation at higher volumes of data.

This PR requires nodes to use UUIDs for tree_ids instead of integers. This has two benefits:

  1. Performance is greatly boosted for root node manipulation with large datasets.
  2. Concurrency is now supported.
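The two benefits listed above can be pictured with a toy model. This is only a sketch of the reasoning, not code from the PR or from django-mptt, and all function names are hypothetical. It assumes that with ordered integer tree_ids, inserting a root at a given position means renumbering every tree that sorts after it, and that a new integer id is derived from the current maximum.

```python
import uuid


def insert_root_with_integer_ids(tree_sizes, position):
    """tree_sizes maps an integer tree_id to the number of rows in that tree.

    Returns (new_mapping, rows_rewritten) after inserting a one-node tree
    at `position` while keeping the integer ids in display order.
    """
    renumbered = {}
    rows_rewritten = 0
    for tree_id, size in tree_sizes.items():
        if tree_id >= position:
            renumbered[tree_id + 1] = size   # every row of this tree changes
            rows_rewritten += size
        else:
            renumbered[tree_id] = size
    renumbered[position] = 1
    return renumbered, rows_rewritten


def insert_root_with_uuid_ids(tree_sizes):
    """With unordered UUID ids, existing trees are left untouched."""
    updated = dict(tree_sizes)
    updated[uuid.uuid4()] = 1
    return updated, 0


def next_integer_tree_id(snapshot):
    """Assumed 'max + 1' allocation, evaluated against a snapshot of ids."""
    return max(snapshot, default=0) + 1


trees = {1: 1000, 2: 2000, 3: 3000}
after_int, rewritten_int = insert_root_with_integer_ids(trees, 2)
after_uuid, rewritten_uuid = insert_root_with_uuid_ids(trees)

# Two writers that read the same snapshot would pick the same integer id.
writer_a = next_integer_tree_id(trees)
writer_b = next_integer_tree_id(trees)
```

Under these assumptions the integer variant rewrites 5000 rows to add one root and both writers pick tree_id 4, while the UUID variant rewrites nothing and two independently generated UUIDs would not collide in practice.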

thomas545 commented 4 years ago

How do I use these methods (.disable_mptt_updates(), .delay_mptt_updates()) with Django admin create/update?