etianen / django-watson

Full-text multi-table search application for Django. Easy to install and use, with good performance.
BSD 3-Clause "New" or "Revised" License

Support async updating of search index #269

Open · valentijnscholten opened this issue 4 years ago

valentijnscholten commented 4 years ago

I'm using watson in a Django app where one of the most important features is importing files and turning them into database rows, i.e. Django ORM model instances.

Using bulk_create with Django is problematic, especially in combination with MySQL, because the ids of the created objects are unknown afterwards. So I am thinking about other ways to make the import faster, and one of them would be to make the watson search index updates asynchronous. A complication is that some model instances are updated (saved) multiple times within one transaction, triggering multiple watson index updates.
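For context, a minimal illustration of the bulk_create limitation (Finding is just a hypothetical example model):

```python
from myapp.models import Finding  # hypothetical model registered with watson

# bulk_create() skips post_save signals entirely, so watson never sees these
# rows, and on MySQL the returned objects don't get their primary keys
# populated either, which makes a targeted follow-up re-index awkward.
findings = Finding.objects.bulk_create(
    [Finding(title=f"Imported finding {i}") for i in range(1000)]
)
print(findings[0].pk)  # None on MySQL
```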

My thoughts so far:

- Make the post_save signal optional and allow the Django app itself to update the index in the best way possible, e.g. via a Celery task my app already uses. This would need a (documented/supported) way to update the index for one or more model instances. It would support deduplication of updates and could be made asynchronous (rough sketch of what I mean at the end of this comment).
- Then I found the (undocumented?) SearchContextMiddleware, which already seems to deduplicate model updates within the same request and batch the index updates together at the end of the request. This achieves deduplication, but is not yet asynchronous.

What possible solutions could be implemented?

Could django-watson grow some support for this scenario? Or would it make more sense for a Django app to just subclass the middleware and wrap the search_context_manager.end() call in a Celery task?

Just thinking out loud here and maybe helping others trying to achieve the same.
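To make the first idea above more concrete, here is roughly what I have in mind. This is only a sketch: the Finding model and task name are made up, and I'm assuming the default search engine exposes something like update_obj_index() for refreshing a single instance, which may not be the actual API.

```python
# tasks.py -- sketch only: Finding is a made-up model, and update_obj_index()
# is an assumption about a per-object API, not a confirmed call.
from celery import shared_task
from django.apps import apps

from watson import search as watson


@shared_task
def reindex_instances(app_label, model_name, pks):
    """Refresh the watson index entries for the given primary keys."""
    # Only primary keys cross the process boundary; the worker re-fetches the
    # rows and refreshes their search entries one by one.
    model = apps.get_model(app_label, model_name)
    engine = watson.default_search_engine  # assumption: module-level default engine
    for obj in model.objects.filter(pk__in=pks):
        engine.update_obj_index(obj)  # assumption: per-object index update
```

The import code would then call something like reindex_instances.delay("myapp", "Finding", pks) once the new primary keys are known; deduplication falls out naturally because each pk only needs to be re-indexed once.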

etianen commented 4 years ago

It's an interesting idea. However, async updates feel a bit niche, and there are so many possible task frameworks to choose from that it feels like built-in support would be little-used.

I wonder if there's much of a performance advantage to performing async index updates. Given that it's all in the same database, batching everything into the same transaction using SearchContextMiddleware is going to be pretty close to optimal for most cases. If you do need async updates, it's probably better to save the primary models AND the watson models together in the background task.
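As a rough sketch of that last suggestion (the task, importer, and model names are made up, and I'm assuming search_context_manager.update_index() can be used as a context manager to batch and deduplicate the index writes the way the middleware does; treat that as an assumption rather than a documented API):

```python
# tasks.py -- sketch: do the import AND the watson index writes in the worker.
from celery import shared_task
from django.db import transaction

from watson.search import search_context_manager

from myapp.importers import parse_file  # hypothetical parser
from myapp.models import Finding        # hypothetical model registered with watson


@shared_task
def import_file(path):
    # The primary rows and the search index rows are written together, in one
    # transaction, inside the background worker; the web request just enqueues.
    with transaction.atomic():
        # Assumption: update_index() opens a search context like the middleware
        # does, so repeated saves of the same object are deduplicated and the
        # index is flushed once at the end of the block.
        with search_context_manager.update_index():
            for row in parse_file(path):
                Finding.objects.update_or_create(
                    external_id=row["id"],
                    defaults={"title": row["title"]},
                )
```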
