etianen / django-watson

Full-text multi-table search application for Django. Easy to install and use, with good performance.
BSD 3-Clause "New" or "Revised" License
Added --slice-queryset argument #284

Open iJohnMaged opened 3 years ago

iJohnMaged commented 3 years ago

Came across your library and wanted to integrate it for a client, great work!

However, when I deployed it against a relatively large database (2M rows, a big model with lots of text data), the process was always killed on PythonAnywhere after using all the available CPU and RAM, without creating a single index entry in watson_searchentry.

So I tinkered a bit and found that .iterator() is the issue in my case (limited resources, and a MySQL database): buildwatson never gets to create any index entries. I eventually changed the code to slice the queryset instead of calling .iterator(), and it got through.

I added an argument to buildwatson called --slice-queryset that slices the queryset instead of iterating over it, in case that helps others in similar situations.

etianen commented 3 years ago

Thanks for this, and for the tests.

I'm really surprised this works! Performing a count on a large dataset can be very slow, and chunking through a dataset gets slower each chunk due to how databases perform offsetting.

But, pragmatically, for databases that don't support server side cursors, it's better than nothing.

I would suggest the following changes:

  1. Lose the count. Just keep slicing until you get no results.
  2. Rather than using offset, order the queryset by pk (asc), and after each batch, filter the next batch by pk__gt=final_pk_of_prev_batch

(2) is really important. It means this will scale to larger datasets, and it means, for auto-increment pks, no models will be missed if the data is being edited.
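
A minimal sketch of suggestions (1) and (2), assuming a Django queryset whose model has an orderable primary key. The helper name `iter_in_pk_batches` and the `batch_size` default are hypothetical and not part of django-watson:

```python
def iter_in_pk_batches(queryset, batch_size=1000):
    """Yield objects in ascending pk order, one batch at a time.

    Hypothetical helper: no count() and no OFFSET. Each batch is
    selected with pk > (last pk of the previous batch), and the loop
    stops as soon as a batch comes back empty.
    """
    ordered = queryset.order_by("pk")
    last_pk = None
    while True:
        # First batch is unfiltered; later batches start after last_pk.
        batch_qs = ordered if last_pk is None else ordered.filter(pk__gt=last_pk)
        batch = list(batch_qs[:batch_size])  # evaluates one LIMIT query
        if not batch:
            break
        yield from batch
        last_pk = batch[-1].pk
```

Because each query filters on the primary key rather than skipping rows with an offset, the cost per batch should stay roughly constant as the loop advances, provided the pk column is indexed (which it normally is).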

iJohnMaged commented 3 years ago

I'll fix the build check errors and implement those changes. You're completely right about them, and I ended up implementing the same approach in the search view anyway!

danihodovic commented 2 years ago

Running into a similar issue on 100k rows :/ Any chance of merging this John?

etianen commented 2 years ago

I can't merge this until the suggested changes are made and the broken builds are fixed. I'm happy to consider another PR, or updates to this one.

danihodovic commented 2 years ago

@etianen Did you publicize the library mentioned in https://github.com/etianen/django-watson/issues/26#issuecomment-26192741 :) ?

etianen commented 2 years ago

Nope, sorry! And it's likely lost to time now.
