Clinical-Genomics / scout

VCF visualization interface
https://clinical-genomics.github.io/scout
BSD 3-Clause "New" or "Revised" License

Variant searches hang indefinitely #4842

Closed Jakob37 closed 2 weeks ago

Jakob37 commented 2 weeks ago

Hello from Lund!

We have been hunting down mysterious performance drops in Scout that cause sustained high disk usage by Mongo, bogging things down until a restart.

Today we realized that it is probably related to the variant search. This exact search hangs (or at least seems to run indefinitely) while using a lot of disk, halting other activity on the server.

[screenshot: search_variants]

It seems it might be due to our large variant collection of 80M+ variants.

Do you think it is possible to speed up the search? Or somehow prevent / warn when a user is about to run a huge search?

Also wondering whether this is an issue for you in Stockholm when doing a simple gene search.

I guess either way we should also consider trimming down our db, as most likely not all of those variants are relevant 🤔

northwestwitch commented 2 weeks ago

Hi @Jakob37, I've just tried the same gene search on our prod server and it gets through:

[screenshot of the completed search]

We have something like 465.7M variants in our database. Could it be a missing index?

Checking in Compass, we have the following indexes in place:

[screenshot of the index list from Compass]
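
(Side note for anyone following along without Compass: a minimal pymongo sketch like the one below lists the same information. The connection URI, database name and collection name are placeholders/assumptions, not taken from Scout's config.)

```python
# Minimal sketch: list the indexes on the variant collection without Compass.
# The URI, database name and collection name are assumptions, not Scout's config.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
variant_collection = client["scout"]["variant"]

for name, spec in variant_collection.index_information().items():
    print(name, spec.get("key"), spec.get("partialFilterExpression", "-"))
```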
Jakob37 commented 2 weeks ago

OK, interesting! That is good to know. Will look over the index again.

Can I ask, roughly how long does this search take for you? Seconds or more?

northwestwitch commented 2 weeks ago

Can I ask, roughly how long does this search take for you? Seconds or more?

9 seconds! :)

dnil commented 2 weeks ago

What do your partial index criteria look like @Jakob37 - that's really the key thing here. We could not sustain searches against the whole variantS collection either; it's really about how we limit the partial index. If your db server is a bit less powerful, maybe you could try setting the rank score cutoff for searches a tad higher? Every point at that low level shaves off a huge number of variants. https://github.com/Clinical-Genomics/scout/blob/6a52248f53dbecc322dd18e9943af8cd04dacfe2/scout/constants/indexes.py#L76
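
(To make the idea concrete: the search index is a partial index, so only variants above a rank score cutoff are indexed at all, and raising that cutoff shrinks the index sharply. Below is a rough pymongo sketch of such a definition; the field names, index name and cutoff value are assumptions for illustration - the real definitions are in the indexes.py linked above.)

```python
# Illustration of a partial index gated on rank score: raising the cutoff in
# partialFilterExpression keeps low-scoring variants out of the index entirely,
# so every extra point removes a large slice of an 80M+ variant collection.
# Field names, index name and the cutoff value are assumptions for this example;
# the authoritative definitions live in scout/constants/indexes.py.
from pymongo import ASCENDING, DESCENDING, IndexModel, MongoClient

client = MongoClient("mongodb://localhost:27017")
variant_collection = client["scout"]["variant"]

gene_search_index = IndexModel(
    [("hgnc_symbols", ASCENDING), ("rank_score", DESCENDING)],
    name="hgncsymbols_rankscore_example",
    partialFilterExpression={"rank_score": {"$gt": 5}},  # try a higher cutoff here
)
variant_collection.create_indexes([gene_search_index])
```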

Jakob37 commented 2 weeks ago

OK, great tip @dnil ! We will try this out (we have the same as you at the moment)

ehre commented 2 weeks ago

@Jakob37 Just as background, I do these types of searches now and then, and in my experience it wasn't a big problem until you moved the servers yesterday. Before, I would get a result in under 30 seconds for sure.

Jakob37 commented 2 weeks ago

@Jakob37 Just as background, I do these types of searches now and then, and in my experience it wasn't a big problem until you moved the servers yesterday. Before, I would get a result in under 30 seconds for sure.

OK, thanks for the input @ehre! Hmm. I'll have to ask @mhkc whether we set up the indexes after the move.

SofieCMD commented 2 weeks ago

I have also done these searches now and then, and they take some time (minutes, I would say). Last time (before the update) it did not work, and Scout went down for all of us...

Jakob37 commented 2 weeks ago

Thanks! We will do some testing to see if we can narrow this down.

Jakob37 commented 2 weeks ago

OK, it does indeed seem to be a difference from before, and not due to hardware - v4.88 times out after 10 minutes on both computers, while with v4.81 the search completed after ~4 minutes with heavy disk load.

Checking the diff of indexes.py, it seems the partialFilterExpression has changed slightly to include "category": "snv".

Seems like the index is the likely culprit / solution. We will see.

northwestwitch commented 2 weeks ago

OK, it does indeed seem to be a difference from before, and not due to hardware - v4.88 times out after 10 minutes on both computers, while with v4.81 the search completed after ~4 minutes with heavy disk load.

Checking the diff of indexes.py, it seems the partialFilterExpression has changed slightly to include "category": "snv".

Seems like the index is the likely culprit / solution. We will see.

Ah, it must be since we introduced the search of SVs via that form! After this PR

dnil commented 2 weeks ago

Well, rather we relaxed it at that point to also include SVs. You did reindex during the upgrade to v4.86, didn't you? You could test this by limiting queries on the search page to SNVs or SVs only.
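
(One way to check this from the database side is to run explain() on a query shaped roughly like an SNV-limited gene search and see whether the winning plan uses an index scan. A minimal sketch, with filter fields guessed from this thread rather than copied from Scout's actual query:)

```python
# Sketch: ask MongoDB which plan it would pick for an SNV-limited gene search.
# The filter below only approximates the query the search page builds; field
# names and values are guesses for illustration.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
variant_collection = client["scout"]["variant"]

plan = variant_collection.find(
    {"hgnc_symbols": "POT1", "category": "snv", "rank_score": {"$gt": 5}}
).explain()

# An IXSCAN stage here means an index is used; COLLSCAN means a full scan.
print(plan["queryPlanner"]["winningPlan"])
```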

Jakob37 commented 2 weeks ago

You did reindex during the upgrade to v4.86, didn't you?

We ended up doing a big jump, 4.81 -> 4.88 (hoping to speed up future updates). So it is only now that we see the difference (and we will do the reindex).

dnil commented 2 weeks ago

Would it help you if we also noted all index changes in the blog or "breaking" document in addition to the release notes?

Do you also always update databases (genes, transcripts, diseases etc) when you do your upgrades? We attempt to keep all changes backwards compatible, but I'm sure we might forget that we e.g. didn't have a certain data source available several versions back (say ORPHA now). 😊

For Solna, we have previously tried to schedule those changes around weekends/public holidays. Nowadays it seems the Mongo engine deals decently well with balancing reindexing and production load, but if the UI expects a somewhat complete index it becomes an issue anyway - as you notice.

Jakob37 commented 2 weeks ago

Would it help you if we also noted all index changes in the blog or "breaking" document in addition to the release notes?

I think the first step for us is to look more carefully through the update notes (I had seen this one, but had forgotten it at the time of the update). We should add a step to our update checklist: check whether indices need rebuilding and, if so, rebuild them.

With that said, a note about "breaking" in the release notes here on GitHub would be helpful 😄

Do you also always update databases (genes, transcripts, diseases etc) when you do your upgrades? We attempt to keep all changes backwards compatible, but I'm sure we might forget that we e.g. didn't have a certain data source available several versions back (say ORPHA now). 😊

We have lagged on this as well, but the intent is to make sure from now on that the databases are updated whenever we update Scout.

For Solna, we have previously tried to schedule those changes around weekends/public holidays. Nowadays it seems the Mongo engine deals decently well with balancing reindexing and production load, but if the UI expects a somewhat complete index it becomes an issue anyway - as you notice.

I see! That sounds like a good strategy. We are in the unfortunate situation of running on old hardware that only supports Mongo v4. Things are moving towards getting new hardware; hope we can hang on until then 🤞 Until then we will have to try to keep reindexing outside office hours, yes.

Jakob37 commented 2 weeks ago

OK, let's close this one. It is on our plate now.

Jakob37 commented 2 weeks ago

A quick follow-up here, also to notify you @ehre. It seems that if I limit the search to SNV it actually completes in ~15 seconds on the new machine, while with no category selected it runs for a long time.

So it seems we still have the old functionality running. It is just a question of how, and whether, we can deal with the new SV search. We'll look into that as well.

ehre commented 2 weeks ago

Thanks! This is Greek to me, but if I understand correctly we perhaps need an "SV index" as well?

dnil commented 2 weeks ago

And as @Jakob37 says, as long as you check the SNV category you will be fine. [screenshot 2024-09-12 13:48] Once re-indexing is complete, without the "limit to SNVs" restriction the index had before v4.86, you should also see decent times searching both types.
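
(For reference, re-indexing by hand essentially means dropping the SNV-restricted index and creating one whose partial filter also matches SVs. A rough pymongo sketch follows; the index names and filter contents are assumptions - check scout/constants/indexes.py and prefer Scout's own index management tooling over hand-rolled changes.)

```python
# Rough illustration of swapping an SNV-only partial index for a relaxed one.
# Index names and filter contents are assumptions; check scout/constants/indexes.py
# and prefer Scout's own index management tooling over hand-rolled changes.
from pymongo import ASCENDING, DESCENDING, IndexModel, MongoClient

client = MongoClient("mongodb://localhost:27017")
variant_collection = client["scout"]["variant"]

# Drop the old index that was limited to SNVs (name assumed for the example).
variant_collection.drop_index("hgncsymbols_rankscore_snv_example")

# Recreate it with a partial filter that also covers SVs.
relaxed_index = IndexModel(
    [("hgnc_symbols", ASCENDING), ("rank_score", DESCENDING)],
    name="hgncsymbols_rankscore_example",
    partialFilterExpression={
        "rank_score": {"$gt": 5},
        "category": {"$in": ["snv", "sv"]},
    },
)
variant_collection.create_indexes([relaxed_index])
```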

Jakob37 commented 2 weeks ago

We updated the index overnight and, just as @dnil said it would, now SV searches work as well.