davidhwyllie / findNeighbour4

A server delivering large scale, incrementable, bacterial relatedness monitoring
MIT License

Restarting the findneighbour4 server is slow when large numbers of samples are present #114

Closed davidhwyllie closed 2 years ago

davidhwyllie commented 2 years ago

Restarting the findneighbour4 server is slow when large numbers of samples are present. This is due to repopulation of the catwalk component from the database. We need to identify ways to reduce this.

davidhwyllie commented 2 years ago

Options:

  1. add an option to the fn4_shutdown script not to stop catwalk. This is the easiest and best option if catwalk does not need to be shut down (for example, if no update to catwalk is needed).
  2. rapidly load catwalk over http from some kind of sequence backup file, allowing faster access than using a distant database. Timings need to be worked on. Experiments done by @dvolk suggest that there's little difference in loading samples one at a time vs. en masse over a REST API; an experimental endpoint to allow bulk loading wasn't implemented because it didn't improve load speeds.
  3. generate fasta files suitable for catwalk bulk loading. Note that such files have to be on the machine running catwalk. In benchmarks with TB data, loading one sample took about 0.007 seconds; for covid, loading 1m samples took < 2 mins.
  4. use catwalk's own mechanisms to store data and restart. At present, such data is stored in the same directory as the catwalk client (not desirable) and there is a risk of mis-synchronisation with the contents of fn4 with this option.
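
As an illustration of option 3, here is a minimal sketch of writing a multifasta file from sample identifiers and sequences. The function name `write_multifasta` and the dict input are hypothetical, not part of fn4 or catwalk, and the exact layout catwalk expects for bulk loading (for example, whether sequences may be line-wrapped) is an assumption that would need checking.

```python
import io
from typing import Dict, TextIO


def write_multifasta(sequences: Dict[str, str], handle: TextIO) -> int:
    """Write sample_id -> sequence pairs as multifasta records.

    Hypothetical helper: each record is a '>' header line followed by the
    whole sequence on a single line. Returns the number of records written.
    """
    n_written = 0
    for sample_id, seq in sequences.items():
        handle.write(f">{sample_id}\n")
        handle.write(f"{seq}\n")
        n_written += 1
    return n_written


# Usage: write two toy records to an in-memory buffer.
# In practice the handle would be a file on the machine running catwalk.
buf = io.StringIO()
count = write_multifasta({"sample1": "ACGTN", "sample2": "ACGTA"}, buf)
multifasta_text = buf.getvalue()
```

In practice, the sequences would first have to be reconstructed from whatever form they are stored in within the database, which this sketch does not attempt.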

Other options: build a periodic dump into catwalk cf. https://nim-lang.org/docs/marshal.html but this will block the server and is probably undesirable.

These options are not mutually exclusive.

Of these options:

  - 1 should be implemented.
  - 4 should probably not be implemented, because of the risk of silently loading data that is not in the fn4 database.
  - Either 2 or 3 would be satisfactory, but would still leave a multi-hour load time.

It would be possible to make the fn4 server restart rapidly for READING data (which doesn't need catwalk) and later for inserting data. However, this would need careful implementation.

davidhwyllie commented 2 years ago

Option 1 implemented. fn4_shutdown.sh now has a new optional argument, --leave_catwalk_running. If invoked with this option, fn4_shutdown.sh will not shut down catwalk. When the server is subsequently restarted by fn4_startup.sh, no new catwalk instance will be started, and the existing one will be used.

davidhwyllie commented 2 years ago

The underlying problem remains, which is slow loading of information from the database for very large numbers of samples.

davidhwyllie commented 2 years ago

After the merge of PR #124, restarting takes 30 mins per million samples. This could probably be accelerated if multifasta files were generated (which catwalk reads very fast) as opposed to loading reference-compressed sequence data. However, we will consider this closed for now.
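
For scale, the timings quoted in this thread can be converted into restart times with simple arithmetic. The sketch below assumes load time grows linearly with the number of samples, which the thread does not confirm; the helper name `load_time_hours` is hypothetical.

```python
def load_time_hours(n_samples: int, seconds_per_sample: float) -> float:
    """Estimated load time in hours, assuming linear scaling with sample count."""
    return n_samples * seconds_per_sample / 3600.0


# 0.007 seconds per sample is the figure quoted under option 3.
hours_at_0007 = load_time_hours(1_000_000, 0.007)   # about 1.94 hours per million samples

# 30 minutes per million samples is the figure reported after PR #124.
per_sample_after_pr = 30 * 60 / 1_000_000            # about 0.0018 seconds per sample
hours_after_pr = load_time_hours(1_000_000, per_sample_after_pr)  # about 0.5 hours
```

On these numbers, the post-merge restart corresponds to roughly 0.0018 seconds per sample, about a quarter of the 0.007 seconds per sample quoted earlier.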