Clinical-Genomics-Lund / bonsai

Visualize and analyze resistance and outbreaks of bacterial pathogens
https://bonsai-wgs.readthedocs.io/en/latest/
GNU General Public License v3.0

Improve performance of finding and clustering samples #98

Closed mhkc closed 12 months ago

mhkc commented 1 year ago

This PR aims to improve the performance of finding similar samples and clustering multiple samples.

Closes #91 #90

How to set up and perform the tests

Documentation

  1. Have a look at the updated documentation (https://bonsai-wgs--98.org.readthedocs.build/en/98/)

Upload samples

  1. Set up a new instance of Bonsai using docker-compose
  2. Use a script to upload several samples to the Bonsai db (see examples below)

Finding similar samples

  1. Use the "find similar" button in the samples table to search for similar samples with different similarity scores and limits
  2. Open the sample view

Clustering

  1. Cluster selected samples from /groups/<group_name>
  2. Add many samples to the basket and cluster them using the different options

Expected outcome

  1. Documentation should be understandable
  2. Uploading samples should be quicker and not time out even if the database is large
  3. It should be possible to upload multiple samples at the same time
  4. Finding similar samples should not block GUI
  5. Finding similar samples should respect the similarity and limit thresholds (see minhash service log)
  6. Clustering many samples should not time out the frontend
  7. Clustering from basket should respect the selected clustering method (see allele_cluster_service and minhash_service logs)
  8. Clustering method should be visible in GrapeTree


Additional context

# upload every result file sequentially, one by one
DIR="/fs1/results_dev/jasen/saureus/analysis_result/"
ls "$DIR" | while read -r file; do
    echo "$(basename "$file")"
    bash ./scripts/upload_sample.sh --api http://mtlucmds2.lund.skane.se:8811 -u admin -p admin --input "${DIR}${file}" --group saureus
    sleep 0.2
done

# concurrent upload with GNU parallel; {} is replaced with each input file path
parallel "bash ./scripts/upload_sample.sh --api http://mtlucmds2.lund.skane.se:8811 -u admin -p admin --input {}" ::: /fs1/results_dev/jasen/saureus/analysis_result/*.json
alkc commented 1 year ago

Not sure if this is related to this PR, but I found it in this branch so I am posting it here. The user menu dropdown needs a solid background:

[screenshot]

alkc commented 12 months ago

Review checks:

alkc commented 12 months ago

sample upload

Uploading samples seems to work very well in terms of speed and concurrent uploads. I used the parallel approach to upload samples. There were no timeouts, and I did not notice any slowdowns when accessing the frontend during the upload.

alkc commented 12 months ago

minhash tree in sample view

The dendrogram is not loading for any samples (e.g. http://mtlucmds1.lund.skane.se/staging/bonsai/samples/14MT0500494) while the minhash service chews through the genome index queue.

Edit, more info (another sample):

The queue and polling work as they should. But maybe add a timeout and a related error message here to let the user know that the tree cannot be retrieved at the moment?
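
A bounded poll on the frontend side could provide such a guard. A rough stdlib-only sketch of the idea (the `get_status` callable, statuses, and timeout values are hypothetical, not Bonsai's actual job API):

```python
import time

def poll_job(get_status, job_id, interval=2.0, max_wait=120.0):
    """Poll a job-status callable until the job finishes or a deadline
    passes. Returns the final status, or "timeout" so the UI can show
    an error message instead of spinning forever."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        status = get_status(job_id)
        if status in ("finished", "failed"):
            return status
        time.sleep(interval)
    return "timeout"
```

When `poll_job` returns `"timeout"`, the view could swap the spinner for a "tree cannot be retrieved right now" message.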

[2023-11-13 10:36:21,824] INFO in samples: ref: 12MT0500043, body: limit=10 similarity=0.9 cluster=True typing_method=<TypingMethod.MINHASH: 'minhash'> cluster_method=<ClusterMethod.SINGLE: 'single'>, cluster: True
[2023-11-13 10:36:21,825] DEBUG in queue: Pushed job d11e5c75-8a70-43ba-93c6-6e4017e3fb5f into minhash
[2023-11-13 10:36:21,826] DEBUG in minhash: Submitting job, minhash_service.tasks.find_similar_and_cluster to None
INFO:     172.18.0.1:48150 - "POST /samples/12MT0500043/similar HTTP/1.1" 200 OK
INFO:     172.18.0.1:48235 - "GET /job/status/d11e5c75-8a70-43ba-93c6-6e4017e3fb5f HTTP/1.1" 200 OK
INFO:     172.18.0.1:48235 - "GET /job/status/d11e5c75-8a70-43ba-93c6-6e4017e3fb5f HTTP/1.1" 200 OK
INFO:     172.18.0.1:48235 - "GET /job/status/d11e5c75-8a70-43ba-93c6-6e4017e3fb5f HTTP/1.1" 200 OK
[... the status-polling GET request repeats ...]
alkc commented 12 months ago

find similar samples

Find similar samples does not seem to work. The request is rejected by the API and never reaches the minhash service.

All samples were uploaded through the staging API without being assigned to a group, if it matters.

API

INFO:     172.18.0.1:46100 - "POST /samples/12MT0500043/similar HTTP/1.1" 422 Unprocessable Entity

Frontend

[2023-11-13 10:28:30 +0000] [8] [DEBUG] POST /staging/bonsai/samples/12MT0500043/similar
[2023-11-13 10:28:30 +0000] [8] [DEBUG] Query API for samples similar to "12MT0500043", similarity: 0.50, limit:
[2023-11-13 10:29:13 +0000] [8] [DEBUG] POST /staging/bonsai/samples/14MT0500581/similar
[2023-11-13 10:29:13 +0000] [8] [DEBUG] Query API for samples similar to "14MT0500581", similarity: 0.50, limit:
[2023-11-13 10:29:24 +0000] [7] [DEBUG] POST /staging/bonsai/samples/14MT0500581/similar
[2023-11-13 10:29:24 +0000] [7] [DEBUG] Query API for samples similar to "14MT0500581", similarity: 0.90, limit:
alkc commented 12 months ago

clustering

The new clustering controls in the basket are really slick.

I tried to cluster ~260 samples with cgMLST and it worked (although it took a minute or so to load). MLST works fine too. The UI is not blocked while the samples are clustered. The clustering method is displayed in GrapeTree and in the cluster service log.

It is hard to tell from the logs, but judging by the trees themselves, the cluster service seems to respect the user's choice of clustering algorithm.

Minhash clustering does not work, but it is probably related to the issues in my comments above.

error when loading many samples

For fun, I sent all 2395 samples into cgMLST clustering to see what would happen. The cluster service crashed with the following error:

2023-11-13 10:48:36,965 [DEBUG] rq.queue: Starting BLMOVE operation for rq:queue:allele_cluster with timeout of 405
2023-11-13 10:49:46,246 [INFO] rq.worker: allele_cluster: allele_cluster_service.tasks.cluster(method='MSTreeV2', profile='\tSACOL0001\tSACOL0002\tSACOL0003\tSACOL0004\tSACOL0005\tSACOL0006\tSACOL0...) (8857459a-77dc-419b-b176-9f27243e8740)
2023-11-13 10:49:47,642 [ERROR] rq.worker: [Job 8857459a-77dc-419b-b176-9f27243e8740]: exception raised while executing (allele_cluster_service.tasks.cluster)
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/rq/worker.py", line 1428, in perform_job
    rv = job.perform()
         ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/rq/job.py", line 1278, in perform
    self._result = self._execute()
                   ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/rq/job.py", line 1315, in _execute
    result = self.func(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/worker/app/allele_cluster_service/tasks.py", line 27, in cluster
    newick = backend(profile=profile, method=method.value)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/worker/app/allele_cluster_service/ms_trees.py", line 1119, in backend
    profiles.append(np.array(part)[allele_cols])
                    ~~~~~~~~~~~~~~^^^^^^^^^^^^^
IndexError: index 1859 is out of bounds for axis 0 with size 1859

2023-11-13 10:49:47,653 [DEBUG] rq.queue: Starting BLMOVE operation for rq:queue:allele_cluster with timeout of 405
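
Judging from the traceback, the failure looks like a mismatch between the header-derived allele columns and a profile row with one field too few; with 2395 samples, a single ragged row would be enough. A hypothetical stdlib-only illustration of that failure mode (this is not the actual `ms_trees` code):

```python
def select_alleles(row, allele_cols):
    """Pick header-derived column indices out of one profile row,
    mirroring what np.array(part)[allele_cols] does in ms_trees.py."""
    return [row[i] for i in allele_cols]

header_cols = list(range(1860))  # header defines 1860 allele columns
short_row = ["x"] * 1859         # a row that lost its last field

try:
    select_alleles(short_row, header_cols)
except IndexError as err:
    # analogous to: "index 1859 is out of bounds for axis 0 with size 1859"
    print(err)
```

Validating that every profile row has the same number of columns as the header (and reporting the offending sample) before clustering would turn this crash into a readable error.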

metadata

The samples clustered in GrapeTree have no metadata associated with them, which was not the case prior to this update.

mhkc commented 12 months ago

Thanks for the comments.

@Jakob37, some answers to your questions.

Could JS-modules be used?

Sure, why not.

Commenting. Many in-function comments here seem unnecessary. On the other hand, function top-level comments are useful. Have you considered using JSDoc?

I was not aware of JSDoc. It could be useful to transition to, especially with the Sphinx integration.

Are you familiar with Web components? I am wondering whether it would make sense to abstract away some things here (such as the spinner), without doing a full transition to a component-based framework such as React.

Yes. I initially wanted to rely mostly on Jinja templates and keep the JS DOM manipulation to a minimum. But since the project has grown, perhaps it would be better to transition to more JS.

Many functions access DOM elements from within this code. This tends to become a nightmare to maintain if not done with care in just a few locations, propagating values by function parameters instead. But maybe this is done in a controlled way here. Just raising the concern.

Thanks for raising this. Yes, it's a bit risky and was mostly done either as a temporary fix to push a first release or to lean more on Jinja templates.

If you like typing similarly to in Python, JSDoc could be used together with the typescript engine, without needing to transition to Typescript. Could this be useful here?

Thanks for the tip. Perhaps we should transition to TS anyway, since the number of views has grown along with the reliance on JS?

I think it makes sense collecting style-constants (for both CSS and charts) in a single location, to make it easier to overview and maintain.

What do you mean by style constants? Are you thinking of SASS variables and such?

Jakob37 commented 12 months ago

If you like typing similarly to in Python, JSDoc could be used together with the typescript engine, without needing to transition to Typescript. Could this be useful here?

Thanks for the tip. Perhaps we should transition to TS anyway, since the number of views has grown along with the reliance on JS?

I like working with TS. It might be overkill for just a small amount of JS, but beyond that I think it helps a lot.

I think it makes sense collecting style-constants (for both CSS and charts) in a single location, to make it easier to overview and maintain.

What do you mean by style constants? Are you thinking of SASS variables and such?

Sorry, I was unclear and a bit confused. I am used to working with "JS components" where the style is controlled in a context where you can use regular variables. I haven't worked with SASS variables. No opinion there at the moment.

alkc commented 12 months ago

Working through the checklist from the start!

Review checks:

Code

Sample upload

Minhash

Cluster

Docs

alkc commented 12 months ago

The minhash-dependent functions like the similarity tree and minhash clustering are still blocked when the minhash service is reprocessing genome signature files (I reuploaded everything via the api) — see my older comments above for more details.

Do we care about that in this update?

alkc commented 12 months ago

I still get an error when searching for similar samples in /groups, even when the minhash service has finished processing the signature files. It seems to be the API rejecting the frontend request:

API log:

INFO:     172.18.0.1:53530 - "POST /samples/14MT0500688/similar HTTP/1.1" 422 Unprocessable Entity

Frontend log:

[2023-11-16 07:57:01 +0000] [7] [DEBUG] POST /staging/bonsai/samples/14MT0500688/similar
[2023-11-16 07:57:01 +0000] [7] [DEBUG] Query API for samples similar to "14MT0500688", similarity: 0.95, limit:

The minhash service is idle.

Edit:

The issue occurs when I forget to set a limit for returned samples in the minhash search options >_>
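
That is consistent with the 422: the API schema presumably marks `limit` as required, while the frontend log above shows an empty `limit:` being sent. A defensive sketch of a server-side fix, in stdlib-only Python with assumed field names and defaults (Bonsai's actual schema may differ):

```python
def parse_similar_query(body: dict) -> dict:
    """Parse the find-similar request body, falling back to defaults
    so an omitted or empty `limit` no longer triggers a 422.
    Field names and default values are assumptions for illustration."""
    similarity = float(body.get("similarity", 0.5))
    limit = body.get("limit")
    limit = int(limit) if limit not in (None, "") else 10
    return {"similarity": similarity, "limit": limit}
```

The same effect could be had on the frontend by always sending a default limit when the field is left blank.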

alkc commented 12 months ago

Clustering w/ minhash works! Caught an error for this or a future update, though: the minhash service will crash if a signature file, for whatever reason, is missing:

2023-11-16 08:06:45,013 [ERROR] rq.worker: [Job 8b722a89-ad3a-4957-9c1c-43e633179fe0]: exception raised while executing (minhash_service.tasks.cluster)
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/rq/worker.py", line 1428, in perform_job
    rv = job.perform()
         ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/rq/job.py", line 1278, in perform
    self._result = self._execute()
                   ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/rq/job.py", line 1315, in _execute
    result = self.func(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/worker/app/minhash_service/tasks.py", line 104, in cluster
    newick: str = cluster_signatures(sample_ids, method)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/worker/app/minhash_service/minhash/cluster.py", line 45, in cluster_signatures
    signature = read_signature(sample_id)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/worker/app/minhash_service/minhash/io.py", line 53, in read_signature
    signature_path = get_signature_path(sample_id)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/worker/app/minhash_service/minhash/io.py", line 45, in get_signature_path
    raise FileNotFoundError(f"Signature file not found, {signature_path}")
FileNotFoundError: Signature file not found, /data/signature_db/20MT0500227.sig
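
If crashing is not the desired behavior here, one option would be to filter out missing signatures up front and log each miss instead of letting the job die. A hypothetical stdlib-only sketch (function and path names are assumptions, not the actual minhash_service code):

```python
import logging
from pathlib import Path

log = logging.getLogger("minhash_service")

def split_by_signature(sample_ids, signature_dir="/data/signature_db"):
    """Return (present, missing) sample ids, logging a warning for each
    sample whose .sig file cannot be found instead of raising."""
    present, missing = [], []
    for sample_id in sample_ids:
        path = Path(signature_dir) / f"{sample_id}.sig"
        if path.is_file():
            present.append(sample_id)
        else:
            missing.append(sample_id)
            log.warning("Signature file not found, skipping: %s", path)
    return present, missing
```

The caller could then cluster only `present` and report `missing` back to the user.
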
mhkc commented 12 months ago

Clustering w/ minhash works! The minhash service will crash though if any signature file is missing:

[... traceback quoted from the comment above ...]

I know, and I think that is the expected behavior for now, as we can then log the stack trace.

mhkc commented 12 months ago

The minhash-dependent functions like the similarity tree and minhash clustering are still blocked when the minhash service is reprocessing genome signature files (I reuploaded everything via the api) — see my older comments above for more details.

Do we care about that in this update?

That is the expected outcome with the current setup. We only have one worker, and if there are many jobs they will just be added to the queue. This could be solved later on by having multiple workers pulling jobs from the queue, but I don't think we are at the point where that is necessary yet.

alkc commented 12 months ago

I've run the similar-samples test and confirmed that the GUI is not locked while searching and that the correct similarity threshold and limit reach the minhash service.

I was able to do a search for similar samples, add the matched samples to the basket, and send them off to clustering. That's a smooth workflow right there.