Advance clustering on large datasets

mshukla1 commented 7 years ago

...takes too long to complete and fails.

Put a limit on the input for clustering and show a clear message to the user.
See if we can deploy clustering on a better machine.
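
A minimal sketch of the input limit suggested above, in Python for illustration only. The threshold `MAX_CELLS` and the function name `check_clustering_input` are hypothetical and do not come from the PATRIC codebase; a real limit would have to be chosen after measuring what the service can handle.

```python
from typing import Optional

# Hypothetical limit on matrix size (rows x columns); not a measured value.
MAX_CELLS = 1_000_000


def check_clustering_input(n_rows: int, n_cols: int,
                           max_cells: int = MAX_CELLS) -> Optional[str]:
    """Return a user-facing error message if the input is too large, else None."""
    cells = n_rows * n_cols
    if cells > max_cells:
        return (
            f"Clustering is limited to {max_cells:,} cells; "
            f"your selection has {cells:,} "
            f"({n_rows:,} rows x {n_cols:,} columns). "
            "Please reduce the selection and try again."
        )
    return None
```

Running the check before any data is sent means an oversized request is rejected immediately with an explanation, rather than failing after a long wait.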

olsonanl commented 7 years ago

First we need to profile the current clustering and see where the time goes.
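
As a rough illustration of the profiling step, the sketch below uses Python's standard `cProfile` and `pstats` modules. `run_clustering` is a hypothetical stand-in (a toy pairwise-distance loop), not the actual clustering program used by the service; the point is only to show how a report of where the time goes could be collected.

```python
import cProfile
import io
import pstats


def run_clustering(matrix):
    """Hypothetical stand-in for the real clustering call: toy O(n^2) pairwise work."""
    n = len(matrix)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # Manhattan distance between rows i and j
            total += sum(abs(a - b) for a, b in zip(matrix[i], matrix[j]))
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Show the 10 entries with the largest cumulative time
    stats.sort_stats("cumulative").print_stats(10)
    return result, out.getvalue()


if __name__ == "__main__":
    data = [[(i * j) % 7 for j in range(50)] for i in range(200)]
    _, report = profile_call(run_clustering, data)
    print(report)
```

A profile like this covers only the computation. Transfer time to and from the API would need to be measured separately, for example with timestamps around the request.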

hyoo commented 7 years ago

Here is a list of issues related to this problem:

The dataset (genome-occurrence counts) is sent to the data API. How long this takes depends on your network speed, and the request can time out (the time includes data transfer, computation, and transfer back).
The data is stored in a file and the clustering is run (single-CPU version). The computational complexity varies depending on your options. Parallelizing the algorithm may help reduce compute time. With multiple requests, the CPU could be saturated, but we recently moved the data API to chestnut, which has more CPUs.

The primary problem is that we need to send the data back to the server to run the clustering. If we could run it on the browser side (for example with WebAssembly or a BLAS library), that would solve the problem.

If client-side computation is not feasible, we can consider:

compressing the data when the browser sends it to the API
using a machine that has a faster CPU
configuring a longer timeout for this specific service
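
To give a feel for the compression option, here is a small Python sketch that gzips a JSON payload and restores it. The payload layout (`rows`, `cols`, `counts`) is made up for illustration and is not the actual format sent to the data API. In the browser the same idea would need a JavaScript compression step, and the server would have to be configured to accept compressed request bodies; both are assumptions here.

```python
import gzip
import json


def compress_payload(payload: dict) -> bytes:
    """Serialize to compact JSON and gzip the result."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)


def decompress_payload(body: bytes) -> dict:
    """Inverse of compress_payload, as the receiving side would run it."""
    return json.loads(gzip.decompress(body).decode("utf-8"))


def example_payload(n_rows: int = 200, n_cols: int = 50) -> dict:
    """Build a hypothetical count matrix with small, repetitive integer values."""
    return {
        "rows": [f"row_{i}" for i in range(n_rows)],
        "cols": [f"genome_{j}" for j in range(n_cols)],
        "counts": [[(i * j) % 3 for j in range(n_cols)] for i in range(n_rows)],
    }


if __name__ == "__main__":
    payload = example_payload()
    raw_size = len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))
    packed = compress_payload(payload)
    print(f"raw: {raw_size} bytes, gzipped: {len(packed)} bytes")
```

Count matrices tend to contain many repeated small values, so they usually compress well. The actual saving depends on the data and should be measured on real requests.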

PATRIC3 / patric3_website

Advance clustering on large datasets #1389