Open mortonjt opened 1 year ago
Thank you for raising this! Just to clarify, is the Python side of Qurro crashing, or is it able to successfully create a visualization (which then crashes in the browser)? I assume it's the second, but please let me know if it's the first.
The JavaScript code ultimately tries to store the entire BIOM table's worth of information in memory in the browser, so datasets with tens of thousands of features will start to cause problems when loading these visualizations. Out of curiosity, how sparse is your table? The current codebase uses a few optimizations when preparing the visualization (e.g. we only bother storing non-zero counts in memory, which should help for super-sparse tables), but those will become less effective if a table is not very sparse (I don't remember if metabolomics tables are quite as sparse as most 16S / shotgun tables).
There are additional optimizations that should be implementable in the future as well (see https://github.com/biocore/qurro/labels/optimization, although not all of these are very relevant to performance), but I don't have much time to actively develop the tool nowadays :( For the time being, I think the best way to handle this issue is filtering, as you suggested: the -x / --extreme-feature-count parameter should be sufficient.
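To give a sense of what that filter does, here is a hypothetical sketch (not Qurro's actual implementation): with --extreme-feature-count K, only the K highest-ranked and K lowest-ranked features are kept for each set of feature rankings, since those extremes are usually what you inspect in a rank plot anyway:

```python
def extreme_features(ranks: dict, k: int) -> set:
    """Return the IDs of the k lowest- and k highest-ranked features.

    `ranks` maps feature ID -> rank value (e.g. a differential).
    Illustrative helper only; Qurro's real filtering code may differ.
    """
    ordered = sorted(ranks, key=ranks.get)  # ascending by rank value
    return set(ordered[:k]) | set(ordered[-k:])

ranks = {"f1": -2.0, "f2": -1.0, "f3": 0.0, "f4": 1.5, "f5": 3.0}
print(sorted(extreme_features(ranks, 2)))  # ['f1', 'f2', 'f4', 'f5']
```

With k in the low hundreds, this cuts a 40K-feature table down to something the browser can comfortably hold.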
Yes, it crashes in the browser.
Metabolomics data is not sparse, so it likely isn't able to leverage these optimizations. So perhaps filtering is the way to go in the immediate future.
I'm trying to generate reports on some metabolomics datasets with 40K dimensions, and I'm noticing that Qurro isn't able to load these massive datasets. Not sure if others have had experience with this, but I'm raising this issue to keep it in the back of our minds. Perhaps filtering is the recommended procedure for now.