biocore / microsetta-public-api

A public microservice to support The Microsetta Initiative
BSD 3-Clause "New" or "Revised" License

Cache beta #103

Closed wasade closed 3 years ago

wasade commented 3 years ago

Fixes #102

wasade commented 3 years ago

Thanks!! So we'll still have long load times, but resident memory will be small, so we can spin up many servers to handle requests.

eventually we should do this directly from the hdf5 distance matrices which should greatly improve start up time, but that is more complex

gwarmstrong commented 3 years ago

How complex would it be to add a k_neighbors step to the microsetta processing pipeline and save the resulting sparse distance matrix as a biom table? Loading biom is already supported by the public API and converting to something like the DataFrame in this PR should not be too hard.
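For concreteness, a k_neighbors step could look roughly like the sketch below. It is only an illustration in plain Python: the function name `k_nearest_neighbors`, the toy sample IDs, and the dense list-of-lists distance matrix are all assumptions, not code from this PR or from microsetta-processing.

```python
def k_nearest_neighbors(sample_ids, distances, k):
    """Return {sample_id: [ids of the k closest other samples]}.

    `distances` is assumed to be a square, symmetric matrix (list of lists)
    ordered the same way as `sample_ids`.
    """
    neighbors = {}
    for i, sample_id in enumerate(sample_ids):
        # Pair every other sample with its distance to sample i
        others = [(distances[i][j], other)
                  for j, other in enumerate(sample_ids) if j != i]
        # Sort by distance (ties broken by id) and keep the k closest
        others.sort()
        neighbors[sample_id] = [other for _, other in others[:k]]
    return neighbors


# Toy data, purely illustrative
sample_ids = ['s1', 's2', 's3', 's4']
distances = [
    [0.0, 0.1, 0.5, 0.9],
    [0.1, 0.0, 0.4, 0.8],
    [0.5, 0.4, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]
neighbors = k_nearest_neighbors(sample_ids, distances, k=2)
```

Only the neighbor IDs are kept here; the distances themselves are dropped, which is what keeps the result small.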

wasade commented 3 years ago

Great idea. It can't be biom though as the table contains str not int or float.

Could you knock out the config pieces for this? I can produce a tabular file suitable for pd.read_csv('foo', sep='\t', dtype=str).set_index('sample-id') where the columns are like what's here. I'll migrate this neighbor code here into microsetta-processing, PR in ~15 I think

There are multiple benefits: load time, resident memory, and we can more easily control which set of samples the neighbors are relative to (e.g., TMI sample, with neighbors assessed via non-human vertebrates)
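A minimal sketch of how such a tabular file might be loaded, assuming one row per sample and one column per neighbor rank. The column names `k0` and `k1` and the sample IDs are made up for illustration; the real layout is whatever the processing pipeline emits.

```python
import io

import pandas as pd

# Stand-in for the tab-separated file on disk (hypothetical contents)
tsv = (
    "sample-id\tk0\tk1\n"
    "001\t002\t003\n"
    "002\t001\t003\n"
    "003\t002\t001\n"
)

# dtype=str keeps IDs such as '001' from being parsed as the integer 1
neighbors = pd.read_csv(io.StringIO(tsv), sep='\t', dtype=str)
neighbors = neighbors.set_index('sample-id')

# Look up the neighbor IDs for one sample, nearest first
closest_to_001 = neighbors.loc['001'].tolist()
```

Reading everything as `str` matters because sample IDs that look numeric would otherwise lose leading zeros or be converted to floats.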

gwarmstrong commented 3 years ago

> It can't be biom though as the table contains str not int or float.

so the idea was something approximately like

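# in processing
from biom import Table
# rows (observations) and columns (samples) are both sample IDs
table = Table(sparse_distance_matrix, sample_ids, sample_ids)

# in public api
from biom import load_table
sparse_dm = load_table('./path/to/table.biom')
cached_ids = create_pandas_cache(sparse_dm)  # hypothetical helper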

but the tabular file could work too

wasade commented 3 years ago

Huh, hadn't thought of it like that. I don't think we have a use case right now for the actual distances?

gwarmstrong commented 3 years ago

Not as far as I’m aware

wasade commented 3 years ago

Okay, code is moved to processing. I'll kick stuff off to run over the next 24h or so, so we have new files / configs

wasade commented 3 years ago

@gwarmstrong I think this is all hooked up now to the cached datasource

wasade commented 3 years ago

Thanks @gwarmstrong! I'm testing this branch right now and there is still an unexplained massive memory use, so gathering a little more information right now. I'll circle back to the comments as soon as I can

wasade commented 3 years ago

+1