wasade closed 3 years ago
Thanks!! So we'll still have long load times, but resident memory will be small, so we can spin up many servers to handle requests.
Eventually we should do this directly from the HDF5 distance matrices, which should greatly improve startup time, but that is more complex.
How complex would it be to add a k_neighbors step to the microsetta processing pipeline and save the resulting sparse distance matrix as a biom table? Loading biom is already supported by the public API and converting to something like the DataFrame in this PR should not be too hard.
Great idea. It can't be biom though as the table contains str not int or float.
Could you knock out the config pieces for this? I can produce a tabular file suitable for pd.read_csv('foo', sep='\t', dtype=str).set_index('sample-id'), where the columns are like what's here. I'll migrate the neighbor code here into microsetta-processing; PR in ~15, I think.
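For reference, a minimal sketch of what such a tabular neighbor file could look like and how it would load. The function name `k_neighbors_table`, the column names `k0`, `k1`, and the toy distances are assumptions made for illustration, not the code in this PR or in microsetta-processing.

```python
import io

import numpy as np
import pandas as pd


def k_neighbors_table(distances, sample_ids, k):
    """Hypothetical helper: map each sample to the IDs of its k nearest samples."""
    distances = np.asarray(distances, dtype=float).copy()
    # A sample should not be reported as its own neighbor.
    np.fill_diagonal(distances, np.inf)
    # Column indices of the k smallest distances in each row, nearest first.
    order = np.argsort(distances, axis=1, kind="stable")[:, :k]
    ids = np.asarray(sample_ids, dtype=object)
    columns = [f"k{i}" for i in range(k)]  # assumed column naming
    index = pd.Index(sample_ids, name="sample-id")
    return pd.DataFrame(ids[order], index=index, columns=columns)


# Toy symmetric distance matrix (made-up values).
sample_ids = ["s1", "s2", "s3", "s4"]
distances = [
    [0.0, 0.1, 0.5, 0.9],
    [0.1, 0.0, 0.3, 0.8],
    [0.5, 0.3, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
neighbors = k_neighbors_table(distances, sample_ids, k=2)

# Round-trip through a tab-separated buffer, standing in for the file on disk.
buffer = io.StringIO()
neighbors.to_csv(buffer, sep="\t")
buffer.seek(0)
loaded = pd.read_csv(buffer, sep="\t", dtype=str).set_index("sample-id")
```

Because the file holds only sample IDs and no distances, every cell is a string, which is why `dtype=str` is passed on load.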
There are multiple benefits: load time, resident memory, and we can more easily control which set of samples the neighbors are relative to (e.g., a TMI sample, with neighbors assessed via non-human vertebrates)
It can't be biom though as the table contains str not int or float.
so the idea was something approximately like
# in processing
from biom import Table
table = Table(sparse_distance_matrix, sample_ids, sample_ids)
# in public api
sparse_dm = read_biom('./path/to/table.biom')
cached_ids = create_pandas_cache(sparse_dm)
but the tabular file could work too
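A rough sketch of the sparse-matrix route, to make the idea concrete. It assumes a `scipy.sparse` matrix stands in for the data of the biom table, with only each sample's k nearest distances stored; `create_pandas_cache` is the hypothetical helper named above, and its signature and return type here are guesses rather than an existing API.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def create_pandas_cache(sparse_dm, sample_ids):
    """Hypothetical helper: list each sample's stored neighbors, nearest first."""
    sparse_dm = csr_matrix(sparse_dm)
    ids = np.asarray(sample_ids, dtype=object)
    cache = {}
    for i, sample_id in enumerate(sample_ids):
        # Stored (column, distance) entries for row i of the CSR matrix.
        start, end = sparse_dm.indptr[i], sparse_dm.indptr[i + 1]
        cols = sparse_dm.indices[start:end]
        dists = sparse_dm.data[start:end]
        # Sort the stored neighbors by distance, then keep only their IDs.
        nearest_first = cols[np.argsort(dists, kind="stable")]
        cache[sample_id] = list(ids[nearest_first])
    return pd.Series(cache, name="neighbors")


# Toy sparse distance matrix: two stored neighbors per sample (made-up values).
sample_ids = ["s1", "s2", "s3", "s4"]
data = [0.1, 0.5, 0.1, 0.3, 0.2, 0.3, 0.2, 0.8]
rows = [0, 0, 1, 1, 2, 2, 3, 3]
cols = [1, 2, 0, 2, 3, 1, 2, 1]
sparse_dm = csr_matrix((data, (rows, cols)), shape=(4, 4))

cached_ids = create_pandas_cache(sparse_dm, sample_ids)
```

In this variant the stored matrix is numeric, which is what biom expects, and the string IDs only appear once the cache is built.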
Huh, hadn't thought of it like that. I don't think we have a use case right now for the actual distances?
Not as far as I’m aware
Okay, code is moved to processing. I'll kick stuff off to run over the next 24h or so, so we have new files / configs
@gwarmstrong i think this is all hooked up now to the cached datasource
Thanks @gwarmstrong! I'm testing this branch right now and there is still an unexplained massive memory use, so I'm gathering a little more information. I'll circle back to the comments as soon as I can
+1
Fixes #102