dhimmel opened 8 years ago
Questions from the group at Tuesday night discussion: Do you anticipate complete randomness in the subselection (i.e. totally user selected), or is there some structure that governs what might be asked for? IOW, is chunking an option?
Perhaps a cached or database format might be more appropriate? Or a microservice? An advantage of a microservice would be the ability to scale in response to demand.
> Do you anticipate complete randomness in the subselection (i.e. totally user selected)?
Yes, we should be prepared to serve any combination of rows.
> Perhaps a cached or database format might be more appropriate? Or microservice? An advantage of a microservice would be the ability to respond to demand.
I like solutions that don't require any running services. Life is so much easier when all you need is a single file. Another option is feather, which is a binary format for storing dataframes. While it doesn't support indexed reading (reading only a subset of the overall dataset), it's supposedly very quick.
Currently, it's not too slow to read the full files, so this may be premature optimization... we could stick with TSV until it becomes a bottleneck?
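If we do stay on TSV for now, pandas can at least skip unneeded columns at parse time (rows still require scanning the whole file). A small sketch with a made-up file:

```python
import pandas as pd

# Write a tiny compressed TSV; pandas infers bz2 from the extension
df = pd.DataFrame({
    "gene_id": [1, 2, 3],
    "sample_a": [0.1, 0.2, 0.3],
    "sample_b": [1.0, 2.0, 3.0],
})
df.to_csv("expression.tsv.bz2", sep="\t", index=False)

# usecols avoids parsing and allocating the dropped columns
subset = pd.read_csv("expression.tsv.bz2", sep="\t", usecols=["gene_id", "sample_a"])
```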
Tagging @stephenshank and @mike19106, who I think were both interested in this topic.
We may be running a single job per worker instance at a time, with multiple jobs running concurrently via multiple instances. I like this approach mostly because isolated jobs are less likely to interfere with each other.
What makes that relevant to this discussion and https://github.com/cognoma/cognoma/issues/17 is that we can dedicate a decent amount of memory per job, so in-memory caching becomes more feasible.
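With dedicated per-job memory, even a simple in-process cache goes a long way. As a sketch (the loader name and TSV layout are hypothetical, not a settled design), a memoized reader via `functools.lru_cache`:

```python
from functools import lru_cache

import pandas as pd

@lru_cache(maxsize=4)  # keep a few recently used datasets in memory
def load_dataset(path):
    """Read a TSV matrix once per worker process; repeat calls hit the cache."""
    return pd.read_csv(path, sep="\t", index_col=0)
```

Repeat calls with the same path return the same cached DataFrame object, so callers shouldn't mutate it in place.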
Currently, we're storing our datasets (which are matrices) as compressed TSVs, which are great for long-term interoperable storage. However, we'd love a way to look up specific rows and columns without having to read the entire dataset. We began discussing options at https://github.com/cognoma/cognoma/issues/17#issuecomment-233149110. We want a persistent storage format (i.e. file) that allows reading only specified rows and columns into a numpy array/matrix or a pandas dataframe.
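One service-free option that meets the "specified rows and columns into numpy" requirement is a memory-mapped `.npy` file. A minimal sketch (file name and toy matrix are illustrative):

```python
import numpy as np

# One-time conversion: dense matrix saved in numpy's binary format
matrix = np.arange(20.0).reshape(4, 5)
np.save("dataset.npy", matrix)

# mmap_mode="r" maps the file instead of reading it eagerly; only the
# pages backing the requested rows/columns are actually pulled from disk
mm = np.load("dataset.npy", mmap_mode="r")
rows, cols = [0, 2], [1, 3]
subset = np.asarray(mm[np.ix_(rows, cols)])  # materialize just the subset
```

This loses the compression of bzipped TSVs, so it would be a cache derived from them rather than a replacement.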
A primary benchmark for judging implementations is how much time you save over reading the entire bzipped TSV into Python via pandas, across a variety of setups.
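A benchmark along those lines could be as simple as timing the baseline and comparing it against each candidate format (function name and file layout here are illustrative):

```python
import time

import pandas as pd

def time_baseline(path, row_ids, col_ids):
    """Time the naive approach: read the whole bzipped TSV, then subset."""
    start = time.perf_counter()
    df = pd.read_csv(path, sep="\t", index_col=0, compression="bz2")
    subset = df.loc[row_ids, col_ids]
    elapsed = time.perf_counter() - start
    return subset, elapsed
```

Each candidate (feather, HDF5, memmapped numpy, ...) would get an analogous timing function for the same row/column subsets.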