cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Persistent storage of matrices that enables quick indexed lookup #9

Open dhimmel opened 8 years ago

dhimmel commented 8 years ago

Currently, we're storing our datasets (which are matrices) as compressed TSVs, which are great for long-term interoperable storage. However, we'd love a way to look up specific rows and columns without having to read the entire dataset. We began discussing options at https://github.com/cognoma/cognoma/issues/17#issuecomment-233149110. We want a persistent storage format (i.e. a file) that allows reading only specified rows and columns into a NumPy array/matrix or a pandas DataFrame.
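One possible shape for such a file, as a minimal sketch rather than a settled choice: store the numeric values as an uncompressed NumPy `.npy` file and open it with `np.load(..., mmap_mode="r")`, so that only the pages backing the requested cells need to be read from disk. The function names and the toy matrix are hypothetical, rows and columns are addressed by integer position, and mapping row and column labels to positions would need a separate lookup that is not shown.

```python
import os
import tempfile

import numpy as np


def save_matrix(path, matrix):
    """Write the matrix to a single uncompressed .npy file."""
    np.save(path, np.asarray(matrix))


def read_submatrix(path, rows, cols):
    """Read only the given row/column positions into an in-memory array."""
    # mmap_mode="r" maps the file instead of loading all of it.
    mapped = np.load(path, mmap_mode="r")
    # np.ix_ builds an open mesh so every (row, col) pair is selected.
    return np.array(mapped[np.ix_(rows, cols)])


# Toy data (hypothetical): a 4 x 5 matrix holding the values 0..19.
toy_path = os.path.join(tempfile.mkdtemp(), "toy_matrix.npy")
save_matrix(toy_path, np.arange(20).reshape(4, 5))

subset = read_submatrix(toy_path, rows=[0, 3], cols=[1, 4])
print(subset)
```

The trade-off is that the file is uncompressed and carries no labels, so it would sit alongside the TSV rather than replace it.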

A primary benchmark for judging implementations is how much time they save, across a variety of setups, compared with reading the entire bzipped TSV into Python via pandas.
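A rough sketch of how that baseline could be measured, assuming a TSV whose first column is the row index. The helper names and the randomly generated toy dataset are hypothetical stand-ins, and real timings would need to be taken on the actual datasets.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def read_full_tsv(path):
    """Baseline: read the whole bzipped TSV (compression inferred from .bz2)."""
    return pd.read_csv(path, sep="\t", index_col=0)


def read_subset_via_full_read(path, rows, cols):
    """Baseline subset lookup: read everything, then select by label."""
    return read_full_tsv(path).loc[rows, cols]


# Toy dataset (hypothetical labels and random values).
tsv_path = os.path.join(tempfile.mkdtemp(), "toy.tsv.bz2")
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.random((200, 50)),
    index=[f"sample_{i}" for i in range(200)],
    columns=[f"gene_{j}" for j in range(50)],
)
toy.to_csv(tsv_path, sep="\t")

rows, cols = ["sample_3", "sample_150"], ["gene_7", "gene_42"]
baseline_seconds = time_call(lambda: read_subset_via_full_read(tsv_path, rows, cols))
print(f"full read + subset: {baseline_seconds:.4f} s")
```

A candidate format would be timed with the same `time_call` helper on the same row and column selections, and the ratio of the two times gives the speedup.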

clairemcleod commented 7 years ago

Questions from the group at Tuesday night discussion: Do you anticipate complete randomness in the subselection (i.e. totally user selected), or is there some structure that governs what might be asked for? IOW, is chunking an option?

--> Perhaps a cached or database format might be more appropriate? Or microservice? An advantage of a microservice would be the ability to respond to demand.

dhimmel commented 7 years ago

> Do you anticipate complete randomness in the subselection (i.e. totally user selected)

Yes we should be prepared to serve any combination of rows.

> Perhaps a cached or database format might be more appropriate? Or microservice? An advantage of a microservice would be the ability to respond to demand.

I like solutions that don't require any running services. Life is so much easier when all you need is a single file. Another option is Feather, which is a binary format for storing dataframes. While it doesn't support indexed reading (reading only a subset of the overall dataset), it's supposedly really quick.

Currently, it's not too slow to read the full files, so this may be premature optimization... we could stick with TSV until it becomes a bottleneck?

clairemcleod commented 7 years ago

Tagging @stephenshank and @mike19106, who I think were both interested in this topic.

awm33 commented 7 years ago

We may be running a single job per worker instance at a time, with multiple jobs running concurrently via multiple instances. I like that idea mostly because jobs running in isolation are less likely to interfere with each other.

What makes that relevant to this discussion and https://github.com/cognoma/cognoma/issues/17 is that we can dedicate a decent amount of memory per job. So in-memory caching becomes more feasible.
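If each job does get a generous memory budget, one option is to parse a dataset once per worker process and answer later lookups from memory. A minimal sketch using only the standard library, assuming a bzipped TSV whose first column holds row labels and whose first line holds column labels. `load_matrix`, `lookup`, and the demo labels are hypothetical names, not existing Cognoma code.

```python
import bz2
import csv
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=4)
def load_matrix(path):
    """Parse a bzipped TSV once per process; repeat calls reuse the result."""
    with bz2.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)[1:]  # column labels; first cell names the index
        rows = {rec[0]: [float(v) for v in rec[1:]] for rec in reader}
    # The cached objects are shared between callers, so treat them as read-only.
    return header, rows


def lookup(path, row_ids, col_ids):
    """Return the requested cells as a list of lists, served from the cache."""
    header, rows = load_matrix(path)
    col_pos = [header.index(c) for c in col_ids]
    return [[rows[r][j] for j in col_pos] for r in row_ids]


# Tiny demo file (hypothetical labels and values).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv.bz2")
with bz2.open(demo_path, "wt", newline="") as handle:
    handle.write("sample_id\tgene_a\tgene_b\ns1\t1.0\t2.0\ns2\t3.0\t4.0\n")

first = lookup(demo_path, ["s2"], ["gene_b", "gene_a"])   # parses the file
second = lookup(demo_path, ["s1"], ["gene_a"])            # served from memory
print(first, second, load_matrix.cache_info())
```

Only the first lookup pays the parsing cost. The cache lives inside a single process, so each worker instance would build its own copy.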