cytomining / cytominer-database

A package for storing morphological profiling data.

Handle very large datasets (>100M single cells) #96

Open shntnu opened 6 years ago

shntnu commented 6 years ago

Come up with a solution for ingesting CSVs into a backend that is more scalable than SQLite.

Context: The package cytominer-database is a key component of image-based profiling workflows: it is a small Python-based command line tool that ingests data generated by CellProfiler into a database. Currently only SQLite is supported, but it would be great if we could use something more scalable. Addressing this issue would let researchers working with single-cell imaging data run queries and analyses across tens of millions of cells. This would be particularly useful for analyzing single-cell data across all plates in an experiment, or across multiple experiments.

E.g. a recent 135-plate experiment had 100M single cells, and there's currently no easy way to analyze a dataset of that size.

Here's how to get started on this issue:

gwaybio commented 5 years ago

I would just add that if there are any questions, please add them here or open a new issue. The intro steps look great 👍

shntnu commented 4 years ago

https://parquet.apache.org/ looks like a good option for this