If it doesn't already exist somewhere, it might be beneficial to use this issue to make data type mapping decisions for the cytomining repos, covering both data at-rest (files) and in-use (memory). For example, Pandas infers Pythonic datatypes from SQLite types when using read_sql. This can have downstream impacts, as discussed in #198. Types could be cast or validated as part of the conversion efforts.
A good example of why this matters are datetimes (datetime64, etc.).

A type mapping table could look something like the following. We might also want to call out how non-values (None, NULL, np.NaN, pd.NA, etc.) are handled.

SQLite Type | XXX Type |
---|---|
TEXT | XXX Object/String Type |
REAL | XXX Float Type |
... | ... |
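For illustration, a rough sketch of how such a mapping could be applied with pandas after read_sql's own inference; the table name, filename, and dtype choices below are placeholders, not a decided standard:

```python
# Hypothetical sketch: apply an explicit SQLite-to-pandas type mapping after
# read_sql's own inference. The mapping, filename, and table name are illustrative.
import sqlite3

import pandas as pd

SQLITE_TO_PANDAS = {
    "TEXT": "string",
    "REAL": "float64",
    "INTEGER": "Int64",  # nullable integer, so NULL values survive the cast
}

with sqlite3.connect("example.sqlite") as conn:
    # Declared column types come from the table schema...
    schema = pd.read_sql("PRAGMA table_info(Image)", conn)
    declared = dict(zip(schema["name"], schema["type"]))
    # ...while read_sql performs its own inference when loading the data.
    df = pd.read_sql("SELECT * FROM Image", conn)

# Cast (or validate) columns wherever an explicit mapping is defined.
casts = {
    column: SQLITE_TO_PANDAS[sqlite_type]
    for column, sqlite_type in declared.items()
    if sqlite_type in SQLITE_TO_PANDAS
}
df = df.astype(casts)
```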
I just had a chat with @staylorx - @d33bs I think you're all set to start up on this. Exciting! Thanks
Hi @gwaybio - I'm excited to dive into this! In thinking about this a bit more I'm wondering if there's any schema or standardized structure for the source data. A few questions to this effect:
Sorry for the delayed reply, my answers as follows:
What is the "original" data format prior to arriving in SQLite (application, file format, signals etc)?
Prior to SQLite, the data are a collection of .csv files derived from the program CellProfiler. The data represent single cell morphological profiles - in other words, various different measurements (hundreds or thousands of measurements) of what single cells look like based on microscopy images.
Does this data have a formal structure (including columns, datatypes, etc.) which could be referenced as part of this issue?
I'm not exactly sure how to answer this question - what do you mean more specifically by "formal"? It might be helpful to point you to the test files we use as input for creating the SQLite files: https://github.com/cytomining/cytominer-database/tree/master/tests/data_b
Also, if this structure is available, how dynamic is it (how often should we expect to grow, contract, or change at-will)?
The data are fairly stable, will range roughly between 500MB and 20GB, and are not likely to change much.
Thank you @gwaybio ! This is super helpful.
Looking through the sample datasets under the cytominer-database repo, I'm seeing some differences there in comparison to the SQLite file we worked with in #195 (and others). It looks similar enough, but I'd like to double check on this: is the image table field ImageNumber (or in SQLite this seems to appear as TableNumber) generally used throughout as a uniform reference key for other tables within these datasets? Stated a different way: if one wanted to see all data related to a single image, could they use this as a key to join or merge datasets (without additional transformation)?
is the image table field ImageNumber (or in SQLite this seems to appear as TableNumber) generally used throughout as a uniform reference key for other tables within these datasets?
These are two different columns, and, together with ObjectNumber, form uniform reference keys across tables.
Some more pointers:
- ObjectNumber, in combination with other identifiers depending on the biological compartment

Stated a different way: if one wanted to see all data related to a single image, could they use this as a key to join or merge datasets (without additional transformation)?

With the additional ObjectNumber identifier, yes
Thanks @gwaybio! A few follow-up questions based on what you mention:
Is ObjectNumber unique to each biological compartment, meaning that referencing the field from each compartment is akin to, for example, Cells_ObjectNumber, Nucleus_ObjectNumber, and Cytoplasm_ObjectNumber?
How would the Cells and Nuclei tables be cross-referenced with one another? For example, does one have to traverse the Cytoplasm.ObjectNumber as a bridge from Cells to Nuclei (or vice-versa)?
Perhaps an answer to both questions, IIUC:
ObjectNumber per compartment (Cells, Nucleus, Cytoplasm) is typically not unique. That is, aligning object numbers across compartments is likely to align measurements for the same cell, but it is not guaranteed. This is why we use other columns to link measurements.
How do the fields in the Image table relate (if at all) to the ObjectNumber fields in the biological compartment tables?
They don't! If I'm remembering correctly, the compartment tables will also have a TableNumber and ImageNumber that we use to link which image the individual cells were measured from.
Would it be accurate to say one Image (TableNumber, ImageNumber) combination has potentially many unique ObjectNumbers under each compartment?
Yes, exactly. Unless the image contained only one cell :)
Thanks @gwaybio for the data details - this is very helpful in understanding what we might do towards conversion. Based on this information I've created a sketch of how we could approach merging the tables as one single dataset (for export to parquet as a single file, discussed in #202). Within this sketch:
Rough table format:
TableNumber | ImageNumber | Cytoplasm_ObjectNumber | Cells_ObjectNumber | Nucleus_ObjectNumber | Image_Fields...(many) | Cytoplasm_Fields...(many) | Cells_Fields...(many) | Nucleus_Fields...(many) |
---|---|---|---|---|---|---|---|---|
STRING (Not Null) | INT64 (Not Null) | INT64 (Nullable) | INT64 (Nullable) | INT64 (Nullable) | Various (Populated for all ..._ObjectNumber == Null) | Various (Cytoplasm_ObjectNumber != Null) | Various (Cells_ObjectNumber != Null) | Various (Nucleus_ObjectNumber != Null) |
Example with data:
TableNumber | ImageNumber | Cytoplasm_ObjectNumber | Cells_ObjectNumber | Nucleus_ObjectNumber | Image_Fields...(many) | Cytoplasm_Fields...(many) | Cells_Fields...(many) | Nucleus_Fields...(many) |
---|---|---|---|---|---|---|---|---|
123abc | 1 | Null | Null | Null | Image Data... | Null | Null | Null |
123abc | 1 | 1 | Null | Null | Null | Cytoplasm Data... | Null | Null |
123abc | 1 | Null | 1 | Null | Null | Null | Cells Data... | Null |
123abc | 1 | Null | Null | 1 | Null | Null | Null | Nucleus Data... |
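To make the sketch a bit more concrete, here is a rough, hypothetical pandas version of the stacking; the table names, column prefixes, and filenames are assumptions for illustration rather than the actual converter:

```python
# Rough sketch of building the sparse single-table layout above by stacking the
# Image and compartment tables. Table names and column prefixes are assumptions.
import sqlite3

import pandas as pd

JOIN_KEYS = ["TableNumber", "ImageNumber"]

frames = []
with sqlite3.connect("SQ00014613.sqlite") as conn:
    for table in ["Image", "Cytoplasm", "Cells", "Nuclei"]:
        df = pd.read_sql(f"SELECT * FROM {table}", conn)
        # Prefix non-key columns with the table name so they stay distinct
        # after stacking; keep the shared join keys unprefixed.
        df = df.rename(
            columns={
                col: f"{table}_{col}" for col in df.columns if col not in JOIN_KEYS
            }
        )
        frames.append(df)

# Stacking (rather than joining) produces the sparse layout sketched above:
# each row holds values from one source table and nulls everywhere else.
sparse = pd.concat(frames, axis=0, ignore_index=True)
sparse.to_parquet("example.parquet")
```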
I'd welcome any feedback or input you may have on this! How does the above look? Could this work as a SQLite-to-parquet conversion format for this project?
This is great @d33bs - yes, I think this is on the right track.
Two other nuances that will be important to keep in mind, but shouldn't stall progress:
- Cytoplasm_ObjectNumber is not used to match cells and nuclei. The cytoplasm table has two additional columns, Cytoplasm_Parent_Cells and Cytoplasm_Parent_Nuclei (see default_linking_cols).
- Datasets are not limited to the Cells, Cytoplasm, and Nuclei compartments (e.g. Mito, Other, etc.). I think building to the standard first is the right approach, but it is worth keeping compartment modularity in mind.

Hi @gwaybio - something I'm noticing in working with related SQLite and CSV files is that TableNumber is sometimes not included within the sample CSVs. Could I ask where TableNumber is generated, and whether it's an optional field which may not always be present within SQLite files for conversion?
TableNumber is generated in CellProfiler, I believe. I am not familiar enough with the code base to point to it.
@bethac07 - can you provide a quick pointer? If you have to dig, no worries - I thought that you might just know.
TableNumber is actually not made by CellProfiler, at least in our typical use case (using ExportToSpreadsheet) - it's added in cytominer-database.

The reason for this column to exist is that, depending on how the data is passed into CellProfiler, some data input methods essentially "reset" ImageNumber - so instead of A01-Site1 being ImageNumber=1 and A01-Site2 being ImageNumber=2, both might be ImageNumber=1. This causes problems because ImageNumber is then no longer a unique key that we can, for example, map across to the object tables. So an extra "indexing" column is always added.
Thank you @bethac07 (and @gwaybio)! Based on what you mention, I'll continue to use TableNumber as a field referenced for conversion work. Generally, my plan is to use TableNumber and ImageNumber as the "joinable keys" for all tables within the existing SQLite databases (presuming ObjectNumber may not always be related across compartments).
Hi @gwaybio - I wanted to follow up with some findings and ask a few questions to navigate what conversion will look like. After testing a few different libraries and methods for this work, I'm finding there are significant resource implications. These implications can be broken down into a couple of categories seen below.
Both of these approaches have the goal of a single large and sparse dataset (4 tables into a single table or "frame") as previously discussed in this issue. They also presume no external resources - that the process may only consume local CPU and memory. I attempted to use alternatives to Pandas due to the benefits these sometimes offer when it comes to overcoming constraints similar to the ones we face.
I attempted to merge all four tables into one single frame (with none or none-like elements where appropriate), then export to a single parquet file. This turned out to require a large amount of memory to run successfully. In many cases, the procedure failed when using some alternatives to the approach below, I believe due to excessive memory usage. Some numbers here (note - I believe the excess above system memory was made available through disk-based caching):
Using single dataframes and files may not scale well because we'll eventually reach the physical limits of either memory or file size. To address this, we can use many dataframes and many files via chunking. When working with parquet, many libraries enable passing a directory or glob/wildcard conventions to read from multiple files as one cohesive dataset (see polars IO dealing with multiple files or pandas.read_parquet path spec references for more detail here).
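As a small illustration of that multi-file read, with hypothetical directory and glob names:

```python
# Hypothetical sketch: reading many chunked parquet files as one dataset.
# Directory and glob names are illustrative only.
import pandas as pd
import polars as pl

# pandas (via pyarrow) accepts a directory containing the chunked files...
df = pd.read_parquet("converted_chunks/")

# ...while polars accepts glob patterns (lazily here; nothing is read until collect()).
lazy_frame = pl.scan_parquet("converted_chunks/*.parquet")
```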
For some chunking tests, I chose to use identical dataframe and file chunking based on sets of 50 distinct/unique joinable keys (TableNumber and ImageNumber by default). For SQ00014613.sqlite, there are 3,455 distinct TableNumber + ImageNumber keys - meaning a resulting 70 parquet files total (as example_<chunknumber>.parquet). When reading the below, note that adjusting to lower chunk sizes (for example, < 50) would theoretically mean reduced memory and increased time duration (via more chunks, more read/writes, more files). If this approach is something we opt for, adjusting chunk size might be something for the researcher or developer to consider based on their environment and data source.
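A rough sketch of this chunked approach follows; the filenames, chunk size, and single-table read are illustrative only, and a real implementation would repeat the per-chunk read for each compartment table before writing:

```python
# Hypothetical sketch of key-based chunking: write one parquet file per group
# of 50 distinct (TableNumber, ImageNumber) keys. Names and sizes are illustrative.
import sqlite3

import pandas as pd

CHUNK_SIZE = 50  # distinct joinable keys per output file

with sqlite3.connect("SQ00014613.sqlite") as conn:
    keys = pd.read_sql("SELECT DISTINCT TableNumber, ImageNumber FROM Image", conn)

    for chunk_number, start in enumerate(range(0, len(keys), CHUNK_SIZE)):
        chunk_keys = keys.iloc[start : start + CHUNK_SIZE]
        # Build a parameterized filter matching only this chunk's key pairs.
        placeholders = " OR ".join(
            ["(TableNumber = ? AND ImageNumber = ?)"] * len(chunk_keys)
        )
        params = [
            value.item() if hasattr(value, "item") else value  # numpy -> Python scalars
            for row in chunk_keys.itertuples(index=False)
            for value in row
        ]
        # Only the Image table is shown; a real converter would do the same per
        # compartment table and merge or stack before writing.
        chunk_df = pd.read_sql(
            f"SELECT * FROM Image WHERE {placeholders}", conn, params=params
        )
        chunk_df.to_parquet(f"example_{chunk_number}.parquet")
```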
Based on the above findings, I have some questions:
Hi @d33bs - thanks for this deep dive. The memory requirements are a bit concerning. Is it possible to stream elements of the SQLite into a continuously written parquet file? I think 1) we'd sacrifice time to save memory, and 2) we'll only need to use this SQLite converter tool for legacy data (after we update cytominer-database/transport to output parquet directly).
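For reference, a minimal sketch of what such streaming might look like, assuming pyarrow's ParquetWriter and pandas' chunksize (the filename, table, and chunk size are illustrative):

```python
# Minimal sketch: stream rows out of SQLite and append them to a single,
# continuously written parquet file. Names and chunk size are illustrative.
import sqlite3

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
with sqlite3.connect("SQ00014613.sqlite") as conn:
    for chunk in pd.read_sql("SELECT * FROM Image", conn, chunksize=10_000):
        table = pa.Table.from_pandas(chunk, preserve_index=False)
        if writer is None:
            # Open the writer lazily so the schema comes from the first chunk;
            # a real converter would pin the schema up front so later chunks
            # (e.g. all-NULL columns) still match.
            writer = pq.ParquetWriter("example.parquet", table.schema)
        writer.write_table(table)
if writer is not None:
    writer.close()
```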
I'll also answer your questions below:
What resources (memory, CPU, etc.) may we / should we expect for a machine running Pycytominer SQLite conversion work?
Average computer: 64GB, 12-16 core CPU
With resources considered, is there a time duration that we should seek to remain under?
No hard duration, but hours per file seems ok
How do you feel about a multi-file single dataset (moving away from single-file format)?
Multi-file single dataset is not ideal for distribution, so we'd prefer single-file. This conversion tool likely won't be used in a new pipeline (only for legacy datasets). The pipeline from CellProfiler/DeepProfiler -> Cytominer-database/transport goes from multi-files to a single-file. Are there benefits to a multi-file that I'm not thinking of?
Would a single dataset this large be optimal for Pycytominer work?
This file is on the smaller side. They're usually 10-20GB (I've seen some go as high as 50GB!)
10GB post single-cell merge is perfect
Considerations: On the machine mentioned for the tests above, Pandas on its own struggled to read the resulting full dataset from parquet. Reading a resulting dataset this large would likely benefit from "lazy" (delayed until compute) and/or partitioned queries (specific columns or other constraints) in order to be effective. For example, I used Polars to this effect here (taking advantage of Arrow and Rust performance through Python). Would it be beneficial to explore libraries which enable this in context with existing code, the developer community, supportability, etc.?
Yes! We are definitely interested in these performance upgrades. While it is worth keeping in mind, lazy computation post data conversion is beyond the scope of this issue (which is focused on the converter tool).
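(For context, the kind of lazy, partitioned read mentioned above might look like the following polars sketch; the file glob and column names are hypothetical.)

```python
# Hypothetical sketch of a lazy, partitioned read over chunked parquet files:
# nothing is loaded until collect(), and only the selected columns/rows are read.
import polars as pl

result = (
    pl.scan_parquet("example_*.parquet")
    .select(["TableNumber", "ImageNumber", "Cells_ObjectNumber"])
    .filter(pl.col("ImageNumber") == 1)
    .collect()
)
```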
How might this dataset be compatible (or otherwise) with other Cytomining community work?
It is on the smaller side; 10-20GB is standard (I've seen SQLite up to 50GB!)
Both of these approaches have the goal of a single large and sparse dataset
Our data are not sparse; they contain mostly non-zero elements.
Now that @bunnech has a great solution for #195 (h/t @johnarevalo), would it be useful to create a tool right here in pycytominer to convert SQLite to Parquet? @bunnech already has code she is willing to contribute (as a stub, but she might not have the bandwidth to take it all the way through right now)
Ideally, we'd create a separate tool (as @gwaybio suggested https://github.com/cytomining/pycytominer/issues/205#issue-1265100731) in https://github.com/cytomining/ after fully thinking through the design. But I wonder if it's wiser to start with what's easiest right now, which is to take @bunnech's script, clean it up a bit, and create a prototype that will help us think through how we can perfect it over time.
I believe @d33bs is working on this! My understanding is that it is almost done. @d33bs - can you comment?
That’s fantastic - let us know when you get the chance @d33bs.
Current status in #213 FWIW (in case you're curious @shntnu !)
Hi @shntnu and @gwaybio - I feel like the best thing to do would be for me to close #213, taking into account the performance improvements enabled via #219 from @bunnech. I'd love to assist with ongoing efforts here once the stub is in place.
Perfect! Ok, here's how I propose we move forward:
From @shntnu:
take @bunnech's script, clean it up a bit, and create a prototype that will help us think through how we can perfect it over time. @bunnech already has code she is willing to contribute (as a stub, but she might not have the bandwidth to take it all the way through right now)
Sounds great. @bunnech is this something you can take on? Can you file a pull request adding your prototype?
From @d33bs:
I'd love to assist with ongoing efforts here once the stub is in place.
Sounds good! I bet that once @bunnech posts the prototype, you can take charge and apply your learnings from #213 (especially in regards to testing) to take this through to merge.
Yes, sounds good to me. I will add @d33bs as a co-author to the PR already. Will prepare something later today!
@d33bs - should we close this issue?
@gwaybio - yes, I feel this can be closed. I think the work which @bunnech added is great and covers things well. Additionally, the work happening in cytomining/CytoTable#5 (among others) will hopefully add additional capabilities in the near future.
In #202 we discuss how building a SQLite2XXX (where XXX represents our preferred pycytominer data type) would enable us to sunset SQLite dependencies in this repo.
Any tool that we build will likely contain some (or all) of the SQLite hygiene functions @d33bs wrote in #203
I think this tool would be a new package in cytomining and exist outside of pycytominer