If it doesn't already exist somewhere, it might be beneficial to use this issue to make data type mapping decisions for the cytomining repos, covering both data at-rest (files) and in-use (memory). For example, Pandas infers Pythonic datatypes from SQLite types when using read_sql. This can have downstream impacts, as discussed in #198. Types could be cast or validated as part of the conversion efforts.
A good example of why this matters are datetimes (datetime64, etc.).

A type mapping table could look something like the following. We might also want to call out how non-values (None, NULL, np.NaN, pd.NA, etc.) are handled.

SQLite Type | XXX Type |
---|---|
TEXT | XXX Object/String Type |
REAL | XXX Float Type |
... | ... |
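For illustration, a rough sketch of how such a mapping could be applied with pandas after read_sql's own inference; the table name, filename, and dtype choices below are placeholders, not a decided standard:

```python
# Hypothetical sketch: apply an explicit SQLite-to-pandas type mapping after
# read_sql's own inference. The mapping, filename, and table name are illustrative.
import sqlite3

import pandas as pd

SQLITE_TO_PANDAS = {
    "TEXT": "string",
    "REAL": "float64",
    "INTEGER": "Int64",  # nullable integer, so NULL values survive the cast
}

with sqlite3.connect("example.sqlite") as conn:
    # Declared column types come from the table schema...
    schema = pd.read_sql("PRAGMA table_info(Image)", conn)
    declared = dict(zip(schema["name"], schema["type"]))
    # ...while read_sql performs its own inference when loading the data.
    df = pd.read_sql("SELECT * FROM Image", conn)

# Cast (or validate) columns wherever an explicit mapping is defined.
casts = {
    column: SQLITE_TO_PANDAS[sqlite_type]
    for column, sqlite_type in declared.items()
    if sqlite_type in SQLITE_TO_PANDAS
}
df = df.astype(casts)
```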
I just had a chat with @staylorx - @d33bs I think you're all set to start up on this. Exciting! Thanks
Hi @gwaybio - I'm excited to dive into this! In thinking about this a bit more I'm wondering if there's any schema or standardized structure for the source data. A few questions to this effect:
Sorry for the delayed reply, my answers as follows:
What is the "original" data format prior to arriving in SQLite (application, file format, signals etc)?
Prior to SQLite, the data are a collection of .csv files derived from the program CellProfiler. The data represent single cell morphological profiles - in other words, various different measurements (hundreds or thousands of measurements) of what single cells look like based on microscopy images.
Does this data have a formal structure (including columns, datatypes, etc.) which could be referenced as part of this issue?
I'm not exactly sure how to answer this question - what do you mean more specifically by "formal"? It might be helpful to point you to the test files we use as input for creating the SQLite files: https://github.com/cytomining/cytominer-database/tree/master/tests/data_b
Also, if this structure is available, how dynamic is it (how often should we expect to grow, contract, or change at-will)?
The data are fairly stable, will range roughly between 500MB and 20GB, and are not likely to change much.
Thank you @gwaybio ! This is super helpful.
Looking through the sample datasets under the cytominer-database repo, I'm seeing some differences there in comparison to the SQLite file we worked with in #195 (and others). It looks similar enough, but I'd like to double check on this: is the image table field ImageNumber (or in SQLite this seems to appear as TableNumber) generally used throughout as a uniform reference key for other tables within these datasets? Stated a different way: if one wanted to see all data related to a single image, could they use this as a key to join or merge datasets (without additional transformation)?
is the image table field ImageNumber (or in SQLite this seems to appear as TableNumber) generally used throughout as a uniform reference key for other tables within these datasets?
These are two different columns, and, together with ObjectNumber, form uniform reference keys across tables.
Some more pointers:
- ObjectNumber, in combination with other identifiers depending on the biological compartment

Stated a different way: if one wanted to see all data related to a single image, could they use this as a key to join or merge datasets (without additional transformation)?

With the additional ObjectNumber identifier, yes
Thanks @gwaybio! A few follow-up questions based on what you mention:
Is ObjectNumber unique to each biological compartment, meaning that referencing the field from each compartment is akin to, for example, Cells_ObjectNumber, Nucleus_ObjectNumber, and Cytoplasm_ObjectNumber?
How would the Cells and Nuclei tables be cross-referenced with one another? For example, does one have to traverse the Cytoplasm.ObjectNumber as a bridge from Cells to Nuclei (or vice-versa)?
Perhaps an answer to both questions, IIUC:
ObjectNumber per compartment (Cells, Nucleus, Cytoplasm) is typically not unique. That is, aligning object numbers across compartments is likely to align measurements for the same cell, but it is not guaranteed. This is why we use other columns to link measurements.
How do the fields in the Image table relate (if at all) to the ObjectNumber fields in the biological compartment tables?
They don't! If I'm remembering correctly, the compartment tables will also have a TableNumber and ImageNumber that we use to link which image the individual cells were measured from.
Would it be accurate to say one Image (TableNumber, ImageNumber) combination has potentially many unique ObjectNumbers under each compartment?
Yes, exactly. Unless the image contained only one cell :)
Thanks @gwaybio for the data details - this is very helpful in understanding what we might do towards conversion. Based on this information I've created a sketch of how we could approach merging the tables as one single dataset (for export to parquet as a single file, discussed in #202). Within this sketch:
Rough table format:
TableNumber | ImageNumber | Cytoplasm_ObjectNumber | Cells_ObjectNumber | Nucleus_ObjectNumber | Image_Fields...(many) | Cytoplasm_Fields...(many) | Cells_Fields...(many) | Nucleus_Fields...(many) |
---|---|---|---|---|---|---|---|---|
STRING (Not Null) | INT64 (Not Null) | INT64 (Nullable) | INT64 (Nullable) | INT64 (Nullable) | Various (Populated for all ..._ObjectNumber == Null) | Various (Cytoplasm_ObjectNumber != Null) | Various (Cells_ObjectNumber != Null) | Various (Nucleus_ObjectNumber != Null) |
Example with data:
TableNumber | ImageNumber | Cytoplasm_ObjectNumber | Cells_ObjectNumber | Nucleus_ObjectNumber | Image_Fields...(many) | Cytoplasm_Fields...(many) | Cells_Fields...(many) | Nucleus_Fields...(many) |
---|---|---|---|---|---|---|---|---|
123abc | 1 | Null | Null | Null | Image Data... | Null | Null | Null |
123abc | 1 | 1 | Null | Null | Null | Cytoplasm Data... | Null | Null |
123abc | 1 | Null | 1 | Null | Null | Null | Cells Data... | Null |
123abc | 1 | Null | Null | 1 | Null | Null | Null | Nucleus Data... |
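To make the sketch a bit more concrete, here is a rough, hypothetical pandas version of the stacking; the table names, column prefixes, and filenames are assumptions for illustration rather than the actual converter:

```python
# Rough sketch of building the sparse single-table layout above by stacking the
# Image and compartment tables. Table names and column prefixes are assumptions.
import sqlite3

import pandas as pd

JOIN_KEYS = ["TableNumber", "ImageNumber"]

frames = []
with sqlite3.connect("SQ00014613.sqlite") as conn:
    for table in ["Image", "Cytoplasm", "Cells", "Nuclei"]:
        df = pd.read_sql(f"SELECT * FROM {table}", conn)
        # Prefix non-key columns with the table name so they stay distinct
        # after stacking; keep the shared join keys unprefixed.
        df = df.rename(
            columns={
                col: f"{table}_{col}" for col in df.columns if col not in JOIN_KEYS
            }
        )
        frames.append(df)

# Stacking (rather than joining) produces the sparse layout sketched above:
# each row holds values from one source table and nulls everywhere else.
sparse = pd.concat(frames, axis=0, ignore_index=True)
sparse.to_parquet("example.parquet")
```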
I'd welcome any feedback or input you may have on this! How does the above look? Could this work as a SQLite-to-parquet conversion format for this project?
This is great @d33bs - yes, I think this is on the right track.
Two other nuances that will be important to keep in mind, but shouldn't stall progress:
- Cytoplasm_ObjectNumber is not used to match cells and nuclei. The cytoplasm table has two additional columns, Cytoplasm_Parent_Cells and Cytoplasm_Parent_Nuclei (see default_linking_cols).
- Datasets are not limited to the Cells, Cytoplasm, and Nuclei compartments (e.g. Mito, Other, etc.). I think building to the standard first is the right approach, but it is worth keeping compartment modularity in mind.

Hi @gwaybio - something I'm noticing in working with related SQLite and CSV files is that TableNumber is sometimes not included within the sample CSVs. Could I ask where TableNumber is generated, and whether it's an optional field which may not always be present within SQLite files for conversion?
TableNumber is generated in CellProfiler, I believe. I am not familiar enough with the code base to point to it.
@bethac07 - can you provide a quick pointer? If you have to dig, no worries - I thought that you might just know.
TableNumber is actually not made by CellProfiler, at least in our typical use case (using ExportToSpreadsheet) - it's added in cytominer-database.

The reason for this column to exist is that, depending on how the data is passed into CellProfiler, some data input methods essentially "reset" ImageNumber - so instead of A01-Site1 being ImageNumber=1 and A01-Site2 being ImageNumber=2, both might be ImageNumber=1. This causes problems because ImageNumber is then no longer a unique key that we can, for example, map across to the object tables. So an extra "indexing" column is always added.
Thank you @bethac07 (and @gwaybio)! Based on what you mention, I'll continue to use TableNumber as a field referenced for conversion work. Generally, my plan is to use TableNumber and ImageNumber as the "joinable keys" for all tables within the existing SQLite databases (presuming ObjectNumber may not always be related across compartments).
Hi @gwaybio - I wanted to follow up with some findings and ask a few questions to navigate what conversion will look like. After testing a few different libraries and methods for this work, I'm finding there are significant resource implications. These implications can be broken down into a couple of categories seen below.
Both of these approaches have the goal of a single large and sparse dataset (4 tables into a single table or "frame") as previously discussed in this issue. They also presume no external resources - that the process may only consume local CPU and memory. I attempted to use alternatives to Pandas due to the benefits these sometimes offer when it comes to overcoming constraints similar to the ones we face.
I attempted to merge all four tables into one single frame (with none or none-like elements where appropriate), then export to a single parquet file. This turned out to require a large amount of memory to run successfully. In many cases, the procedure failed when using some alternatives to the approach below, I believe due to excessive memory usage. Some numbers here (note - I believe the excess above system memory was made available through disk-based caching):
Using single dataframes and files may not scale well because we'll eventually reach the physical limits of either memory or file size. To address this, we can use many dataframes and many files via chunking. When working with parquet, many libraries enable passing a directory or glob/wildcard conventions to read from multiple files as one cohesive dataset (see polars IO dealing with multiple files or pandas.read_parquet path spec references for more detail here).
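As a small illustration of that multi-file read, with hypothetical directory and glob names:

```python
# Hypothetical sketch: reading many chunked parquet files as one dataset.
# Directory and glob names are illustrative only.
import pandas as pd
import polars as pl

# pandas (via pyarrow) accepts a directory containing the chunked files...
df = pd.read_parquet("converted_chunks/")

# ...while polars accepts glob patterns (lazily here; nothing is read until collect()).
lazy_frame = pl.scan_parquet("converted_chunks/*.parquet")
```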
For some chunking tests, I chose to use identical dataframe and file chunking based on sets of 50 distinct/unique joinable keys (TableNumber and ImageNumber by default). For SQ00014613.sqlite, there are 3,455 distinct TableNumber + ImageNumber keys - meaning a resulting 70 parquet files total (as example_<chunknumber>.parquet). When reading the below, note that adjusting to lower chunk sizes (for example, < 50) would theoretically mean reduced memory and increased time duration (via more chunks, more read/writes, more files). If this approach is something we opt for, adjusting chunk size might be something for the researcher or developer to consider based on their environment and data source.
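A rough sketch of this chunked approach follows; the filenames, chunk size, and single-table read are illustrative only, and a real implementation would repeat the per-chunk read for each compartment table before writing:

```python
# Hypothetical sketch of key-based chunking: write one parquet file per group
# of 50 distinct (TableNumber, ImageNumber) keys. Names and sizes are illustrative.
import sqlite3

import pandas as pd

CHUNK_SIZE = 50  # distinct joinable keys per output file

with sqlite3.connect("SQ00014613.sqlite") as conn:
    keys = pd.read_sql("SELECT DISTINCT TableNumber, ImageNumber FROM Image", conn)

    for chunk_number, start in enumerate(range(0, len(keys), CHUNK_SIZE)):
        chunk_keys = keys.iloc[start : start + CHUNK_SIZE]
        # Build a parameterized filter matching only this chunk's key pairs.
        placeholders = " OR ".join(
            ["(TableNumber = ? AND ImageNumber = ?)"] * len(chunk_keys)
        )
        params = [
            value.item() if hasattr(value, "item") else value  # numpy -> Python scalars
            for row in chunk_keys.itertuples(index=False)
            for value in row
        ]
        # Only the Image table is shown; a real converter would do the same per
        # compartment table and merge or stack before writing.
        chunk_df = pd.read_sql(
            f"SELECT * FROM Image WHERE {placeholders}", conn, params=params
        )
        chunk_df.to_parquet(f"example_{chunk_number}.parquet")
```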
Based on the above findings, I have some questions:
Hi @d33bs - thanks for this deep dive. The memory requirements are a bit concerning. Is it possible to stream elements of the SQLite into a continuously written parquet file? I think 1) we'd sacrifice time to save memory, and 2) we'll only need to use this SQLite converter tool for legacy data (after we update cytominer-database/transport to output parquet directly).
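For reference, a minimal sketch of what such streaming might look like, assuming pyarrow's ParquetWriter and pandas' chunksize (the filename, table, and chunk size are illustrative):

```python
# Minimal sketch: stream rows out of SQLite and append them to a single,
# continuously written parquet file. Names and chunk size are illustrative.
import sqlite3

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
with sqlite3.connect("SQ00014613.sqlite") as conn:
    for chunk in pd.read_sql("SELECT * FROM Image", conn, chunksize=10_000):
        table = pa.Table.from_pandas(chunk, preserve_index=False)
        if writer is None:
            # Open the writer lazily so the schema comes from the first chunk;
            # a real converter would pin the schema up front so later chunks
            # (e.g. all-NULL columns) still match.
            writer = pq.ParquetWriter("example.parquet", table.schema)
        writer.write_table(table)
if writer is not None:
    writer.close()
```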
I'll also answer your questions below:
What resources (memory, CPU, etc.) may we / should we expect for a machine running Pycytominer SQLite conversion work?
Average computer: 64GB, 12-16 core CPU
With resources considered, is there a time duration that we should seek to remain under?
No hard duration, but hours per file seems ok
How do you feel about a multi-file single dataset (moving away from single-file format)?
Multi-file single dataset is not ideal for distribution, so we'd prefer single-file. This conversion tool likely won't be used in a new pipeline (only for legacy datasets). The pipeline from CellProfiler/DeepProfiler -> Cytominer-database/transport goes from multi-files to a single-file. Are there benefits to a multi-file that I'm not thinking of?
Would a single dataset this large be optimal for Pycytominer work?
This file is on the smaller side. They're usually 10-20GB (I've seen some go as high as 50GB!)
10GB post single-cell merge is perfect
Considerations: On the machine mentioned for the tests above, Pandas on its own struggled to read the resulting full dataset from parquet. Reading a resulting dataset this large would likely benefit from "lazy" (delayed until compute) and/or partitioned queries (specific columns or other constraints) in order to be effective. For example, I used Polars to this effect here (taking advantage of Arrow and Rust performance through Python). Would it be beneficial to explore libraries which enable this in context with existing code, the developer community, supportability, etc.?
Yes! We are definitely interested in these performance upgrades. While it is worth keeping in mind, lazy computation post data conversion is beyond the scope of this issue (which is focused on the converter tool).
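(For context, the kind of lazy, partitioned read mentioned above might look like the following polars sketch; the file glob and column names are hypothetical.)

```python
# Hypothetical sketch of a lazy, partitioned read over chunked parquet files:
# nothing is loaded until collect(), and only the selected columns/rows are read.
import polars as pl

result = (
    pl.scan_parquet("example_*.parquet")
    .select(["TableNumber", "ImageNumber", "Cells_ObjectNumber"])
    .filter(pl.col("ImageNumber") == 1)
    .collect()
)
```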
How might this dataset be compatible (or otherwise) with other Cytomining community work?
It is on the smaller side; 10-20GB is standard (I've seen SQLite up to 50GB!)
Both of these approaches have the goal of a single large and sparse dataset
Our data are not sparse; they contain mostly non-zero elements.
Now that @bunnech has a great solution for #195 (h/t @johnarevalo), would it be useful to create a tool right here in pycytominer to convert SQLite to Parquet? @bunnech already has code she is willing to contribute (as a stub, but she might not have the bandwidth to take it all the way through right now)
Ideally, we'd create a separate tool (as @gwaybio suggested https://github.com/cytomining/pycytominer/issues/205#issue-1265100731) in https://github.com/cytomining/ after fully thinking through the design. But I wonder if it's wiser to start with what's easiest right now, which is to take @bunnech's script, clean it up a bit, and create a prototype that will help us think through how we can perfect it over time.
I believe @d33bs is working on this! My understanding is that it is almost done. @d33bs - can you comment?
That’s fantastic - let us know when you get the chance @d33bs.
Current status in #213 FWIW (in case you're curious @shntnu !)
Hi @shntnu and @gwaybio - I feel like the best thing to do would be for me to close #213, taking into account the performance improvements enabled via #219 from @bunnech. I'd love to assist with ongoing efforts here once the stub is in place.
Perfect! Ok, here's how I propose we move forward:
From @shntnu:
take @bunnech's script, clean it up a bit, and create a prototype that will help us think through how we can perfect it over time. @bunnech already has code she is willing to contribute (as a stub, but she might not have the bandwidth to take it all the way through right now)
Sounds great. @bunnech is this something you can take on? Can you file a pull request adding your prototype?
From @d33bs:
I'd love to assist with ongoing efforts here once the stub is in place.
Sounds good! I bet that once @bunnech posts the prototype, you can take charge and apply your learnings from #213 (especially in regards to testing) to take this through to merge.
Yes, sounds good to me. I will add @d33bs as a co-author to the PR already. Will prepare something later today!
@d33bs - should we close this issue?
@gwaybio - yes, I feel this can be closed. I think the work which @bunnech added is great and covers things well. Additionally, the work happening in cytomining/CytoTable#5 (among others) will hopefully add additional capabilities in the near future.
In #202 we discuss how building a SQLite2XXX (where XXX represents our preferred pycytominer data type) would enable us to sunset SQLite dependencies in this repo.
Any tool that we build will likely contain some (or all) of the SQLite hygiene functions @d33bs wrote in #203
I think this tool would be a new package in cytomining and exist outside of pycytominer