Scheduling a SQLite deprecation in pycytominer

cytomining / pycytominer

Python package for processing image-based profiling data

https://pycytominer.readthedocs.io

BSD 3-Clause "New" or "Revised" License

79 stars 35 forks source link

Scheduling a SQLite deprecation in pycytominer #202

Open gwaybio opened 2 years ago

gwaybio commented 2 years ago

In #198 @d33bs states:

Setting a goal post of when SQLite will be unsupported might help make some of the judgements surrounding what to do presently [specifically for handling memory leak #195]

It has been well documented in many previous issues (in this repo and others) that we need to move away from SQLite towards a more performant data type. This requires some level of deprecation (either complete or partial).

Let's use this issue to describe:

What should happen with SQLite support?
When should this happen?

I believe that we need more info from key stakeholders to help us answer. @bethac07 can provide much of this insight. @bethac07 - what are your resource needs for future data types? What SQLite support would you like to see in pycytominer?

Note @d33bs - I think that we can safely proceed with some elements of our plan in #198 (e.g. function for cleaning existing SQLite, upgrading single cell merges, etc.). I don't want this issue to block any progress - let's use the detail outlined here as a guide on how to maintain critical SQLIte support and when to release new pycytominer versions. Old pycytominer versions (e.g. v0.1) will retain full SQLite support.

bethac07 commented 2 years ago

I think it would be good to loop in @shntnu as well here.

I can understand wanting to also support other data sources (ie parquet), but is there a reason at this time to HAVE to sunset SQLite support? For better or for worse, for at least the forseeable future it's what a lot of the "older" data sets in the field are going to have available in terms of single cell data, so you'll either lose the ability to work with those with new and awesome pycytominer stuff past 0.1 or lose the ability to compare against these sets until you do conversions.

It doesn't look like SQLite is terribly woven throughout the codebase - from what I can tell, it's literally only in cells.py (and some tests, and in collate but we aren't sure if that will live in this repo forever). My proposed solution whether or not you decide to sunset SQLite anytime soon or not would be to essentially extract out a "data reader" module or class that just handles data reading and formatting instead of having that be part of the SingleCells class; that way everything else in pycytominer remains source-agnostic, and you can easily add or remove source formats as you wish.

shntnu commented 2 years ago

Disclaimer: I've not been in the trenches for a while on this front :) @bethac07 will have a much better sense of ground realities than me.

I skimmed #198, and it sounds like the future alternative format is TBD. I'll assume this format is Parquet (for this discussion)

What SQLite support would you like to see in pycytominer? What should happen with SQLite support?

If we have a SQLite2Parquet converter, I think it's ok to sunset SQLite entirely (and it would be worth the whatever rewrite pycytominer needs to make it happen)

When should this happen?

IMO as soon as a SQLite2Parquet converter has been written and tested, and https://github.com/cytomining/profiling-recipe/blob/d7012751cf24e9aa1325e099b03aaf74fe2578d6/profiles/profile.py#L54 has been updated to optionally convert the SQLite to Parquet

what are your resource needs for future data types?

I didn't understand this question, but I think you're asking: what are the requirements for the SQLite replacement? Beth would have a much better handle on that

gwaybio commented 2 years ago

Thanks for this discussion! I'll respond to several comments below, but my overall takeaway for the answer to scheduling a SQLite deprecation is: immediately, but with direction on how to do so.

Continued discussion responses

but is there a reason at this time to HAVE to sunset SQLite support?

We'd like to make progress with pycytominer (and the full pipeline), and in order to do this, we need to move away from this file format. An alternative approach is to build a completely separate tool (using lessons learned in pycytominer), but we'd prefer to continue long-term support for CellProfiler-based Cell Painting analyses within cytomining.

My proposed solution whether or not you decide to sunset SQLite anytime soon or not would be to essentially extract out a "data reader" module or class that just handles data reading and formatting instead of having that be part of the SingleCells class

If we have a SQLite2Parquet converter, I think it's ok to sunset SQLite entirely (and it would be worth the whatever rewrite pycytominer needs to make it happen)

Yes! I think we should pursue both suggestions. This is the how feedback I was looking for! 👀

Let's develop aSQLite2XXX converter package (plus implementing collate in the pipeline) and sunset SQLite in pycytominer completely. The pipeline will use the converter (if necessary), thus enabling long-term SQLite support while removing SQLite dependency in pycytominer.

Let's also refactor to extract a data reader module from SingleCells - we can enforce data type inputs, and throw warnings/point to solutions if alternative data types are attempted.

bethac07 commented 2 years ago

I just want to make sure also though to put this problem in a larger context- I think an SQLiteToX (where X is Parquet or whatever else) is a good idea no matter what, if only for legacy data.

But we should also consider if whatever it is we were going to build, it makes sense to also think about what we still expect to be the most common data source for a while, which is "per site CSVs coming from CellProfiler". Doing CSV -> SQLite -> X is fine, but if it helps in the selection of X or if now is the right time to talk about it as we're thinking about new refactored data readers, CSV -> X is obviously better. I'm also hoping that this will finally be the year we do the big CellProfiler Export module overhaul, so hopefully we can write to X directly too in some circumstances, but at least for the forseeable I think CSVs are a safe "everything should be able to read and write them" option.

shntnu commented 2 years ago

Doing CSV -> SQLite -> X is fine, but if it helps in the selection of X or if now is the right time to talk about it as we're thinking about new refactored data readers, CSV -> X is obviously better

Ah, in my note above I had kinda assumed that creating a CSV -> X tool is part of the plan here and that we'd never ever need to do CSV -> SQLite -> X.

@gwaybio is creating a CSV -> X tool (i.e. a cytominer-database replacement) a part of your plan here?

bethac07 commented 2 years ago

My straw proposal for what the right thing to do here is to try to finish cytominer-transport . When we last left it off 14 months ago, it was working but slow; I think I have some ideas on how to make it less slow.

gwaybio commented 2 years ago

is creating a CSV -> X tool (i.e. a cytominer-database replacement) a part of your plan here?

It's definitely part of the conversation, as it pertains to resource needs for future datasets.

I think finishing cytominer-transport is the way to go, and what we learn as we go should inform some pycytominer decisions.

The development of cytominer-transport, however, should not impact pycytominer progress. The plan to sunset SQLite in the manner described here https://github.com/cytomining/pycytominer/issues/202#issuecomment-1150211828 removes the development tangle that might prevent progress in one package vs. another.

shntnu commented 2 years ago

The plan to sunset SQLite in the manner described here #202 (comment) removes the development tangle that might prevent progress in one package vs. another.

I couldn't figure out if the sunset plan requires you to first decide what X should be.

Either way, very glad you're on top of this @gwaybio!

gwaybio commented 2 years ago

It will eventually, we don't know what it is yet, and I know @d33bs has a bunch of ideas :) (some descriptions in #195 )

d33bs commented 2 years ago

Just wanted to chime in here to offer up some ideas. There's an opportunity to do some great things in this project with scalable dataframes from an in-memory perspective, and many of these include or are compatible with parquet files as exports. Parquet looks to be used in at least some of this project which is great news for compatibility with these sorts of things 🙂 .

Parquet is pretty great when it comes to both performance, file-size, and compatibility with other projects. It's compatibility makes it reasonable to think we could convert from it once a better alternative is known, for ex as parquet2xxx. That said, would it make sense to somewhat formalize on Parquet as the in-file format for this data for the time being? Maybe we could leave things open-ended and focus on the dataflow we need now as a repo called sqlite-converter (or continue work on an existing one)?

If this seems like a good idea, what type of data-to-file structure should we expect? Some options I can think of:

1-to-1: one table becomes one parquet file
- Some of this work may be resolved already with an existing repo: sqlite2parquet (more testing needed here).
1-to-many: one table becomes many chunks into parquet files (scale outward so as avoid limitations of single files)
- For example, see Ray's documentation on Saving Datasets
many-to-1: many tables become one parquet file (likely involves multidimensional data transforms)
- More below...

Multidimensional Thoughts

Working within the contexts of multidimensional table(s) could be helpful for calculations and also for encapsulating the file-based version of the data. This wouldn't be without challenges, as it may completely change how operations are performed. Ideally this would open up roads that otherwise might not exist, and where it's limiting, we could abstract or transform to a more familiar 2-dimensional representation of the data (maybe as a query, or a class method, etc).

A sketch of what this might look like:

Unique Index

Cells

Cytoplasm

Nucleus

cells_col1	cells_colx...
0.1	0.5

cyto_col1	cyto_colx...
0.2	0.8

nuc_col1	nuc_colx...
0.05	0.7

...

Some ideas in this space:

xarray could offer the ability to store and retrieve this type of data via the netCDF format.
Tensors there are many different types of tensor types, Arrow might be a good choice for other in-memory operations that could enhance performance for this repo.
(This feels like a bit of a leap, but wanted to share just the same) Geometric data formats come to mind: what does the "cartographic map" of the cellular image data contain? How well structured is the "neighborhood" or "network" of the cellular image? GeoPandas extends Pandas to include GeoSeries which include Shapely objects. Geodataframes may be exported as various filetypes, parquet included. Even if the types don't align well with the data involved here, we could probably learn from GeoPanda's I/O work and possibly loop that into development progress somehow.

gwaybio commented 2 years ago

Thanks @d33bs! Some additional discussion:

Parquet is pretty great when it comes to both performance, file-size, and compatibility with other projects. It's compatibility makes it reasonable to think we could convert from it once a better alternative is known, for ex as parquet2xxx. That said, would it make sense to somewhat formalize on Parquet as the in-file format for this data for the time being?

Yes, I think this makes sense to do. We've deliberated on parquet in the past, we've been limited on bandwidth/knowledge to implement the switch.

Maybe we could leave things open-ended and focus on the dataflow we need now as a repo called sqlite-converter (or continue work on an existing one)?

Yes, I think this makes sense to do as well. We can create this repo, and start with a multi-table SQLite -> parquet implementation. IIRC, this repo would also contain some of the SQLite hygiene functions introduced in #203

Once we have this functionality, then we can effectively deprecate SQLite in future pycytominer releases by:

Adding the SQLite -> parquet tool to the existing workflow
Work towards deprecating SQLite from cytominer-database/cytominer-transport

Working within the contexts of multidimensional table(s) could be helpful for calculations and also for encapsulating the file-based version of the data. This wouldn't be without challenges, as it may completely change how operations are performed.

Is there a reason to not perform merges across tables prior to parquet writing? This to me seems like the simplest solution that wouldn't require much operational change.

I'm definitely open to more thoughts on this topic!

d33bs commented 2 years ago

Thank you @gwaybio !

Yes, I think this makes sense to do as well. We can create this repo, and start with a multi-table SQLite -> parquet implementation. IIRC, this repo would also contain some of the SQLite hygiene functions introduced in https://github.com/cytomining/pycytominer/pull/203

Whenever ready, I'd be excited to assist with this effort and include the hygiene functions as well. Are there any other considerations before creating the repo we should think through? Please don't hesitate to let me know if there's anything I may help with in getting started here.

Is there a reason to not perform merges across tables prior to parquet writing? This to me seems like the simplest solution that wouldn't require much operational change.

I can see some reasons to not merge prior to parquet writes (there may be others based on data implementation, etc.):

Multidimensional datasets can be large and may imply using technologies which are incompatible for some systems. (parallel procedures, atypical data formats, associated libraries etc.)
The understandability of the dataset might be decreased, making development work more challenging in some circumstances. We can provide documentation to this effect, but then we may still have learning curves to accommodate. This is I feel a subjective area where it's completely possible I'm off-base given the community and in-use data schema that may already be familiar to developers.
When targeting the equivalent of single table data from a multidimensional source, we may introduce the need for a different translational pre-step (instead of a "select * from ...") which could have additional performance and complexity costs.

All this said, I still feel there might be benefits to this approach. Maybe this becomes a flag in the sqlite-converter for a new multidimensional single-file format? If performance becomes an issue between multidimensional and 1-to-1, we might consider a "decompress" and "compress" step (decomposing the multidimensional file into it's parts for operations). It may be possible to do this decompress in-memory so as to use the existing dataframe structures as-is without the need for greater disk space.

gwaybio commented 2 years ago

Whenever ready, I'd be excited to assist with this effort and include the hygiene functions as well.

We are ready now! No time like the present :)

Are there any other considerations before creating the repo we should think through?

Hmm, I can only think of two:

Which organization the repo should live in
- My thoughts: Given the goal (to my understanding at least), is to create a generic package for SQLite conversion and hygiene, that will have initial motivation routed in our specific use case (but will be expanded later), I think you and @staylorx should decide. I'd be happy to host it in WayScience, but perhaps it's better off elsewhere.
Are we sure we aren't reinventing the wheel?
- My thoughts: If a well maintained repo with a good community already exists that performs similar functionality, would it make sense to contribute our specific needs there instead? I have not looked extensively.

I can see some reasons to not merge prior to parquet writes (there may be others based on data implementation, etc.):

Sorry, I don't think I understand your replies. I am talking about merging the individual per-cell measurements that exist in three separate tables (compartments) in the existing SQLite files prior to parquet writing. Why does performing this merge make the data multidimensional? Isn't it just a simple file that could just as well be a flat text file? What am I not understanding here? 😂

d33bs commented 2 years ago

Thank you @gwaybio - after thinking this over a bit and discussing with @staylorx, I feel it'd be best to store the conversion code under this repo for the time being until there's more clarity around the utility of code reuse. Getting the merged data schema in a desired state is likely to result in code that's specific to this project - which makes it a good candidate to be included here (for example, under cyto_utils).

Multidimensional - my apologies, I think I was getting carried away with another option. A merged dataset will be 2-dimensional and should enable us to keep things in a single file. Thanks as well for the great comments within #205 towards understanding the existing datasets. I'll plan to follow up within that issue for specifics when it comes to the fully merged data schema.

gwaybio commented 2 years ago

Sounds good @d33bs - I saw https://github.com/CUHealthAI/sqlite-clean in the CUHealthAI org. Is the plan to eventually migrate the conversion code here? (I'm just noting this here for decision tracking)

d33bs commented 2 years ago

Hi @gwaybio, great question, I'll try to clarify below:

sqlite-clean is intended to be a generalized tool for SQLite database linting and related fixes (similar to Python black --check and black <file> respectively) inspired by the conversations + findings from issues within this project. I see this as something we could eventually bring in as a dependency if desired, but we can also keep things decoupled for the sake of development flexibility in the meantime.
The conversion work mentioned in this issue I see as being a Pycytominer-specific data transformation tool, and something we can keep within this repo under cytoutils nearby or within sqlite.py. I've been working through code and performance when it comes to how to do this, an early sketch of how we could do this can be [found here (using only existing dependencies via Pandas)](https://github.com/CUHealthAI/pycytominer-performance/blob/main/src/pycytominer%23205/databaseframe_pandas_merged.py). This recycled some of the utilities from "cleaning" efforts, but doesn't make a hard requirement of them. It also enforces a number of presumptions (table joining, join keys, compartments) that are dedicated to the work here (making it less generalizable).

Some other notes towards progress:

Conversion performance - given how large the resulting single dataframe may be when working with >= 10GB SQLite files, I've been exploring how to use scalable in-memory structures. There are some additional "databaseframe's" I tried using Arrow and Ray but I found these may spur additional source SQLite database issues (mainly when converting SQLite to Arrow data). I'm uncertain yet if resolving these issues will result in value for Pycytominer, and more investigation is needed here.
A little aside, but wanted to share: the Arrow project has recently added join functionality which could prove useful for some of the performance challenges we've seen using pandas.DataFrames. This might be helpful when it comes to the conversion work here and could extend into other issue resolution.