gwaybio opened this issue 2 years ago
I think it would be good to loop in @shntnu as well here.
I can understand wanting to also support other data sources (i.e., Parquet), but is there a reason at this time to HAVE to sunset SQLite support? For better or for worse, for at least the foreseeable future it's what a lot of the "older" data sets in the field will have available in terms of single-cell data, so you'll either lose the ability to work with those sets with new and awesome pycytominer features past 0.1, or lose the ability to compare against them until you do conversions.
It doesn't look like SQLite is terribly woven throughout the codebase; from what I can tell, it's literally only in cells.py (plus some tests, and in collate, though we aren't sure that will live in this repo forever). Whether or not you decide to sunset SQLite anytime soon, my proposed solution would be to extract a "data reader" module or class that handles data reading and formatting, rather than having that live inside the SingleCells class. That way everything else in pycytominer remains source-agnostic, and you can easily add or remove source formats as you wish.
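To make that concrete, here's a minimal sketch of what such a source-agnostic reader layer might look like. The class names (`DataReader`, `SQLiteReader`, `ParquetReader`) and the method signature are hypothetical, not existing pycytominer API:

```python
# Hypothetical sketch only: class names and signatures are illustrative,
# not pycytominer's actual API.
from abc import ABC, abstractmethod

import pandas as pd


class DataReader(ABC):
    """Loads a single-cell compartment table from some backing store."""

    @abstractmethod
    def read_compartment(self, compartment: str) -> pd.DataFrame:
        ...


class SQLiteReader(DataReader):
    """Reads compartment tables from a CellProfiler-style SQLite file."""

    def __init__(self, sqlite_path: str):
        self.sqlite_path = sqlite_path

    def read_compartment(self, compartment: str) -> pd.DataFrame:
        import sqlite3

        with sqlite3.connect(self.sqlite_path) as conn:
            return pd.read_sql(f"SELECT * FROM {compartment}", conn)


class ParquetReader(DataReader):
    """Reads compartment tables from a directory of Parquet files."""

    def __init__(self, directory: str):
        self.directory = directory

    def read_compartment(self, compartment: str) -> pd.DataFrame:
        return pd.read_parquet(f"{self.directory}/{compartment}.parquet")
```

Under this design, SingleCells would accept any DataReader, so adding or dropping a source format never touches the profiling logic.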
Disclaimer: I've not been in the trenches for a while on this front :) @bethac07 will have a much better sense of ground realities than me.
I skimmed #198, and it sounds like the future alternative format is TBD. I'll assume this format is Parquet (for this discussion)
What SQLite support would you like to see in pycytominer? What should happen with SQLite support?
If we have a SQLite2Parquet converter, I think it's ok to sunset SQLite entirely (and it would be worth whatever rewrite pycytominer needs to make it happen)
When should this happen?
IMO as soon as a SQLite2Parquet converter has been written and tested, and https://github.com/cytomining/profiling-recipe/blob/d7012751cf24e9aa1325e099b03aaf74fe2578d6/profiles/profile.py#L54 has been updated to optionally convert the SQLite to Parquet
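For illustration only, here's a minimal sketch of what such a SQLite2Parquet converter could look like, assuming a table-per-file output layout and pandas with a Parquet engine (e.g. pyarrow) installed; the function name and layout are assumptions, not a settled design:

```python
import sqlite3
from pathlib import Path

import pandas as pd


def sqlite_to_parquet(sqlite_path: str, out_dir: str) -> None:
    """Dump every table in a SQLite file to its own Parquet file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(sqlite_path) as conn:
        tables = pd.read_sql(
            "SELECT name FROM sqlite_master WHERE type='table'", conn
        )["name"]
        for table in tables:
            # Convert one table at a time to bound memory use
            df = pd.read_sql(f'SELECT * FROM "{table}"', conn)
            df.to_parquet(out / f"{table}.parquet", index=False)
```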
what are your resource needs for future data types?
I didn't understand this question, but I think you're asking: what are the requirements for the SQLite replacement? Beth would have a much better handle on that
Thanks for this discussion! I'll respond to several comments below, but my overall takeaway on when to schedule a SQLite deprecation is: immediately, but with clear direction on how to do so.
but is there a reason at this time to HAVE to sunset SQLite support?
We'd like to make progress with pycytominer (and the full pipeline), and in order to do this, we need to move away from this file format. An alternative approach is to build a completely separate tool (using lessons learned in pycytominer), but we'd prefer to continue long-term support for CellProfiler-based Cell Painting analyses within cytomining.
My proposed solution whether or not you decide to sunset SQLite anytime soon or not would be to essentially extract out a "data reader" module or class that just handles data reading and formatting instead of having that be part of the SingleCells class
If we have a SQLite2Parquet converter, I think it's ok to sunset SQLite entirely (and it would be worth the whatever rewrite pycytominer needs to make it happen)
Yes! I think we should pursue both suggestions. This is the "how" feedback I was looking for! 👀
Let's develop a SQLite2XXX converter package (plus implement collate in the pipeline) and sunset SQLite in pycytominer completely. The pipeline will use the converter (if necessary), thus enabling long-term SQLite support while removing the SQLite dependency in pycytominer.
Let's also refactor to extract a data reader module from SingleCells. We can enforce data type inputs, and throw warnings / point to solutions if alternative data types are attempted.
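A rough sketch of what that enforcement could look like in a refactored reader; the supported set, warning text, and function name are all assumptions:

```python
import pathlib
import warnings

# Assumed supported formats; the real list would be a design decision
SUPPORTED_SUFFIXES = {".parquet", ".csv"}


def check_source(path: str) -> None:
    """Warn on deprecated inputs, reject unknown ones."""
    suffix = pathlib.Path(path).suffix.lower()
    if suffix == ".sqlite":
        warnings.warn(
            "SQLite inputs are deprecated; convert with the SQLite2XXX "
            "converter first.",
            DeprecationWarning,
        )
    elif suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported input type: {suffix!r}")
```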
I also just want to make sure we put this problem in a larger context: I think a SQLiteToX converter (where X is Parquet or whatever else) is a good idea no matter what, if only for legacy data.
But whatever we were going to build, we should also think about what we still expect to be the most common data source for a while, which is "per-site CSVs coming from CellProfiler". Doing CSV -> SQLite -> X is fine, but CSV -> X is obviously better, and now is the right time to talk about it if it helps in the selection of X and as we're thinking about new refactored data readers. I'm also hoping that this will finally be the year we do the big CellProfiler Export module overhaul, so hopefully we can write to X directly in some circumstances, but at least for the foreseeable future I think CSVs are a safe "everything should be able to read and write them" option.
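To make the CSV -> X idea concrete, here's a toy sketch of the direct path for one compartment, assuming one CSV per site sitting in a directory (the layout and function name are hypothetical):

```python
from pathlib import Path

import pandas as pd


def csvs_to_parquet(site_dir: str, out_path: str) -> None:
    """Concatenate per-site CellProfiler CSVs into one Parquet file."""
    frames = [
        pd.read_csv(csv_file)
        for csv_file in sorted(Path(site_dir).glob("*.csv"))
    ]
    pd.concat(frames, ignore_index=True).to_parquet(out_path, index=False)
```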
Doing CSV -> SQLite -> X is fine, but if it helps in the selection of X or if now is the right time to talk about it as we're thinking about new refactored data readers, CSV -> X is obviously better
Ah, in my note above I had kinda assumed that creating a CSV -> X tool is part of the plan here and that we'd never ever need to do CSV -> SQLite -> X.
@gwaybio is creating a CSV -> X tool (i.e. a cytominer-database replacement) a part of your plan here?
My straw proposal for the right thing to do here is to try to finish cytominer-transport. When we last left it 14 months ago, it was working but slow; I think I have some ideas on how to make it less slow.
is creating a CSV -> X tool (i.e. a cytominer-database replacement) a part of your plan here?
It's definitely part of the conversation, as it pertains to resource needs for future datasets.
I think finishing cytominer-transport is the way to go, and what we learn as we go should inform some pycytominer decisions.
The development of cytominer-transport, however, should not impact pycytominer progress. The plan to sunset SQLite in the manner described here https://github.com/cytomining/pycytominer/issues/202#issuecomment-1150211828 removes the development tangle that might prevent progress in one package vs. another.
I couldn't figure out if the sunset plan requires you to first decide what X should be.
Either way, very glad you're on top of this @gwaybio!
It will eventually; we don't know what X is yet, and I know @d33bs has a bunch of ideas :) (some descriptions in #195)
Just wanted to chime in here to offer up some ideas. There's an opportunity to do some great things in this project with scalable dataframes from an in-memory perspective, and many of these include or are compatible with Parquet files as exports. Parquet looks to be used in at least some of this project, which is great news for compatibility with these sorts of things 🙂 .
Parquet is pretty great when it comes to performance, file size, and compatibility with other projects. Its compatibility makes it reasonable to think we could convert from it once a better alternative is known, for example as a parquet2xxx converter. That said, would it make sense to somewhat formalize on Parquet as the file format for this data for the time being? Maybe we could leave things open-ended and focus on the dataflow we need now, as a repo called sqlite-converter (or continue work on an existing one)?
If this seems like a good idea, what type of data-to-file structure should we expect? Some options I can think of: a 1-to-1 table conversion (sqlite2parquet), or a multidimensional single-file format (more testing needed here). Working within the context of multidimensional table(s) could be helpful for calculations and also for encapsulating the file-based version of the data. This wouldn't be without challenges, as it may completely change how operations are performed. Ideally this would open up roads that otherwise might not exist, and where it's limiting, we could abstract or transform to a more familiar 2-dimensional representation of the data (maybe as a query, a class method, etc.).
A sketch of what this might look like:
| Unique Index | Cells | Cytoplasm | Nucleus |
| --- | --- | --- | --- |
| a1 | | | |
| ... | ... | ... | ... |
Some ideas in this space:
Thanks @d33bs! Some additional discussion:
Parquet is pretty great when it comes to performance, file size, and compatibility with other projects. Its compatibility makes it reasonable to think we could convert from it once a better alternative is known, for example as a parquet2xxx converter. That said, would it make sense to somewhat formalize on Parquet as the file format for this data for the time being?
Yes, I think this makes sense to do. We've deliberated on Parquet in the past, but we've been limited on bandwidth/knowledge to implement the switch.
Maybe we could leave things open-ended and focus on the dataflow we need now as a repo called sqlite-converter (or continue work on an existing one)?
Yes, I think this makes sense to do as well. We can create this repo, and start with a multi-table SQLite -> parquet implementation. IIRC, this repo would also contain some of the SQLite hygiene functions introduced in #203
Once we have this functionality, we can effectively deprecate SQLite in future pycytominer releases by:
Working within the contexts of multidimensional table(s) could be helpful for calculations and also for encapsulating the file-based version of the data. This wouldn't be without challenges, as it may completely change how operations are performed.
Is there a reason to not perform merges across tables prior to parquet writing? This to me seems like the simplest solution that wouldn't require much operational change.
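For concreteness, here is a sketch of the merge-then-write idea; the table names and the CellProfiler-style linking columns (ImageNumber, ObjectNumber, Cytoplasm_Parent_*) are assumptions that vary by pipeline:

```python
import sqlite3

import pandas as pd

# Read the three compartment tables (names assumed)
with sqlite3.connect("single_cells.sqlite") as conn:
    cells = pd.read_sql("SELECT * FROM Cells", conn)
    cytoplasm = pd.read_sql("SELECT * FROM Cytoplasm", conn)
    nuclei = pd.read_sql("SELECT * FROM Nuclei", conn)

# Cytoplasm links to its parent cell and nucleus objects, so join
# outward from it; overlapping column names pick up a suffix
merged = cytoplasm.merge(
    cells,
    left_on=["ImageNumber", "Cytoplasm_Parent_Cells"],
    right_on=["ImageNumber", "ObjectNumber"],
    suffixes=("", "_Cells"),
).merge(
    nuclei,
    left_on=["ImageNumber", "Cytoplasm_Parent_Nuclei"],
    right_on=["ImageNumber", "ObjectNumber"],
    suffixes=("", "_Nuclei"),
)
merged.to_parquet("single_cells.parquet", index=False)
```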
I'm definitely open to more thoughts on this topic!
Thank you @gwaybio !
Yes, I think this makes sense to do as well. We can create this repo, and start with a multi-table SQLite -> parquet implementation. IIRC, this repo would also contain some of the SQLite hygiene functions introduced in https://github.com/cytomining/pycytominer/pull/203
Whenever ready, I'd be excited to assist with this effort and include the hygiene functions as well. Are there any other considerations before creating the repo we should think through? Please don't hesitate to let me know if there's anything I may help with in getting started here.
Is there a reason to not perform merges across tables prior to parquet writing? This to me seems like the simplest solution that wouldn't require much operational change.
I can see some reasons to not merge prior to parquet writes (there may be others based on data implementation, etc.):
a merged dataset may mean always operating on the full data (akin to a "select * from ..."), which could have additional performance and complexity costs. All this said, I still feel there might be benefits to this approach. Maybe this becomes a flag in the sqlite-converter for a new multidimensional single-file format? If performance becomes an issue between multidimensional and 1-to-1, we might consider a "decompress" and "compress" step (decomposing the multidimensional file into its parts for operations). It may be possible to do this decompress in-memory so as to use the existing dataframe structures as-is, without the need for greater disk space.
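As a sketch of the in-memory "decompress" idea: because Parquet is columnar, one compartment can be pulled back out of a merged file without reading the rest. The column prefix below is an assumption about the merged schema:

```python
import pyarrow.parquet as pq

# Inspect the merged file's schema without loading any data
schema = pq.read_schema("single_cells.parquet")
cell_columns = [name for name in schema.names if name.startswith("Cells_")]

# Read back only the Cells columns; other compartments stay on disk
cells = pq.read_table(
    "single_cells.parquet", columns=cell_columns
).to_pandas()
```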
Whenever ready, I'd be excited to assist with this effort and include the hygiene functions as well.
We are ready now! No time like the present :)
Are there any other considerations before creating the repo we should think through?
Hmm, I can only think of two:
I can see some reasons to not merge prior to parquet writes (there may be others based on data implementation, etc.):
Sorry, I don't think I understand your replies. I am talking about merging the individual per-cell measurements that exist in three separate tables (compartments) in the existing SQLite files prior to parquet writing. Why does performing this merge make the data multidimensional? Isn't it just a simple file that could just as well be a flat text file? What am I not understanding here? 😂
Thank you @gwaybio - after thinking this over a bit and discussing with @staylorx, I feel it'd be best to store the conversion code under this repo for the time being until there's more clarity around the utility of code reuse. Getting the merged data schema in a desired state is likely to result in code that's specific to this project - which makes it a good candidate to be included here (for example, under cyto_utils).
Multidimensional - my apologies, I think I was getting carried away with another option. A merged dataset will be 2-dimensional and should enable us to keep things in a single file. Thanks as well for the great comments within #205 towards understanding the existing datasets. I'll plan to follow up within that issue for specifics when it comes to the fully merged data schema.
Sounds good @d33bs - I saw https://github.com/CUHealthAI/sqlite-clean in the CUHealthAI org. Is the plan to eventually migrate the conversion code here? (I'm just noting this here for decision tracking)
Hi @gwaybio, great question, I'll try to clarify below:
- [sqlite-clean](https://github.com/CUHealthAI/sqlite-clean) provides SQLite linting and fixing capabilities (similar to `black --check` and `black <file>`, respectively), inspired by the conversations + findings from issues within this project. I see this as something we could eventually bring in as a dependency if desired, but we can also keep things decoupled for the sake of development flexibility in the meantime.
- The conversion code itself can live in this repo (for example, under cyto_utils as `sqlite.py`). I've been working through code and performance when it comes to how to do this; an early sketch of how we could do this can be [found here (using only existing dependencies via Pandas)](https://github.com/CUHealthAI/pycytominer-performance/blob/main/src/pycytominer%23205/databaseframe_pandas_merged.py). This recycled some of the utilities from the "cleaning" efforts, but doesn't make a hard requirement of them. It also enforces a number of presumptions (table joining, join keys, compartments) that are dedicated to the work here (making it less generalizable).

Some other notes towards progress:
- `join` functionality could prove useful for some of the performance challenges we've seen using pandas.DataFrames. This might be helpful when it comes to the conversion work here and could extend into other issue resolution (a rough sketch follows below).
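As an illustration of that direction, a hedged sketch using pyarrow's Table.join (available in recent pyarrow releases) in place of pandas merges; the file names and join keys are assumptions:

```python
import pyarrow.parquet as pq

cells = pq.read_table("Cells.parquet")
cytoplasm = pq.read_table("Cytoplasm.parquet")

# Join compartments without materializing pandas DataFrames
merged = cytoplasm.join(
    cells,
    keys=["ImageNumber", "Cytoplasm_Parent_Cells"],
    right_keys=["ImageNumber", "ObjectNumber"],
    join_type="inner",
    right_suffix="_Cells",
)
pq.write_table(merged, "merged.parquet")
```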
In #198 @d33bs states:
It has been well documented in many previous issues (in this repo and others) that we need to move away from SQLite towards a more performant data type. This requires some level of deprecation (either complete or partial).
Let's use this issue to describe: (1) what should happen with SQLite support, (2) when this should happen, and (3) what our resource needs are for future data types.
I believe that we need more info from key stakeholders to help us answer. @bethac07 can provide much of this insight. @bethac07 - what are your resource needs for future data types? What SQLite support would you like to see in pycytominer?
Note @d33bs - I think that we can safely proceed with some elements of our plan in #198 (e.g. a function for cleaning existing SQLite, upgrading single cell merges, etc.). I don't want this issue to block any progress - let's use the detail outlined here as a guide on how to maintain critical SQLite support and when to release new pycytominer versions. Old pycytominer versions (e.g. v0.1) will retain full SQLite support.