Thanks for documenting this @axiomcura - passing this over to @staylorx to see if he and his team can take a look.
Thank you @gwaygenomics and @axiomcura (really, well done on the docs). We'll take a look.
I've invited @d33bs to take a look, and he's looking into memory profiling now.
Hi @gwaygenomics and @axiomcura - as @staylorx mentioned, I'm looking into memory profiling using SQ00014613.sqlite and the code sample provided in this issue (.merge_single_cells()). Thanks again for the excellent documentation and references within this issue!
When everything runs, something I'm noticing is that there are two executions of self.load_compartment(compartment=right_compartment). Does this seem correct given the data and function involved? I wanted to double check as there's a "right" and a "left" compartment referenced in the code, and we could possibly conserve memory if these remain static throughout the function's operation.
Yep, this is correct. The naming of "right" vs. "left" might not be the best convention, and could actually add confusion.
See here for an example of compartment_linking_cols, which is the object storing the right and left compartments.
Thank you for the clarification @gwaygenomics !
In running a few tests regarding this issue, I found that one potential improvement to conserve memory revolves around changing a couple of pd.DataFrame.rename calls to use copy=False, inplace=False (reference for this). I've made a small commit towards this effect here.
Running the code which @axiomcura posted with the above modification seemed to improve (lower) memory consumption by around 7 GB. That said, it's a bit of a tradeoff, as I also saw increased runtime of 8 or more minutes per run (up from ~28 minutes to ~36 minutes or longer to complete). Would this tradeoff be something worthwhile to consider in the project?
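For reference, a minimal sketch of the kind of change being described (not the actual pycytominer code; the frame and column names are made up):

```python
import pandas as pd

# Toy frame standing in for a compartment table; names are illustrative only.
df = pd.DataFrame({"ObjectNumber": [1, 2], "AreaShape_Area": [10.0, 12.5]})

# rename() copies the underlying data by default; copy=False relabels while
# sharing the existing data where possible, trading a little speed/safety for memory.
df = df.rename(
    columns={"ObjectNumber": "Metadata_ObjectNumber"}, copy=False, inplace=False
)
print(df.columns.tolist())
```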
I ran these tests using a GCP e2-highmem-8 machine (8 vCPU, 64 GB Mem). I analyzed memory performance using Memray. Memray provides some helpful flamegraph charts which can narrow down performance issues.
If you're interested in the results in flamegraph form, please take a look below. Note: these links unfortunately won't render as HTML directly from GitHub (they're converted to txt when attempting to view the raw contents).
During my investigation of this issue, I noticed two other areas which were major draws on memory:
There are multiple pd.DataFrame.merge calls taking place within the function that end up consuming a good deal of memory. It may be beneficial to consider indexing the dataframes involved prior to the merge, as per this reference point.
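As an illustration of the pre-indexing idea, here is a toy sketch (not pycytominer's actual merge logic; the tiny frames and values are made up):

```python
import pandas as pd

# The linking columns discussed in this thread.
keys = ["TableNumber", "ImageNumber", "ObjectNumber"]

cells = pd.DataFrame(
    {"TableNumber": [1, 1], "ImageNumber": [1, 1], "ObjectNumber": [1, 2],
     "Cells_AreaShape_Area": [100.0, 120.0]}
)
nuclei = pd.DataFrame(
    {"TableNumber": [1, 1], "ImageNumber": [1, 1], "ObjectNumber": [1, 2],
     "Nuclei_AreaShape_Area": [40.0, 55.0]}
)

# Joining on a prebuilt MultiIndex lets pandas align on the index rather than
# re-hashing the key columns inside each merge() call.
merged = (
    cells.set_index(keys)
    .join(nuclei.set_index(keys), how="inner")
    .reset_index()
)
print(merged)
```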
Another consideration would be to use a Pandas-API-compatible dataframe library which is optimized for parallelization or in other ways. Modin, Mars, and Polars come to mind here as some options (in addition to Vaex mentioned below). These open the possibility to replace pandas for the library itself, or to replace pandas functionality in isolated/intensive functions. It's worth noting they will incur additional but different I/O if used in an isolated manner (many include .[from|to]_pandas methods).
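A minimal sketch of that isolated-replacement pattern, assuming Polars (and pyarrow for the conversions) is available; the frames and columns are illustrative:

```python
import pandas as pd
import polars as pl

left = pd.DataFrame({"ImageNumber": [1, 1, 2], "Cells_AreaShape_Area": [10.0, 11.0, 9.5]})
right = pd.DataFrame({"ImageNumber": [1, 2], "Metadata_Well": ["A01", "A02"]})

# The from_pandas/to_pandas round trip is the additional I/O mentioned above:
# hand an intensive step to another engine, then convert back for the rest of the code.
joined = (
    pl.from_pandas(left)
    .join(pl.from_pandas(right), on="ImageNumber", how="inner")
    .to_pandas()
)
print(joined)
```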
pd.read_sql
There are multiple reads for the left or right compartments from the source datafile. It may be beneficial to consider reducing these reads (if possible) or switching to another file-based format altogether. Here, Vaex might be helpful to consider with its I/O options (HDF5, Arrow, Parquet, FITS, etc.). Pandas too has some of these alternatives built-in, but may not share the same performance. (A rough sketch of the read-once/convert idea follows the diagram below.)
I get the feeling there may be some tradeoffs involved with compatibility and existing datasets. To this effect, a preparatory or concluding conversion might be something to consider (see below as an example).
flowchart LR
sqlite[(*.sqlite file)] -->|sqlite_convert| converted["high-throughput\nformat #quot;htp file#quot;"]
converted --> work[pycytominer.work]
work --> |htp_convert| sqlite2[(*.sqlite file)]
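A rough sketch of that read-once/convert-once idea, assuming a local SQ00014613.sqlite with a Cells table and pyarrow installed; the table and file names are illustrative, not pycytominer internals:

```python
import sqlite3
import pandas as pd

# Read the compartment once from SQLite rather than re-querying it repeatedly.
with sqlite3.connect("SQ00014613.sqlite") as conn:
    cells = pd.read_sql("SELECT * FROM Cells", conn)

# A one-time conversion to a columnar format means later steps re-read the
# Parquet file instead of re-scanning the SQLite database.
cells.to_parquet("Cells.parquet", index=False)
cells_again = pd.read_parquet("Cells.parquet")
```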
I think you've captured our issues well @d33bs - thanks! I'll respond point-by-point with my perspectives.
Would this (memory/runtime) tradeoff be something worthwhile to consider in the project?
Probably, yes. Although, as you allude to in subsequent points, this is really just a bandaid for much larger issues.
It may be beneficial to consider indexing the dataframes involved prior to the merge
I think this is our best bet to resolve this issue without a major overhaul.
One challenge here is that we're using three columns to merge most compartments (["TableNumber", "ImageNumber", "ObjectNumber"]), and their uniqueness is not guaranteed. Will that be a problem for index merges?
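For reference, a quick toy sketch (made-up values, not pycytominer code) of how pandas treats repeated keys in an index-aligned join: matching rows fan out rather than raise an error, the same behavior as a column-based merge.

```python
import pandas as pd

keys = ["TableNumber", "ImageNumber", "ObjectNumber"]

# The left frame repeats the same key twice; the right frame has it once.
left = pd.DataFrame(
    {"TableNumber": [1, 1], "ImageNumber": [1, 1], "ObjectNumber": [5, 5], "x": [1.0, 2.0]}
).set_index(keys)
right = pd.DataFrame(
    {"TableNumber": [1], "ImageNumber": [1], "ObjectNumber": [5], "y": [3.0]}
).set_index(keys)

# Two rows in the result: y is repeated for each duplicated left key.
print(left.join(right, how="inner"))
```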
These open the possibility to replace pandas for the library itself, or to replace pandas functionality in isolated/intensive functions
As pycytominer is a pipeline tool, we are 100% on board with replacing pandas altogether. Version 0.1 and before required rapid development, and pandas was the right tool for the job at the time. It's no longer the right tool. However, this change might be beyond the scope of this specific issue.
Of the various package options you listed, do you have a sense of how we should decide?
I appreciate how you've broken down your response into these three core subtitles. You've really hit the nail on the head in terms of broader issues with the code base! Perhaps an even more dire need than replacing pandas is replacing SQLite. See some notes: cytomining/cytominer-database#96 and cytomining/cytominer-database#124.
The solution for replacing SQLite likely resides in a separate Python package: https://github.com/cytomining/cytominer-database or https://github.com/cytomining/cytominer-transport
Ideally, pycytominer should be able to handle multiple data input types.
In addition to Vaex, I'd also like to point to https://github.com/scverse/anndata as another option; one that might be closer to our data-specific scientific needs.
To this effect, a preparatory or concluding conversion might be something to consider
Please consider my comments above, and then determine how you'd wish to proceed. I'd love to continue our discussion.
We should be cognizant that this tool is used by other labs, and any discussions we have should be summarized back on GitHub.
Thanks!
Thank you for the great feedback and thoughts @gwaygenomics! There's a ton of awesome work in this repo, including the Pandas data management aspects. The pd.DataFrame.rename and pd.DataFrame.set_index areas are, I think, specific to Pandas and how it manages data. I'm uncertain whether the non-unique, multi-, and potential additional re-indexing will increase performance. Understanding that we may move away from Pandas and SQLite, this may also not be beneficial long-term.
There's a lot to consider with a dataframe management library change or migration. My feeling is that the choice would be best made based on the type of data involved and the environment in which pycytominer typically runs (resources, OS, etc.). That said, understanding what SQLite replacement(s) will be used will help navigate the change.
For the purposes of this specific issue, I plan to make a PR with the pd.DataFrame.rename update to keep our changes here small and practical for the time being. I'd like to continue to assist with the dataframe and data format topics in more isolated issues, as I agree these are a bit outside the scope here.
Appreciate you sharing the https://github.com/cytomining/cytominer-database and https://github.com/cytomining/cytominer-transport repos. Parquet can offer fantastic performance and compression of large datasets - it looks like there's some work already going on there towards this effect. https://github.com/scverse/anndata also looks interesting and may offer improvements here.
Many dataframe and other data management technologies are beginning to adopt Apache Arrow internally or as a deliverable. Because of Arrow's cross-compatibility and performance capabilities, it opens many options for data management which may not have been possible before. It also provides some flexibility in dependency selection now and in the future.
Arrow might offer the capability to keep the data format compatible across various file-based and in-mem formats while navigating incremental change (see below).
flowchart LR
sqlite[(SQLite Database)] --> |conversion required| arrow[Arrow Data]
dataset[(Arrow-compatible\nDataset)] --> |direct| arrow[Arrow Data]
arrow --> Pandas.DataFrame
arrow --> Dataframe.X
arrow --> Others["Others (language agnostic)"]
Pandas.DataFrame --> |deprecate| Pycytominer.work
Dataframe.X --> |future-facing\nfor performance| Pycytominer.work
Others --> |integrations or\nadditional fxnality?| Pycytominer.work
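A minimal sketch of Arrow acting as the common layer in the diagram above, assuming pyarrow is installed; the file and column names are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small Arrow table and persist it as Parquet (a one-time conversion).
table = pa.table({"ImageNumber": [1, 2], "Cells_AreaShape_Area": [10.0, 12.0]})
pq.write_table(table, "cells.parquet")

# The same Arrow data can be handed to pandas today, or to other Arrow-aware
# engines (Polars, DuckDB, etc.) later, without re-serializing the data.
arrow_table = pq.read_table("cells.parquet")
df = arrow_table.to_pandas()
```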
Thinking about Arrow's possibilities here and the SQLite file format more generally (a single-file RDBMS) made me think DuckDB might be useful here. This might balance portability with performance and the dependency flexibility I mentioned.
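A minimal sketch of that DuckDB thought, assuming the sqlite extension can be installed and loaded in the environment; the Cells table name is illustrative:

```python
import duckdb

con = duckdb.connect()

# The sqlite extension lets DuckDB query a SQLite file in place,
# without loading the whole database into memory first.
con.execute("INSTALL sqlite; LOAD sqlite;")
cells = con.execute(
    "SELECT * FROM sqlite_scan('SQ00014613.sqlite', 'Cells')"
).df()  # hands back a pandas DataFrame
```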
Welcome your thoughts with all the above, and excited to continue work towards these efforts!
Thanks @d33bs - I feel you have a very strong grasp on several potential directions we could take, and perhaps the next step is to define two deprecation strategies for pandas and sqlite.
We and others currently use pycytominer version 0.1 in many projects, and in pre-0.1 projects we used good versioning hygiene with GitHub hashes. Therefore, once we define some deprecation strategies, we should be all set to begin this development on a develop branch. Happy to have your insights and intuition here as well @d33bs!
A couple other miscellaneous thoughts:
- Keeping this work within pycytominer itself: one might be tempted to make another tool, but keeping this tied to the other processing methods likely keeps it out of the graveyard longer.
What would a helpful next step be?
Great points @gwaygenomics! I like the idea of working on a develop branch together to resolve some of these things. Sorry to hear about odo, it can be tough with data these days - things change quick! 🙂
Would it make sense to point #197 to this develop branch instead, so as to collect releasable changes?
I think, along with what's already been mentioned, we could focus on some general repo health enhancements as part of the develop work. Things like linting, expanded testing guidance, benchmark standardization, etc. will, I feel, help improve the endurance of the code and our ability to work quickly.
As next steps, how about the following?
My instinct is to work towards the SQLite file reads first, then work on the dataframe scalability. It may turn out that these are too tightly wound together to decouple, and maybe one will determine or resolve the other, or it would make sense to work the other way around.
Would it make sense to point https://github.com/cytomining/pycytominer/pull/197 to this develop branch instead, so as to collect releasable changes?
I think it is ok to merge into master prior, so let's leave it as is.
In regards to strategy, yes! I 💯 concur with the plan you've outlined. Three separate issues, SQLite first, determine how tightly wound SQLite and pandas are (my intuition is that they are not as tightly coupled as you might initially guess), and then start developing! 🛠️
Thank you!
Hi @gwaygenomics - I've begun adding some issues as per our conversations. I don't seem to have access to add branches to this repo; could I ask you or someone who has access to create the develop branch (or similar)? Thanks in advance!
Great! I just sent an invite to collaborate on this repo. You should have access now
@gwaygenomics Thank you! I now have the access I need for the branch.
It's a delight to read all the thoughts and ideas that have gone into this issue!
Do I understand correctly that the plan here is to undertake the big overhaul (outlined in https://github.com/cytomining/pycytominer/issues/195#issuecomment-1133329151), and that in turn will address this issue (with .merge_single_cells())?
Can anyone comment on whether we already have a proposal for a stopgap solution to the original problem reported in this issue? If so, I can look around for help to implement that solution. I read through the issue, but I couldn't figure that out.
Can anyone comment on whether we already have a proposal for a stopgap solution to the original problem reported in this issue?
We do not currently have a stopgap solution, nor is anyone (to my knowledge at least) working on finding one. In other words, our only approach right now is to sidestep it with the big overhaul.
If so, I can look around for help to implement that solution.
That sounds great!
We do not currently have a stopgap solution, nor is anyone (to my knowledge at least) working on finding one. In other words, our only approach right now is to sidestep it with the big overhaul.
Thanks for clarifying
@MarziehHaghighi had created a set of notebooks that capture her typical use cases for working with single-cell data
Specifically, this is how she loads all the single cell data from a SQLite file
@MarziehHaghighi do you recollect how long it takes to load a typical ~10GB SQLite file, as well as the memory requirements, using readSingleCellData_sqlalch?
I'm fairly confident that @bunnech's #219 solves the issue described in the title. However, I think that some notes here are worth preserving (somewhere other than a closed issue).
Therefore, we could either:
@d33bs and/or @axiomcura - Do you have any intuitions here?
Hi @gwaybio - very excited about #219 (amazing work @bunnech)!
I'd be in favor of closing the issue and continuing to reference it where appropriate. Below are some ideas for new issues based on the conversations here or in related PRs. Open to everyone's thoughts here.
- Performance/memory testing for .merge_single_cells() (or others! 🙂 ) (something like pytest-memray or similar): automate targeted performance tests to ensure operations within this project are usable by others as appropriate. Maybe this one spot was an outlier, but if there are other performance-heavy functions we could expand there as well. (A rough sketch follows this list.)
- Consistent handling of missing values (np.nan vs pd.NA vs None). These could be treated separately and/or on targeted deliverables. Some of this is, I think, outlined by #79 and maybe other issues.
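A rough sketch of the pytest-memray idea above, assuming the plugin is installed and run with pytest --memray; the fixture path and memory budget are hypothetical placeholders:

```python
import pytest
from pycytominer.cyto_utils.cells import SingleCells

# Fails the test if merge_single_cells() allocates more than the stated budget.
@pytest.mark.limit_memory("2 GB")
def test_merge_single_cells_stays_within_budget():
    sc = SingleCells("sqlite:///tests/data/small_fixture.sqlite")
    sc.merge_single_cells()
```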
I agree with @d33bs. Many ideas were expressed in this issue that can be their own separate tasks. We should close this issue once #219 is merged! Thank you @bunnech for your amazing work!
Thanks for documenting all the separate threads discussed in this project @d33bs and @axiomcura - I've created six separate issues (and linked relevant ones).
I think that it is safe to close this issue now, and reserve discussion on these topics in their respective threads.
Small update about the changes that were applied in #219
I ran a memory profile with the same code explained here to measure runtime and memory usage.
Total runtime was 8 minutes for a 10 GB file:
real 7m42.803s
user 6m34.286s
sys 0m59.974s
Generated memory profile:
Filename: merge_cells_mem_test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
3 163.305 MiB 163.305 MiB 1 @profile
4 def run():
5 163.305 MiB 0.000 MiB 1 sql_path = "SQ00014614.sqlite"
6 163.305 MiB 0.000 MiB 1 sql_url = f"sqlite:///{sql_path}"
7
8 309.531 MiB 146.227 MiB 2 sc_p = SingleCells(
9 163.305 MiB 0.000 MiB 1 sql_url,
10 163.305 MiB 0.000 MiB 1 strata=["Image_Metadata_Plate", "Image_Metadata_Well"],
11 163.305 MiB 0.000 MiB 1 image_cols=["TableNumber", "ImageNumber"],
12 163.305 MiB 0.000 MiB 1 fields_of_view_feature=[],
13 )
14
15 10758.113 MiB 10448.582 MiB 1 return sc_p.merge_single_cells()
The resulting output from merge_single_cells() has a similar size to the sqlite input, indicating that no memory leakage is occurring! Overall this is a great improvement!
Thanks a lot for reporting this @axiomcura (ps - I love the perfect level of detail in your issues!)
That's mindblowing -- bravo (again) @bunnech 🎉 !
I'll head over to #205 for a related topic.
Currently the .merge_single_cells() method in the SingleCells class uses an extraordinarily high amount of memory for a 10GB file. To replicate this:
The data used in this example can be downloaded from here. Specifically, I have downloaded the smallest file, which is the SQ00014613.sqlite file.
Memory benchmarking
To locate where the high memory consumption is occurring, I conducted a memory profile using memory_profiler.
I copied the .merge_single_cells() source code and placed it into a script. Then, I replaced self with the SingleCells instance variable, which is sc_p in this example. Below is the memory profile output:
The high consumption of memory is occurring within the .load_compartment() method (line 63), going from 298 MB to 21792 MB. Note that this profile was generated by only running one iteration (exit() statements at the bottom). The computer will start using all the memory when starting the next iteration.
Additional information
It seems several people have had similar issues using pandas to read database files:
Add doc note on memory usage of read_sql with chunksize
Loading SQL data into Pandas without running out of memory
https://github.com/pandas-dev/pandas/issues/40847.
There is an article that explains potential methods for optimizing I/O and memory usage with Postgres files, which may be applicable to the sqlite files used in pycytominer.
Machine Information
@gwaygenomics