Thanks for documenting this @axiomcura - passing this over to @staylorx to see if he and his team can take a look.
Thank you @gwaygenomics and @axiomcura (really, well done on the docs). We'll take a look.
I've invited @d33bs to take a look, and he's looking into memory profiling now.
Hi @gwaygenomics and @axiomcura - as @staylorx mentioned, I'm looking into memory profiling using SQ00014613.sqlite and the code sample provided in this issue (.merge_single_cells()). Thanks again for the excellent documentation and references within this issue!
When everything runs, something I'm noticing is that there are two executions of self.load_compartment(compartment=right_compartment). Does this seem correct given the data and function involved? I wanted to double check as there's a "right" and a "left" compartment referenced in the code, and we could possibly conserve memory if these remain static throughout the function's operation.
Yep, this is correct. The naming of "right" vs. "left" might not be the best convention, and could actually add confusion.
See here for an example of compartment_linking_cols, which is the object storing the right and left compartments.
Thank you for the clarification @gwaygenomics !
In running a few tests regarding this issue, I found that one potential improvement to conserve memory revolves around changing a couple of pd.DataFrame.rename calls to use copy=False, inplace=False (reference for this). I've made a small commit towards this effect here.
Running the code which @axiomcura posted with the above modification seemed to improve (lower) memory consumption by around 7 GB. That said, it's a bit of a tradeoff, as I also saw increased runtime of 8 or more minutes per run (up from ~28 minutes to ~36 minutes or longer to complete). Would this tradeoff be something worthwhile to consider in the project?
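For reference, a minimal sketch of the kind of change being described (not the actual pycytominer code; the frame and column names are made up):

```python
import pandas as pd

# Toy frame standing in for a compartment table; names are illustrative only.
df = pd.DataFrame({"ObjectNumber": [1, 2], "AreaShape_Area": [10.0, 12.5]})

# rename() copies the underlying data by default; copy=False relabels while
# sharing the existing data where possible, trading a little speed/safety for memory.
df = df.rename(
    columns={"ObjectNumber": "Metadata_ObjectNumber"}, copy=False, inplace=False
)
print(df.columns.tolist())
```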
I ran these tests using a GCP e2-highmem-8 machine (8 vCPU, 64 GB Mem). I analyzed memory performance using Memray. Memray provides some helpful flamegraph charts which can narrow down performance issues.
If you're interested in the results in flamegraph form, please take a look below. Note: these links unfortunately won't render as HTML directly from GitHub (they're converted to txt when attempting to view the raw contents).
During my investigation of this issue, I noticed two other areas which were major draws on memory:
There are multiple pd.DataFrame.merge calls taking place within the function that end up consuming a good deal of memory. It may be beneficial to consider indexing the dataframes involved prior to the merge, as per this reference point.
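As an illustration of the pre-indexing idea, here is a toy sketch (not pycytominer's actual merge logic; the tiny frames and values are made up):

```python
import pandas as pd

# The linking columns discussed in this thread.
keys = ["TableNumber", "ImageNumber", "ObjectNumber"]

cells = pd.DataFrame(
    {"TableNumber": [1, 1], "ImageNumber": [1, 1], "ObjectNumber": [1, 2],
     "Cells_AreaShape_Area": [100.0, 120.0]}
)
nuclei = pd.DataFrame(
    {"TableNumber": [1, 1], "ImageNumber": [1, 1], "ObjectNumber": [1, 2],
     "Nuclei_AreaShape_Area": [40.0, 55.0]}
)

# Joining on a prebuilt MultiIndex lets pandas align on the index rather than
# re-hashing the key columns inside each merge() call.
merged = (
    cells.set_index(keys)
    .join(nuclei.set_index(keys), how="inner")
    .reset_index()
)
print(merged)
```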
Another consideration would be to use a Pandas-API-compatible dataframe library which is optimized for parallelization or in other ways. Modin, Mars, and Polars come to mind here as some options (in addition to Vaex mentioned below). These open the possibility to replace pandas for the library itself, or to replace pandas functionality in isolated/intensive functions. It's worth noting they will incur additional but different I/O if used in an isolated manner (many include .[from|to]_pandas methods).
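A minimal sketch of that isolated-replacement pattern, assuming Polars (and pyarrow for the conversions) is available; the frames and columns are illustrative:

```python
import pandas as pd
import polars as pl

left = pd.DataFrame({"ImageNumber": [1, 1, 2], "Cells_AreaShape_Area": [10.0, 11.0, 9.5]})
right = pd.DataFrame({"ImageNumber": [1, 2], "Metadata_Well": ["A01", "A02"]})

# The from_pandas/to_pandas round trip is the additional I/O mentioned above:
# hand an intensive step to another engine, then convert back for the rest of the code.
joined = (
    pl.from_pandas(left)
    .join(pl.from_pandas(right), on="ImageNumber", how="inner")
    .to_pandas()
)
print(joined)
```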
pd.read_sql
There are multiple reads for the left or right compartments from the source datafile. It may be beneficial to consider reducing these reads (if possible) or switching to another file-based format altogether. Here, Vaex might be helpful to consider with its I/O options (HDF5, Arrow, Parquet, FITS, etc.). Pandas too has some of these alternatives built-in, but may not share the same performance. (A rough sketch of the read-once/convert idea follows the diagram below.)
I get the feeling there may be some tradeoffs involved with compatibility and existing datasets. To this effect, a preparatory or concluding conversion might be something to consider (see below as an example).
flowchart LR
sqlite[(*.sqlite file)] -->|sqlite_convert| converted["high-throughput\nformat #quot;htp file#quot;"]
converted --> work[pycytominer.work]
work --> |htp_convert| sqlite2[(*.sqlite file)]
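A rough sketch of that read-once/convert-once idea, assuming a local SQ00014613.sqlite with a Cells table and pyarrow installed; the table and file names are illustrative, not pycytominer internals:

```python
import sqlite3
import pandas as pd

# Read the compartment once from SQLite rather than re-querying it repeatedly.
with sqlite3.connect("SQ00014613.sqlite") as conn:
    cells = pd.read_sql("SELECT * FROM Cells", conn)

# A one-time conversion to a columnar format means later steps re-read the
# Parquet file instead of re-scanning the SQLite database.
cells.to_parquet("Cells.parquet", index=False)
cells_again = pd.read_parquet("Cells.parquet")
```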
I think you've captured our issues well @d33bs - thanks! I'll respond point-by-point with my perspectives.
Would this (memory/runtime) tradeoff be something worthwhile to consider in the project?
Probably, yes. Although, as you allude to in subsequent points, this is really just a bandaid for much larger issues.
It may be beneficial to consider indexing the dataframes involved prior to the merge
I think this is our best bet to resolve this issue without a major overhaul.
One challenge here is that we're using three columns to merge most compartments (["TableNumber", "ImageNumber", "ObjectNumber"]), and their uniqueness is not guaranteed. Will that be a problem for index merges?
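For reference, a quick toy sketch (made-up values, not pycytominer code) of how pandas treats repeated keys in an index-aligned join: matching rows fan out rather than raise an error, the same behavior as a column-based merge.

```python
import pandas as pd

keys = ["TableNumber", "ImageNumber", "ObjectNumber"]

# The left frame repeats the same key twice; the right frame has it once.
left = pd.DataFrame(
    {"TableNumber": [1, 1], "ImageNumber": [1, 1], "ObjectNumber": [5, 5], "x": [1.0, 2.0]}
).set_index(keys)
right = pd.DataFrame(
    {"TableNumber": [1], "ImageNumber": [1], "ObjectNumber": [5], "y": [3.0]}
).set_index(keys)

# Two rows in the result: y is repeated for each duplicated left key.
print(left.join(right, how="inner"))
```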
These open the possibility to replace pandas for the library itself, or to replace pandas functionality in isolated/intensive functions
As pycytominer is a pipeline tool, we are 100% on board with replacing pandas altogether. Version 0.1 and before required rapid development, and pandas was the right tool for the job at the time. It's no longer the right tool. However, this change might be beyond the scope of this specific issue.
Of the various package options you listed, do you have a sense of how we should decide?
I appreciate how you've broken down your response into these three core subtitles. You've really hit the nail on the head in terms of broader issues with the code base! Perhaps an even more dire need than replacing pandas is replacing SQLite. See some notes: cytomining/cytominer-database#96 and cytomining/cytominer-database#124.
The solution for replacing SQLite likely resides in a separate Python package: https://github.com/cytomining/cytominer-database or https://github.com/cytomining/cytominer-transport
Ideally, pycytominer should be able to handle multiple data input types.
In addition to Vaex, I'd also like to point to https://github.com/scverse/anndata as another option; one that might be closer to our data-specific scientific needs.
To this effect, a preparatory or concluding conversion might be something to consider
Please consider my comments above, and then determine how you'd wish to proceed. I'd love to continue our discussion.
We should be cognizant that this tool is used by other labs, and any discussions we have should be summarized back on GitHub.
Thanks!
Thank you for the great feedback and thoughts @gwaygenomics! There's a ton of awesome work in this repo, including the Pandas data management aspects. The pd.DataFrame.rename and pd.DataFrame.set_index areas are, I think, specific to Pandas and how it manages data. I'm uncertain whether the non-unique, multi-, and potential additional re-indexing will increase performance. Understanding that we may move away from Pandas and SQLite, this may also not be beneficial long-term.
There's a lot to consider with a dataframe management library change or migration. My feeling is that the choice would be best made based on the type of data involved and the environment in which pycytominer typically runs (resources, OS, etc.). That said, understanding what SQLite replacement(s) will be used will help navigate the change.
For the purposes of this specific issue, I plan to make a PR with the pd.DataFrame.rename update to keep our changes here small and practical for the time being. I'd like to continue to assist with the dataframe and data format topics in more isolated issues, as I agree these are a bit outside the scope here.
Appreciate you sharing the https://github.com/cytomining/cytominer-database and https://github.com/cytomining/cytominer-transport repos. Parquet can offer fantastic performance and compression of large datasets - it looks like there's some work already going on there towards this effect. https://github.com/scverse/anndata also looks interesting and may offer improvements here.
Many dataframe and other data management technologies are beginning to adopt Apache Arrow internally or as a deliverable. Because of Arrow's cross-compatibility and performance capabilities, it opens many options for data management which may not have been possible before. It also provides some flexibility in dependency selection now and in the future.
Arrow might offer the capability to keep the data format compatible across various file-based and in-mem formats while navigating incremental change (see below).
flowchart LR
sqlite[(SQLite Database)] --> |conversion required| arrow[Arrow Data]
dataset[(Arrow-compatible\nDataset)] --> |direct| arrow[Arrow Data]
arrow --> Pandas.DataFrame
arrow --> Dataframe.X
arrow --> Others["Others (language agnostic)"]
Pandas.DataFrame --> |deprecate| Pycytominer.work
Dataframe.X --> |future-facing\nfor performance| Pycytominer.work
Others --> |integrations or\nadditional fxnality?| Pycytominer.work
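A minimal sketch of Arrow acting as the common layer in the diagram above, assuming pyarrow is installed; the file and column names are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small Arrow table and persist it as Parquet (a one-time conversion).
table = pa.table({"ImageNumber": [1, 2], "Cells_AreaShape_Area": [10.0, 12.0]})
pq.write_table(table, "cells.parquet")

# The same Arrow data can be handed to pandas today, or to other Arrow-aware
# engines (Polars, DuckDB, etc.) later, without re-serializing the data.
arrow_table = pq.read_table("cells.parquet")
df = arrow_table.to_pandas()
```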
Thinking about Arrow's possibilities here and the SQLite file format more generally (a single-file RDBMS) made me think DuckDB might be useful here. This might balance portability with performance and the dependency flexibility I mentioned.
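A minimal sketch of that DuckDB thought, assuming the sqlite extension can be installed and loaded in the environment; the Cells table name is illustrative:

```python
import duckdb

con = duckdb.connect()

# The sqlite extension lets DuckDB query a SQLite file in place,
# without loading the whole database into memory first.
con.execute("INSTALL sqlite; LOAD sqlite;")
cells = con.execute(
    "SELECT * FROM sqlite_scan('SQ00014613.sqlite', 'Cells')"
).df()  # hands back a pandas DataFrame
```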
Welcome your thoughts with all the above, and excited to continue work towards these efforts!
Thanks @d33bs - I feel you have a very strong grasp on several potential directions we could take, and perhaps the next step is to define two deprecation strategies for pandas and sqlite.
We and others currently use pycytominer version 0.1 in many projects, and in pre-0.1 projects we used good versioning hygiene with GitHub hashes. Therefore, once we define some deprecation strategies, we should be all set to begin this development on a develop branch. Happy to have your insights and intuition here as well @d33bs!
A couple other miscellaneous thoughts:
- Keeping this work within pycytominer itself: one might be tempted to make another tool, but keeping this tied to the other processing methods likely keeps it out of the graveyard longer.
What would a helpful next step be?
Great points @gwaygenomics! I like the idea of working on a develop branch together to resolve some of these things. Sorry to hear about odo, it can be tough with data these days - things change quick! 🙂
Would it make sense to point #197 to this develop branch instead, so as to collect releasable changes?
I think, along with what's already been mentioned, we could focus on some general repo health enhancements as part of the develop work. Things like linting, expanded testing guidance, benchmark standardization, etc. will, I feel, help improve the endurance of the code and our ability to work quickly.
As next steps, how about the following?
My instinct is to work towards the SQLite file reads first, then work on the dataframe scalability. It may turn out that these are too tightly wound together to decouple, and maybe one will determine or resolve the other, or it would make sense to work the other way around.
Would it make sense to point https://github.com/cytomining/pycytominer/pull/197 to this develop branch instead, so as to collect releasable changes?
I think it is ok to merge into master prior, so let's leave it as is.
In regards to strategy, yes! I 💯 concur with the plan you've outlined. Three separate issues, SQLite first, determine how tightly wound SQLite and pandas are (my intuition is that they are not as tightly coupled as you might initially guess), and then start developing! 🛠️
Thank you!
Hi @gwaygenomics - I've begun adding some issues as per our conversations. I don't seem to have access to add branches to this repo; could I ask you or someone who has access to create the develop branch (or similar)? Thanks in advance!
Great! I just sent an invite to collaborate on this repo. You should have access now
@gwaygenomics Thank you! I now have the access I need for the branch.
It's a delight to read all the thoughts and ideas that have gone into this issue!
Do I understand correctly that the plan here is to undertake the big overhaul (outlined in https://github.com/cytomining/pycytominer/issues/195#issuecomment-1133329151), and that in turn will address this issue (with .merge_single_cells())?
Can anyone comment on whether we already have a proposal for a stopgap solution to the original problem reported in this issue? If so, I can look around for help to implement that solution. I read through the issue, but I couldn't figure that out.
Can anyone comment on whether we already have a proposal for a stopgap solution to the original problem reported in this issue?
We do not currently have a stopgap solution, nor is anyone (to my knowledge at least) working on finding one. In other words, our only approach right now is to sidestep it with the big overhaul.
If so, I can look around for help to implement that solution.
That sounds great!
We do not currently have a stopgap solution, nor is anyone (to my knowledge at least) working on finding one. In other words, our only approach right now is to sidestep it with the big overhaul.
Thanks for clarifying
@MarziehHaghighi had created a set of notebooks that capture her typical use cases for working with single-cell data
Specifically, this is how she loads all the single cell data from a SQLite file
@MarziehHaghighi do you recollect how long it takes to load a typical ~10GB SQLite file, as well as the memory requirements, using readSingleCellData_sqlalch?
I'm fairly confident that @bunnech's #219 solves the issue described in the title. However, I think that some notes here are worth preserving (somewhere other than a closed issue).
Therefore, we could either:
@d33bs and/or @axiomcura - Do you have any intuitions here?
Hi @gwaybio - very excited about #219 (amazing work @bunnech)!
I'd be in favor of closing the issue and continuing to reference it where appropriate. Below are some ideas for new issues based on the conversations here or in related PRs. Open to everyone's thoughts here.
- Performance/memory testing for .merge_single_cells() (or others! 🙂 ) (something like pytest-memray or similar): automate targeted performance tests to ensure operations within this project are usable by others as appropriate. Maybe this one spot was an outlier, but if there are other performance-heavy functions we could expand there as well. (A rough sketch follows this list.)
- Consistent handling of missing values (np.nan vs pd.NA vs None). These could be treated separately and/or on targeted deliverables. Some of this is, I think, outlined by #79 and maybe other issues.
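A rough sketch of the pytest-memray idea above, assuming the plugin is installed and run with pytest --memray; the fixture path and memory budget are hypothetical placeholders:

```python
import pytest
from pycytominer.cyto_utils.cells import SingleCells

# Fails the test if merge_single_cells() allocates more than the stated budget.
@pytest.mark.limit_memory("2 GB")
def test_merge_single_cells_stays_within_budget():
    sc = SingleCells("sqlite:///tests/data/small_fixture.sqlite")
    sc.merge_single_cells()
```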
I agree with @d33bs. Many ideas were expressed in this issue that can be their own separate tasks. We should close this issue once #219 is merged! Thank you @bunnech for your amazing work!
Thanks for documenting all the separate threads discussed in this project @d33bs and @axiomcura - I've created six separate issues (and linked relevant ones).
I think that it is safe to close this issue now, and reserve discussion on these topics in their respective threads.
Small update about the changes that were applied in #219
I ran a memory profile with the same code explained here to measure runtime and memory usage.
Total runtime was 8 minutes for a 10 GB file:
real 7m42.803s
user 6m34.286s
sys 0m59.974s
Generated memory profile:
Filename: merge_cells_mem_test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
3 163.305 MiB 163.305 MiB 1 @profile
4 def run():
5 163.305 MiB 0.000 MiB 1 sql_path = "SQ00014614.sqlite"
6 163.305 MiB 0.000 MiB 1 sql_url = f"sqlite:///{sql_path}"
7
8 309.531 MiB 146.227 MiB 2 sc_p = SingleCells(
9 163.305 MiB 0.000 MiB 1 sql_url,
10 163.305 MiB 0.000 MiB 1 strata=["Image_Metadata_Plate", "Image_Metadata_Well"],
11 163.305 MiB 0.000 MiB 1 image_cols=["TableNumber", "ImageNumber"],
12 163.305 MiB 0.000 MiB 1 fields_of_view_feature=[],
13 )
14
15 10758.113 MiB 10448.582 MiB 1 return sc_p.merge_single_cells()
The resulting output from merge_single_cells() has a similar size to the sqlite input, indicating that no memory leakage is occurring! Overall this is a great improvement!
Thanks a lot for reporting this @axiomcura (ps - I love the perfect level of detail in your issues!)
That's mindblowing -- bravo (again) @bunnech 🎉 !
I'll head over to #205 for a related topic.
Currently the .merge_single_cells() method in the SingleCells class uses an extraordinarily high amount of memory for a 10GB file. To replicate this:
The data used in this example can be downloaded from here. Specifically, I have downloaded the smallest file, which is the SQ00014613.sqlite file.
Memory benchmarking
To locate where the high memory consumption is occurring, I conducted a memory profile using memory_profiler.
I copied the .merge_single_cells() source code and placed it into a script. Then, I replaced self with the SingleCells instance variable, which is sc_p in this example. Below is the memory profile output:
The high consumption of memory is occurring within the .load_compartment() method (line 63), going from 298 MB to 21792 MB. Note that this profile was generated by only running one iteration (exit() statements at the bottom). The computer will start using all the memory when starting the next iteration.
Additional information
It seems several people have had similar issues using pandas to read database files:
Add doc note on memory usage of read_sql with chunksize
Loading SQL data into Pandas without running out of memory
https://github.com/pandas-dev/pandas/issues/40847.
There is an article that explains potential methods for optimizing I/O and memory usage with Postgres files, which may be applicable to the sqlite files used in pycytominer.
Machine Information
@gwaygenomics