[Enhancement] Improve performance/scaling by enabling an optional install of modin.

kenibrewer commented 1 year ago

Certain pycytominer functions run into scaling issues for extremely large datasets. This can be solved by enabling an optional install of modin which is a drop-in replacement for pandas that leverages either ray, dask, or unidist backend engines to automatically shard data and parallelize operations across many cpus or instances.

When I first started using pycytominer, I didn't realize that feature-selection was supposed to happen against well-aggregated profiles and I attempted to run feature-selection on a profiles dataframe with millions of rows. I let the feature_selection step run for 4 hours before killing it. To solve the "problem" I forked pycytominer and did simple "import modin.pandas as pd" replacements in all the files. When I tried to re-run feature_selection it completed in under 60 seconds after it properly scaled to all 16 cpus of the instance I was using. I later realized my mistake in approach and left the branch by the wayside.

There were some problems with my quickly hacked together solution (especially with annotate) and there were problems with certain functions that manually detect datatypes (e.g. if type(obj) == pd.Dataframe checks). But it may be worth systematically tackling those issues to get this functionality to work.

gwaybio commented 1 year ago

sounds great @kenibrewer! I think this would be a very much welcomed enhancement.

We have also considered adding pyarrow support (via pandas 2.0.1 https://pandas.pydata.org/docs/user_guide/pyarrow.html#i-o-reading)

Do you have any thoughts on how to proceed with this development? For example, maybe we could use a specific develop branch (e.g. develop-parallel) and all work toward this goal could be placed there?

I think this development is also quite timely, given the single-cell emphasis @d33bs and others are implementing over in CytoTable. https://github.com/cytomining/CytoTable

d33bs commented 1 year ago

I think Modin could be very potent depending on how it's used within Pycytominer. One thing I recommend keeping in mind is the variance of support for especially the pd.DataFrame API through Modin. Many methods have only Ray or Dask support, or somewhat different implementations depending on the engine which is used. Some of the implementations might also involve rewriting existing code to match the expectations of Modin (or Dask / Ray)(for example, see merge on this page).

Pandas over Arrow perspective: to my knowledge based on some digging, there's no easy "drop-in" replacement for arrow with Pandas dataframes right now. My hope was there'd be an option, something like pd.options.dtype_backend = "pyarrow". This doesn't seem to exist yet.

Instead this appears to be dataframe-to-dataframe via Pandas reader/writer settings, meaning we could do something like overload readers and writers with new defaults for dtype_backend or engine. This feels complex, and is I feel likely to evolve in the near future.

Some related links below:

cytomining / pycytominer

[Enhancement] Improve performance/scaling by enabling an optional install of modin. #281