add a configurable chunksize whenever loading

bethac07 commented 3 years ago

We have seen in pycytominer that using a chunksize parameter on pandas.read_sql averts runaway memory consumption errors; we are hoping doing so here for read_csv has the same effect!

A new read_csvs_with_chunksize function is added her to io_utils, and all other pd.read_csv calls are replaced with it. Right now we are not actually configuring what size chunks to use anywhere, but I've left it configurable in case in future user testing we find it works better with different sizes for different calls.

I've also explicitly not done any error handling other than raising, simply because in various places in the script sometimes even the same error class is sometimes breaking for the recipe and sometimes not, and I thought we should continue to just support the logic where it currently is.

bethac07 commented 3 years ago

@gwaygenomics your requested changes are made. It would be great to get you or @MerajRamezani or @mlozada21 or anyone else who runs the recipe to test this to see if things are sad before we pull, because I currently have not been able to test this.

RE: Always using it- I tested reading on a 2080 row by 1023 column CSV; reading with no chunk size at all took 0.402s, with a chunk size of 3000 (aka bigger than the file) it took 0.404s, a size of 1000 it took ~0.5s, with a size of 100 it took 1.3 seconds. So with the proposed default of 10K chunksize, we should see no performance hit at all on smallish files, it should just be a "guardrail" for really big files. I think we could probably safely drop it down smaller than 10K as the default, but that's the sort of thing I think would be useful to do some testing with.

bethac07 commented 3 years ago

For funsies, timeit of using this on a 47K row vs 1K column CSV

no chunking - 7.94 s ± 222 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) chunksize 100 - 32 s ± 1.08 s per loop (mean ± std. dev. of 7 runs, 1 loop each) chunksize 1K - 10.3 s ± 360 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) chunksize 10K - 8.57 s ± 554 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

MerajRamezani commented 3 years ago

great to see this enhancement. I won't be able to play around with this until next week (will need to play catch up after my vacation ends) but I'll be happy to test it out.

But @MerajRamezani maybe you'll get to this before me? It should be a fairly simple test - try running CPXXX well level profiles (we don't talk about numbers in this public repo!) but go into the recipe, git pull, and git checkout chunksize (the branch @bethac07 is editing - then try rerunning one of the profiling steps, maybe aggregation)

Also, @bethac07 for that timeit test, can you add a code block of what you did to generate those numbers?

@gwaygenomics @bethac07 I can give it a try and update you on how it goes by the end of the week.

bethac07 commented 3 years ago

Yup, good catch.

RE: timeit , within jupyter -

Cell1 - configuration

import pandas as pd
filename = 'path/to/large/file.csv'
chunksize = 100

Cell 2- test without chunksize

%%timeit 
df = pandas.read_csv(filename)

Cell 3- test with chunksize (which can be configured in Cell 1

%%timeit
with pandas.read_csv(filename,chunksize=chunksize) as reader:
    dflist=[]
    for chunk in reader:
        dflist.append(chunk)
    df = pandas.concat(dflist)

MerajRamezani commented 3 years ago

@bethac07 @gwaygenomics I tried to run the updated branch but I am getting this error:

Traceback (most recent call last): File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3081, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'level_3'

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "recipe/0.preprocess-sites/4.image-and-segmentation-qc.py", line 511, in cp_sat_df[["cat", "type", "Ch"]] = cp_sat_df["level_3"].str.split( File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/frame.py", line 3024, in getitem indexer = self.columns.get_loc(key) File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3083, in get_loc raise KeyError(key) from err KeyError: 'level_3' Config file set to perform=False, not performing step: profile--single_cell Traceback (most recent call last): File "recipe/1.generate-profiles/1.aggregate.py", line 14, in from io_utils import read_csvs_with_chunksize ModuleNotFoundError: No module named 'io_utils' Traceback (most recent call last): File "recipe/1.generate-profiles/2.normalize.py", line 13, in from io_utils import read_csvs_with_chunksize ModuleNotFoundError: No module named 'io_utils' Traceback (most recent call last): File "recipe/1.generate-profiles/3.feature-select.py", line 13, in from io_utils import read_csvs_with_chunksize ModuleNotFoundError: No module named 'io_utils'

bethac07 commented 3 years ago

I think I see the issue and pushed a patch, can you pull and try again? Thanks!

On Wed, Jul 14, 2021 at 3:23 PM MerajRamezani @.***> wrote:

@bethac07 https://github.com/bethac07 @gwaygenomics https://github.com/gwaygenomics I tried to run the updated branch but I am getting this error:

Traceback (most recent call last): File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3081, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'level_3'

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "recipe/0.preprocess-sites/4.image-and-segmentation-qc.py", line 511, in cp_sat_df[["cat", "type", "Ch"]] = cp_sat_df["level_3"].str.split( File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/frame.py", line 3024, in getitem indexer = self.columns.get_loc(key) File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3083, in get_loc raise KeyError(key) from err KeyError: 'level_3' Config file set to perform=False, not performing step: profile--single_cell Traceback (most recent call last): File "recipe/1.generate-profiles/1.aggregate.py", line 14, in from io_utils import read_csvs_with_chunksize ModuleNotFoundError: No module named 'io_utils' Traceback (most recent call last): File "recipe/1.generate-profiles/2.normalize.py", line 13, in from io_utils import read_csvs_with_chunksize ModuleNotFoundError: No module named 'io_utils' Traceback (most recent call last): File "recipe/1.generate-profiles/3.feature-select.py", line 13, in from io_utils import read_csvs_with_chunksize ModuleNotFoundError: No module named 'io_utils'

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/pooled-cell-painting-profiling-recipe/pull/78#issuecomment-880148783, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABTI724AVDFU2EG57CF6733TXXP23ANCNFSM5ADD5AUA .

-- Beth Cimini, PhD Senior Group Leader & CZI Imaging Scientist Imaging Platform, Broad Institute 415 Main St Room 5011 Cambridge, MA 02142 Current office number- (617) 714-8189 Pronouns - She/her/hers I will sometimes send or respond to emails outside of my local office hours, but I never expect responses outside of your local office hours.

MerajRamezani commented 3 years ago

Great, I have started a new test and it seems to be running smoothly for now. I am going to update you as soon as I have the results.

MerajRamezani commented 3 years ago

@bethac07 @gwaygenomics The new code failed to aggregate profiles on a m5.4xlarge machine but ran without an issue on a m5.24xlarge machine. I can test it on another machine as well if relevant(let me know a specific one!). Here is the log and where it failed:

Config file set to perform=False, not performing step: preprocess--prefilter Config file set to perform=False, not performing step: preprocess--process-spots Config file set to perform=False, not performing step: preprocess--process-cells Config file set to perform=False, not performing step: preprocess--summarize-cells /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/plate_layout_cells_count_per_well_CP228A.png exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/plate_layout_cells_count_per_well_CP228B.png exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/plate_layout_ratios_per_well_CP228A.png exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/plate_layout_percent_empty_cells_per_well_CP228A.png exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/plate_layout_ratios_per_well_CP228B.png exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/plate_layout_percent_empty_cells_per_well_CP228B.png exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/plate_layout_BC_to_CP_DAPI_correlation_per_well_CP228A.png exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/plate_layout_BC_to_CP_DAPI_correlation_per_well_CP228B.png exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/plate_layout_Cells_FinalThreshold_per_well_CP228A.png exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/plate_layout_Cells_FinalThreshold_per_well_CP228B.png exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/plate_layout_Nuclei_FinalThreshold_per_well_CP228A.png exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/plate_layout_Nuclei_FinalThreshold_per_well_CP228B.png exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/plate_layout_PercentConfluent_per_well_CP228A.png exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/plate_layout_PercentConfluent_per_well_CP228B.png exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/results/sites_with_confluent_regions.csv exists, overwriting /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/PLLS_per_well_CP228A.png exists, overwriting /home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/plotnine/layer.py:401: PlotnineWarning: geom_text : Removed 360 rows containing missing values. /home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/Meraj_analysis/CP228W/recipe/scripts/io_utils.py:11: UserWarning: data/0.site-qc/20210124_6W_CP228/figures/PLLS_per_well_CP228B.png exists, overwriting /home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/plotnine/layer.py:401: PlotnineWarning: geom_text : Removed 360 rows containing missing values. Traceback (most recent call last): File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3081, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'level_3'

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "recipe/0.preprocess-sites/4.image-and-segmentation-qc.py", line 511, in cp_sat_df[["cat", "type", "Ch"]] = cp_sat_df["level_3"].str.split( File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/frame.py", line 3024, in getitem indexer = self.columns.get_loc(key) File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3083, in get_loc raise KeyError(key) from err KeyError: 'level_3' Config file set to perform=False, not performing step: profile--single_cell Loading one single cell file: data/1.profiles/20210124_6W_CP228/single_cell/20210124_6W_CP228_single_cell_profiles_20210124_6W_CP228ALLPLATESperfectset1.csv.gz Config file set to perform=False, not performing step: profile--normalize

bethac07 commented 3 years ago

Well, I'm glad it didn't make anything EXCESSIVELY WORSE, at minimum! Did you get any sense of if it seemed faster or slower than expected based on your experience doing the same steps with the master branch, or too hard to say?

I'm not sure what specifically that error might mean in this context.

We probably want to look into this more carefully- briefly, we can set up a machine with the CloudWatch Agent installed and configured to do memory profiling at a pretty tight interval (30 seconds?), then run successive tests on the same data set with this branch and on master. I'm happy to help get such a machine set up some day next week if someone wants to put a meeting on my calendar, though i certainly need not be involved if someone else is motivated and wants to dig in further.

MerajRamezani commented 3 years ago

I don't think the update has any major side effects. It is hard to be quantitative about it but it definitely felt faster than before ( hard to be sure without testing it).

The traceback error happened on the other machine as well but it is not stopping the process so probably not very serious. Memory still seems to be the issue but the update might have improved the situation. More careful comparison can help to measure any possible improvements.

It is a good idea to get together with @gwaygenomics and setup the machine as you suggested to dig a little bit deeper. At least by figuring out the requirements we can optimize the type of machine setup for the task.

gwaybio commented 3 years ago

Thanks both - I've sent you calendar invites for tomorrow to discuss memory and alarm issues.

gwaybio commented 3 years ago

Also, I think we're good to merge this in, right @MerajRamezani - no errors on your side?

bethac07 commented 3 years ago

If we're going to memory profile this tomorrow, I would say let's wait for the results - I've got a strong suspicion that this will help, but I'd rather close unmerged if it turns out to hurt.

gwaybio commented 3 years ago

sounds good!

bethac07 commented 3 years ago

Per @MerajRamezani - branch (lower graph) knocks a couple of minutes off the processing time AND decreases memory usage by about 20% (42/54 ~= 0.8) compared to master (upper graph). Not a slam dunk, but better than before!

We may or may not want to test this with a lower chunksize going forward, we may see even further gains, though likely not the only optimization we need to do. Master: Branch:

gwaybio commented 3 years ago

@bethac07 @MerajRamezani I'm about to weld the next data - good to merge this?

broadinstitute / pooled-cell-painting-profiling-recipe

add a configurable chunksize whenever loading #78