DistributedScience / Distributed-CellProfiler

Run encapsulated docker containers with CellProfiler in the Amazon Web Services infrastructure.
https://distributedscience.github.io/Distributed-CellProfiler/

Very slow CellProfiler pipeline start up times #170

Closed: gareth-rogers-healx closed this issue 2 months ago

gareth-rogers-healx commented 2 months ago

Hi

I'm attempting to read data from a bucket which is managed by Signals Image Artist (SImA) and stores our raw images. SImA uses a structure of <bucket>/<dataset-uuid>/<images>, so, unlike the more deeply nested JUMP data structure, this section of code takes about 6hrs to list all the data within the top-level directories.

I have commented out this line to see if I could speed things up. Frustratingly, whether I process one well-site or a whole plate, it still takes >6hrs for the job to start.
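For a rough, CellProfiler-independent sense of how much that listing has to walk through, something like this should show the scale (the bucket name below is just a placeholder for our SImA-managed bucket):

# List the top-level dataset prefixes, then count every key in the bucket;
# with the flat <bucket>/<dataset-uuid>/<images> layout a full listing has
# to page through every object.
aws s3 ls s3://example-sima-bucket/ | head
time aws s3 ls --recursive s3://example-sima-bucket/ | wc -l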

For example, if I hack the container to run forever without looking at the queue, and then run cellprofiler in tmux with -L DEBUG added, I can see:

JVM will be started with AWT in headless mode
Creating JVM object
Signalling caller
Enabled Bio-formats directory cacheing
Your pipeline version is 426 but you are running CellProfiler version 424. Loading this pipeline may fail or have unpredictable results.
Loaded backend qt5agg version unknown.
Loaded backend wxagg version unknown.
Loaded backend agg version unknown.
Loaded backend agg version unknown.

At this point nothing happens for 6hrs or so, and then I get:

HDF5Dict.__init__(): /tmp/Cpmeasurementso9qds74e.hdf5, temporary=True, copy=None, mode=w
Creating converter from 5 to 3
HDF5Dict.flush(): /tmp/Cpmeasurementso9qds74e.hdf5, temporary=True
Creating converter from 3 to 5
HDF5Dict.flush(): /tmp/Cpmeasurementso9qds74e.hdf5, temporary=True
Times reported are CPU and Wall-clock times for each module
Getting image reader for: OrigDNA, None, file:/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/images/fda_screen/images/BR15538075/r02c02f01p-ch1.tiff
Falling back to Java reader.
Getting image reader for: None, None, file:/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/images/fda_screen/images/BR15538075/r02c02f01p-ch1.tiff
Getting image reader for: OrigER_RNA, None, file:/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/images/fda_screen/images/BR15538075/r02c02f01p-ch2.tiff
Falling back to Java reader.
Getting image reader for: None, None, file:/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/images/fda_screen/images/BR15538075/r02c02f01p-ch2.tiff
Getting image reader for: OrigMito, None, file:/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/images/fda_screen/images/BR15538075/r02c02f01p-ch4.tiff
Falling back to Java reader.
Getting image reader for: None, None, file:/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/images/fda_screen/images/BR15538075/r02c02f01p-ch4.tiff
Getting image reader for: OrigActin, None, file:/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/images/fda_screen/images/BR15538075/r02c02f01p-ch5.tiff
Falling back to Java reader.
Getting image reader for: None, None, file:/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/images/fda_screen/images/BR15538075/r02c02f01p-ch5.tiff
Getting image reader for: OrigGPM, None, file:/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/images/fda_screen/images/BR15538075/r02c02f01p-ch3.tiff
Falling back to Java reader.
Getting image reader for: None, None, file:/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/images/fda_screen/images/BR15538075/r02c02f01p-ch3.tiff
Mon Jun 10 19:05:26 2024: Image # 1, module LoadData # 1: CPU_time = 3.31 secs, Wall_time = 3.73 secs
Mon Jun 10 19:05:30 2024: Image # 1, module Resize # 2: CPU_time = 0.02 secs, Wall_time = 0.03 secs
Mon Jun 10 19:05:30 2024: Image # 1, module CorrectIlluminationCalculate # 3: CPU_time = 0.00 secs, Wall_time = 0.00 secs
/opt/venv/lib/python3.8/site-packages/skimage/filters/rank/generic.py:262: UserWarning: Bad rank filter performance is expected due to a large number of bins (16739), equivalent to an approximate bitdepth of 14.0.
  image, footprint, out, mask, n_bins = _preprocess_input(image, footprint,
Mon Jun 10 19:05:30 2024: Image # 1, module Resize # 4: CPU_time = 0.09 secs, Wall_time = 0.09 secs
Mon Jun 10 19:05:30 2024: Image # 1, module Resize # 5: CPU_time = 0.03 secs, Wall_time = 0.03 secs
Mon Jun 10 19:05:30 2024: Image # 1, module CorrectIlluminationCalculate # 6: CPU_time = 0.00 secs, Wall_time = 0.00 secs
/opt/venv/lib/python3.8/site-packages/skimage/filters/rank/generic.py:262: UserWarning: Bad rank filter performance is expected due to a large number of bins (31112), equivalent to an approximate bitdepth of 14.9.
  image, footprint, out, mask, n_bins = _preprocess_input(image, footprint,
....

In case you notice the non-standard /home/ubuntu/destination_bucket path: the above is with an S3FS mount I created for a different bucket with some test data, so it isn't as large as our SImA-managed bucket. Still, it's taking a crazy length of time, and the same is true with the standard DCP setup.

Once cellprofiler is actually running, that job took about a minute, as it was just one well-site's worth of input.

I have copied the data for a well-site into a local directory, and without going via the S3FS mount cellprofiler starts processing the data pretty much instantly.

Do you have any idea what could be happening?

I apologise that this is a bit of a scattered report; I've run a few too many tests with different errors, and each one takes a working day to run! I can provide more detailed timings, but I keep forgetting to prefix the commands with time :(

Thanks Gareth

ErinWeisbart commented 2 months ago

Are you comfortable making your own DCP Docker image? If so, you could try commenting out the bucket listing.

(Unfortunately, I'm not sure anyone currently maintaining DCP knows the history of this particular code block or the "strange S3FS error" it's meant to get around. It's on my list to see if I can track it down, but I have no time estimate when that would make it into an official DCP version or Docker.)

bethac07 commented 2 months ago

Can you let us know the exact call you're using (should be part of the logs, i.e. cellprofiler -c -r -p etc.) and how big a file list is being passed in? Also, is there any change if you use DOWNLOAD_FILES or not? The latter would help distinguish an S3FS issue from a CellProfiler issue.

gareth-rogers-healx commented 2 months ago

Thanks for getting back to me. I think I understand the cause of the problem.

My setup: I have built the image myself, and for these tests run-worker.sh enters an infinite while loop before calling cp-worker.py. I can then use SSM to connect to the EC2 instance and container (there's only one node in my cluster for these tests) and run cellprofiler directly. It speeds up the testing a lot!
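Concretely, the debugging hack is roughly this at the top of my run-worker.sh (a sketch of my local modification, not anything shipped with DCP):

# Park the worker before it ever polls the SQS queue so I can connect
# via SSM and run cellprofiler by hand inside the container.
while true; do
  echo "worker parked for debugging"
  sleep 300
done
# (the normal call to cp-worker.py follows here and is never reached)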

After writing the above (at the end of my day) I realised what the likely problem was. I've run three more tests:

time cellprofiler \
  -c \
  -r \
  -p /home/ubuntu/destination_bucket/integration-tests/source_integration_tests/workspace/software/cellprofiler-pipelines/2024_fda_screen_v1_illumination.cppipe \
  -i /home/ubuntu/source_bucket/dataset-4e08bce4-3d49-11ee-99b9-02420a00022c/ \
  -o /home/ubuntu/local_output/sima/BR15538075/1-1/ \
  -d /home/ubuntu/local_output/sima/BR15538075/1-1/cp.is.done \
  --data-file=/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/workspace/load_data/fda_screen/BR15538075/load_data_triin_from_sima_1.csv \
  -g Metadata_Plate=15538075 \
  -L DEBUG > >(tee /home/ubuntu/local_output/sima/BR15538075/1-1/stdout.log) 2> >(tee /home/ubuntu/local_output/sima/BR15538075/1-1/stderr.log)

which took 18s

real    0m18.448s
user    0m4.403s
sys     0m0.712s

time cellprofiler \
  -c \
  -r \
  -p /home/ubuntu/destination_bucket/integration-tests/source_integration_tests/workspace/software/cellprofiler-pipelines/2024_fda_screen_v1_illumination.cppipe \
  -i /home/ubuntu/local_output/ \
  -o /home/ubuntu/local_output/sima/BR15538075/1-3/ \
  -d /home/ubuntu/local_output/sima/BR15538075/1-3/cp.is.done \
  --data-file=/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/workspace/load_data/fda_screen/BR15538075/load_data_triin_from_sima_1.csv \
  -g Metadata_Plate=15538075 \
  -L DEBUG > >(tee /home/ubuntu/local_output/sima/BR15538075/1-3/stdout.log) 2> >(tee /home/ubuntu/local_output/sima/BR15538075/1-3/stderr.log)

which took 4s

real    0m4.525s
user    0m4.381s
sys     0m0.492s

and

time cellprofiler \
  -c \
  -r \
  -p /home/ubuntu/destination_bucket/integration-tests/source_integration_tests/workspace/software/cellprofiler-pipelines/2024_fda_screen_v1_illumination.cppipe \
  -i /home/ubuntu/source_bucket/ \
  -o /home/ubuntu/local_output/sima/BR15538075/1-2/ \
  -d /home/ubuntu/local_output/sima/BR15538075/1-2/cp.is.done \
  --data-file=/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/workspace/load_data/fda_screen/BR15538075/load_data_triin_from_sima_1.csv \
  -g Metadata_Plate=15538075 \
  -L DEBUG > >(tee /home/ubuntu/local_output/sima/BR15538075/1-2/stdout.log) 2> >(tee /home/ubuntu/local_output/sima/BR15538075/1-2/stderr.log)

which is still running after >2.5hrs.

The difference is the input directory, -i: in order, the dataset prefix inside the S3FS-mounted SImA bucket (18s), a local directory (4s), and the root of the S3FS-mounted SImA bucket (still running after >2.5hrs).

I have been 'lazy': since I don't understand how the input bucket is used and I've never got relative paths to work in the load_data.csv, I've been leaving the input directory set to the mounted bucket. Until these tests, that bucket was a test bucket with a small but growing amount of data, so the problem probably got lost in the noise when processing a plate's worth of data with a well-site level grouping.

I guess CellProfiler is doing a recursive listing of the input directory, or something like that, before it processes the load_data.csv paths. Using directories with much faster file listings brings the runtime back to something reasonable.
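One way to check that theory independently of CellProfiler would be to time a recursive file listing of each of the three -i directories used above:

# Walk each candidate input directory the way a recursive scan would.
time find /home/ubuntu/source_bucket/dataset-4e08bce4-3d49-11ee-99b9-02420a00022c/ -type f | wc -l
time find /home/ubuntu/local_output/ -type f | wc -l
time find /home/ubuntu/source_bucket/ -type f | wc -l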

Does any of that make sense?

In the longer term I will build without the bucket directory listing and hope that S3FS has fixed whatever weird behaviour it was working around. The build is using the latest version of S3FS.

bethac07 commented 2 months ago

Yup, if you pass CellProfiler a -i it assumes you want it to check that folder for more image sets, so it's good to set it thoughtfully. We do have that note in our Troubleshooting documentation, but we're happy to take documentation PRs if there are other places you think it might help!
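In other words, when everything the pipeline needs comes from the LoadData CSV with absolute paths, pointing -i at a small, fast-to-list directory keeps startup quick, e.g. something like (paths purely illustrative):

cellprofiler -c -r \
  -p /path/to/pipeline.cppipe \
  --data-file=/path/to/load_data.csv \
  -i /path/to/small_local_dir/ \
  -o /path/to/output/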

[screenshot of the Troubleshooting documentation note]

gareth-rogers-healx commented 2 months ago

Thanks @bethac07, the note in the troubleshooting documentation is excellent. You can't help people who don't read the documentation by documenting in more places :P

In my defence, I have read a lot of the documentation; I just forgot to check the troubleshooting section this time, although I do know it exists 🙈

Thanks!