Closed gareth-rogers-healx closed 2 months ago
Are you comfortable making your own DCP docker? If so you could try commenting out the bucket listing.
(Unfortunately, I'm not sure anyone currently maintaining DCP knows the history of this particular code block or the "strange S3FS error" it's meant to get around. It's on my list to see if I can track it down, but I have no time estimate when that would make it into an official DCP version or Docker.)
Can you let us know the exact call you're using (shoudl be part of the logs, ie cellprofiler -c -r -p etc
) and how big your file list is being passed in? Also, is there any change if you use DOWNLOAD_FILES or not? The latter would help distinguish an S3FS issue from a CellProfiler issue
Thanks for getting back to me. I think I understand the cause of the problem.
My setup, I have built the image myself and for these tests run-worker.sh
enters an infinite while loop before calling cp-worker.py
. I can use SSM to connect to the EC2 instance and container, there's only one node in my cluster for these tests, and run cellprofiler
directly. It speeds up the testing a lot!
I realised after writing the above (at the end of my day) what the likely problem. I've run three more tests:
time cellprofiler \
-c \
-r \
-p /home/ubuntu/destination_bucket/integration-tests/source_integration_tests/workspace/software/cellprofiler-pipelines/2024_fda_screen_v1_illumination.cppipe \
-i /home/ubuntu/source_bucket/dataset-4e08bce4-3d49-11ee-99b9-02420a00022c/ \
-o /home/ubuntu/local_output/sima/BR15538075/1-1/ \
-d /home/ubuntu/local_output/sima/BR15538075/1-1/cp.is.done \
--data-file=/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/workspace/load_data/fda_screen/BR15538075/load_data_triin_from_sima_1.csv \
-g Metadata_Plate=15538075 \
-L DEBUG > >(tee /home/ubuntu/local_output/sima/BR15538075/1-1/stdout.log) 2> >(tee /home/ubuntu/local_output/sima/BR15538075/1-1/stderr.log)
which took 18s
real 0m18.448s
user 0m4.403s
sys 0m0.712s
time cellprofiler \
-c \
-r \
-p /home/ubuntu/destination_bucket/integration-tests/source_integration_tests/workspace/software/cellprofiler-pipelines/2024_fda_screen_v1_illumination.cppipe \
-i /home/ubuntu/local_output/ \
-o /home/ubuntu/local_output/sima/BR15538075/1-3/ \
-d /home/ubuntu/local_output/sima/BR15538075/1-3/cp.is.done \
--data-file=/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/workspace/load_data/fda_screen/BR15538075/load_data_triin_from_sima_1.csv \
-g Metadata_Plate=15538075 \
-L DEBUG > >(tee /home/ubuntu/local_output/sima/BR15538075/1-3/stdout.log) 2> >(tee /home/ubuntu/local_output/sima/BR15538075/1-3/stderr.log)
which took 4s
real 0m4.525s
user 0m4.381s
sys 0m0.492s
and
time cellprofiler \
-c \
-r \
-p /home/ubuntu/destination_bucket/integration-tests/source_integration_tests/workspace/software/cellprofiler-pipelines/2024_fda_screen_v1_illumination.cppipe \
-i /home/ubuntu/source_bucket/ \
-o /home/ubuntu/local_output/sima/BR15538075/1-2/ \
-d /home/ubuntu/local_output/sima/BR15538075/1-2/cp.is.done \
--data-file=/home/ubuntu/destination_bucket/integration-tests/source_integration_tests/workspace/load_data/fda_screen/BR15538075/load_data_triin_from_sima_1.csv \
-g Metadata_Plate=15538075 \
-L DEBUG > >(tee /home/ubuntu/local_output/sima/BR15538075/1-2/stdout.log) 2> >(tee /home/ubuntu/local_output/sima/BR15538075/1-2/stderr.log)
which is still running after >2.5hrs.
The difference is the input directory, -i
, in order:
/home/ubuntu/source_bucket/dataset-4e08bce4-3d49-11ee-99b9-02420a00022c/
the directory where the plate's images are stored - 18s
;/home/ubuntu/local_output/
a local directory, not mounted and basically empty - 4s
; and/home/ubuntu/source_bucket/
the whole input bucket - still running.I have been 'lazy' as I don't understand how the input bucket is used, I've never got relative paths to work in the load_data.csv
I've been leaving it as the input directory, the mounted bucket. Until these tests that bucket has been a test bucket with a small but growing amount of data. The probably probably got lost in the noise when processing a plate's worth of data with a well-site level grouping.
I guess CellProfiler is doing a recursive list of the input directory, or something like that, before it processes the load_data.csv
paths. Using directories that have much faster file listings seems to bring the runtime back to something reasonable.
Does any of that make sense?
In the longer term I will build without the bucket directory listing and hope that S3FS fixed what ever weird behaviour that was solving. The build is using the latest version of S3FS
Yup, if you pass CellProfiler a -i it assumes you want it to check that folder for more image sets, so good to set it thoughtfully. We do have that note in our Troubleshooting documentation but happy for documentation PRs if there are other places you think it might help!
Thanks @bethac07, in the troubleshooting is excellent. You can't help people who don't read the documentation by documenting in more places :P
In my defence I have read a lot of the documentation, I forgot to check the troubleshooting this time although I do know it exists 🙈
Thanks!
Hi
I'm attempting to read data from a bucket which is managed by Signals Image Artist (SImA) and is storing our raw images. SImA has a structure of
<bucket>/<dataset-uuid>/<images>
. This means that this section of code is literally taking 6hrs to list all the data in within the top level directories. Unlike the more nested JUMP data structure.I have commented out this line to see if I could speed things up. Frustratingly whether I process one well-site or a whole plate it is still taking >6hrs for the job to start.
For example, if I hack the container to run forever and not to look in the queue, then running
cellprofiler
intmux
with the added-L DEBUG
I can see:at this point nothing will happen for 6hrs or so, then I get:
and the above is with an S3FS mount I created for a different bucket with some test data so it isn't as large as our SImA managed bucket incase you notice the non-standard
/home/ubuntu/destination_bucket
path. Still it's taking a crazy length of time and it's true with the standard DCP setup.Once
cellprofiler
is running that job took a minute as it was just 1 well-site worth of input.I have copied the data for a well-site into a local directory and without going via the S3FS mount
cellprofiler
starts processing the data pretty much instantly.Do you have any idea what could be happening?
I apologise this is bit of a scattered report, I've run a few too many tests with different errors and then take a working day to run each one! I can provide more detailed timings, I keep forgetting to add
time
:(Thanks Gareth