DistributedScience / Distributed-CellProfiler

Run encapsulated docker containers with CellProfiler in the Amazon Web Services infrastructure.
https://distributedscience.github.io/Distributed-CellProfiler/

ExportToDatabase and/or Path Mapping issues with DCP prevent batch data aggregation on AWS RDS - Workaround Added in Comments #129

Closed mamrrmam closed 2 years ago

mamrrmam commented 2 years ago

Hello DCP community,

I am relatively new to DCP. I have things running OK, but I have noticed some odd behaviour in my CloudWatch logs regarding DCP's core behaviour.

For starters, I am running DCP (Docker Hub tag: cellprofiler/distributed-cellprofiler:2.0.0_4.2.1) on AWS using the ECS-optimized Linux AMI, as suggested in the wiki. I am attempting to write the data to a MySQL database using ExportToDatabase, hosted on AWS RDS. So far this has been feasible, but with some odd write behaviour. Everything else runs as expected.

CP does not like to append new data to MySQL and prefers to overwrite; this is fairly well documented, piecemeal, across various issues here and on the forums.

Essentially, my problem boils down to the following:

1. If I use batch files to load the data, DCP does not point at the correct directories.
2. If I use LoadData to load the data, DCP points at the correct directories but overwrites my database on each run.
3. If I use a combination of batch files with LoadData, and use the jobfile.json to point DCP to my image-set-listing.csv in {data:} and to batch_file.h5 in {pipeline:}, the result is similar to (2); see below for details. NB: for this, I edited the image-set listing in Excel to point at the correct directories before running the job.
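For reference, here is a sketch of the relevant jobfile.json fields for the combined setup in (3). The paths are placeholders, other required job-file fields are omitted, and the exact key names follow the DCP example job file as I understand it:

```json
{
    "pipeline": "projects/my_project/batch_file.h5",
    "data_file": "projects/my_project/image-set-listing.csv"
}
```

For (1), both fields point at batch_file.h5 instead.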

So, let's get to it.

Option 1: Batch files - at the end of the pipeline, I create a batch file to run the job (so in my jobfile.json, data and pipeline both point to the batchfile.h5). This causes the following error:


INFO:__main__:Error detected during run of module NamesAndTypes
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/pipeline/_pipeline.py", line 976, in run_with_yield
    self.run_module(module, workspace)
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/pipeline/_pipeline.py", line 1298, in run_module
    module.run(workspace)
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/modules/namesandtypes.py", line 1940, in run
    self.add_image_provider(
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/modules/namesandtypes.py", line 2003, in add_image_provider
    self.add_simple_image(
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/modules/namesandtypes.py", line 2062, in add_simple_image
    self.add_provider_measurements(provider, m, "Image")
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/modules/namesandtypes.py", line 2076, in add_provider_measurements
    img = provider.provide_image(m)
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/image/abstract_image/file/url/_monochrome_image.py", line 33, in provide_image
    image = URLImage.provide_image(self, image_set)
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/image/abstract_image/file/_file_image.py", line 327, in provide_image
    self.__set_image()
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/image/abstract_image/file/_file_image.py", line 264, in __set_image
    self.cache_file()
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/image/abstract_image/file/_file_image.py", line 159, in cache_file
    raise IOError(
OSError: Test for access to directory failed. **Directory: ///E:/XYZ/XYZ/XYZ...**

I have changed my directory here to XYZ/XYZ, but as you can see, the batch file does not properly change the path: DCP is testing for access to a directory on my local machine. I have included a path mapping to my files on S3 (directories that work in the LoadData context; more on this in a minute), but it is not registering in the run. On my local machine I am running CP 4.2.1 on Windows 11, making me think there is an issue with the path-mapping code in this version, or some other issue generated by running it in my local environment.
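For anyone else editing the image-set listing by hand to dodge the path-mapping problem, the rewrite can be scripted instead of done in Excel. This is only a sketch: the column prefixes (PathName_/URL_) match LoadData's conventions, but the path prefixes below are hypothetical placeholders.

```python
# Hypothetical prefixes: the local Windows path and the path the container sees.
LOCAL_PREFIX = "E:/XYZ"
REMOTE_PREFIX = "/home/ubuntu/bucket/projects/my_project"

def rewrite_paths(rows, local_prefix=LOCAL_PREFIX, remote_prefix=REMOTE_PREFIX):
    """Swap the local prefix for the remote one in every PathName_/URL_ column.

    `rows` is a list of dicts, e.g. produced by csv.DictReader over the
    LoadData image-set listing; other columns pass through untouched.
    """
    out = []
    for row in rows:
        new_row = dict(row)
        for key, value in row.items():
            if key.startswith(("PathName_", "URL_")) and value.startswith(local_prefix):
                new_row[key] = remote_prefix + value[len(local_prefix):]
        out.append(new_row)
    return out
```

Read the CSV with csv.DictReader, pass the rows through rewrite_paths, and write them back out with csv.DictWriter before uploading to S3.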

Option 2: LoadData - The other option is using LoadData. Here, DCP continually overwrites the MySQL database on each run, so on my local machine I run the CreateBatchFiles module at the end of the pipeline once, with ExportToDatabase set to overwrite Data and Schema, so that the tables are instantiated in my MySQL database. Then I delete the CreateBatchFiles module, reset the overwrite settings in ExportToDatabase, and save and upload my pipeline to S3. I use the jobfile to point data at a manually edited version of the image-set-list.csv, and pipeline directly at this pipeline.cppipe. If I run this with overwrite set to Data and Schema, the database is overwritten each time an instance runs a job. If I set it to overwrite Data only, the following error occurs:

"Detected error during run of export to database module: Feature ExportToDb_Images for Experiment does not exist."

The full error is reproduced below under (3). It seems to come from within the ExportToDatabase module code, where it repeatedly references the HDF5 dictionary. Interestingly, Images and Experiment are both "group" (handle) elements of the batch file, so most likely this is failing because the module expects a batch file but does not have one. However, as you can see, it also fails when I do provide the batch file.

Option 3: LoadData/batch file combo - The third option is to use the batch file to take advantage of the MySQL write behaviour of that setup, but point DCP at the correct directories. This would ideally overcome the limitations of both (1) and (2); however, pointing DCP at a batch file in any context produces the same failure as in (2). This is also the case if I include the LoadData module in the pipeline but point the jobfile at the batch file for both {pipeline:} and {data:}.

INFO:__main__:Error detected during run of module ExportToDatabase
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/pipeline/_pipeline.py", line 976, in run_with_yield
    self.run_module(module, workspace)
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/pipeline/_pipeline.py", line 1298, in run_module
    module.run(workspace)
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler/modules/exporttodatabase.py", line 2585, in run
    self.record_image_channels(workspace)
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler/modules/exporttodatabase.py", line 4773, in record_image_channels
    image_list = workspace.measurements.get_experiment_measurement("ExportToDb_Images")
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/measurement/_measurements.py", line 962, in get_experiment_measurement
    result = self.get_measurement(EXPERIMENT, feature_name)
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/measurement/_measurements.py", line 837, in get_measurement
    result = self.hdf5_dict[EXPERIMENT, feature_name, 0]
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/utilities/hdf5_dict.py", line 417, in __getitem__
    result = self[object_name, feature_name, [num_idx]]
  File "/usr/local/lib/python3.8/dist-packages/cellprofiler_core/utilities/hdf5_dict.py", line 422, in __getitem__
    assert feature_exists, "Feature {} for {} does not exist".format(
AssertionError: Feature ExportToDb_Images for Experiment does not exist

Conclusions

I have done a lot of poking around in the Batch_file.h5 and have tried many different permutations of batch-file settings, jobfile submissions, etc. The NamesAndTypes directory error is perplexing because I have no idea where the path to my local machine is being passed to DCP from. I have gone as far as manually changing all file paths, URLs, etc. in the batch file to match my cluster, but the error persists. The path is clearly hidden somewhere else, and I have no idea where.

When I add the file paths by .csv, the pipeline clearly gets much further (to ExportToDatabase); however, again an error is thrown. Interestingly, ExportToDb_Images is identical in the batch file whether the files are added with the LoadData module or through the GUI, so I have no idea why this is failing. Perhaps an additional column is required in the .csv that is not there.
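As a debugging aid for anyone poking around the same way, here is a small stdlib-only sketch for checking whether a given feature name, or a stray local path, appears anywhere in a batch file's raw bytes (HDF5 stores link names and string data as bytes, so a byte scan finds them without h5py). The file and feature names below are just examples.

```python
def file_contains(path, needle, chunk_size=1 << 20):
    """Scan a binary file (e.g. Batch_data.h5) for a byte string.

    Reads in chunks and keeps an overlap buffer so matches that
    span a chunk boundary are not missed.
    """
    if isinstance(needle, str):
        needle = needle.encode()
    overlap = len(needle) - 1
    prev = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            if needle in prev + chunk:
                return True
            prev = (prev + chunk)[-overlap:] if overlap else b""

# e.g. file_contains("Batch_data.h5", "ExportToDb_Images")
# or   file_contains("Batch_data.h5", "E:/")  # any leftover local paths?
```

This only tells you that the bytes exist somewhere in the file, not which group they live under, but it is a quick first check before opening the file in an HDF5 viewer.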

Recommendations

For any CP/DCP experts: if you have any comments or advice on next steps for getting my data output to MySQL without overwriting on each run, please reply.

Finally, thanks to the CP/DCP team for providing great software and maintaining such a helpful community here.

All the best,

mamrrmam

mamrrmam commented 2 years ago

Hello, for anyone who has arrived here: I have implemented a workaround based on using SQLite with DCP. I hope this saves you hours and hours of troubleshooting time. In short, the problem with SQLite is that DCP outputs a single database for each image processed on an instance. MySQL output would obviously be preferable, but I could never arrive at a working implementation of that strategy for DCP.

Therefore, I have written code to handle merging and post-processing for large numbers (up to thousands) of databases produced using either "SingleObjectTable" or "SingleObjectView" in the ExportToDatabase module.

See here: https://github.com/mamrrmam/SQLite-MegaMerge-for-Distributed-CellProfiler

and here for the general merge code without the CP bells and whistles: https://github.com/mamrrmam/SQLite-Mega-Merge

The code is not generalized and can take a while to run on particularly large datasets, but it does the job.
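For readers who just want the core idea, the merge step can be sketched in a few lines with sqlite3's ATTACH. This is a minimal sketch, not the repo's actual code; the table names are hypothetical and depend on your ExportToDatabase settings.

```python
import sqlite3

def merge_databases(master_path, shard_paths, tables):
    """Append rows from each per-instance SQLite shard into a master database.

    Tables are created in the master on first sight (empty copy of the
    shard's schema), then rows from every shard are appended.
    """
    con = sqlite3.connect(master_path)
    try:
        for shard in shard_paths:
            con.execute("ATTACH DATABASE ? AS shard", (shard,))
            for table in tables:
                con.execute(
                    f"CREATE TABLE IF NOT EXISTS {table} AS "
                    f"SELECT * FROM shard.{table} WHERE 0"
                )
                con.execute(f"INSERT INTO {table} SELECT * FROM shard.{table}")
            con.commit()  # close the transaction before DETACH
            con.execute("DETACH DATABASE shard")
    finally:
        con.close()
```

Note that this blindly concatenates rows, so anything like unique image numbers across shards has to be handled separately; that is part of what the linked repos deal with.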

Thanks to the CP/DCP community.

ErinWeisbart commented 2 years ago

Thanks so much for sharing your workaround! We have added "Support for RDS" as a requested enhancement to CellProfiler, and when that change is made there, we will ensure it is compatible with DCP.