Run `segmenter` offline

AdrienWehrle commented 1 month ago

Hi,

Thanks a lot for all your work! It's very nice browsing through your open source code. :)

Because the segmenter is quite expensive to run on a lot of images, I'd like to run it on a machine with more resources after exporting the images, without any open connection to the PlanktoScope.

Browsing through the backend, it looks like I could write an alternative version of the SegmenterProcess class that ignores all the MQTT parts and client setup and focuses only on the skimage processing. Basically, I'd take _slice_image() and either run it outside of the class or write a new, shorter version.

I was just wondering if you think that's a good idea, or if you'd have a better solution?

Thanks a lot!

ethanjli commented 1 month ago

Hi, Adrien - thanks for opening this issue! Being able to run the segmenter on other computers is something we want to make easy, and progress on that desired improvement is tracked at https://github.com/PlanktoScope/PlanktoScope/issues/371 and https://github.com/PlanktoScope/PlanktoScope/issues/378 . Currently my stopgap solution (and what I run on my laptop to process my PlanktoScope datasets) is a way to use Docker to run the segmenter together with the MQTT broker and a reduced version of the Node-RED dashboard which are currently required to control the segmenter: https://github.com/PlanktoScope/pallet-segmenter ; this is a very inelegant approach, but it was the easiest/fastest approach for me to get something on my laptop. However, one person who tested it encountered some issues with it which I haven't been able to reproduce on my computer (see https://github.com/PlanktoScope/pallet-segmenter/issues/1 for details), and I haven't had the capacity to troubleshoot those problems. I am pretty confident that we could avoid those kinds of problems entirely if we could perform image-processing with a command-line command or by using the segmenter as a Python library (e.g. to run in a Jupyter notebook).

If you look at https://github.com/PlanktoScope/PlanktoScope/issues/378 , you can see that I would like to refactor the segmenter to separate the MQTT API from the actual image-processing functionality, so that we can invoke the image-processing functionality without relying on an external MQTT broker and client. However, we have not allocated time to do this refactoring (details in https://github.com/PlanktoScope/PlanktoScope/issues/378#issuecomment-1997928156). It'd be very helpful if you could do one or more of the refactoring steps listed in https://github.com/PlanktoScope/PlanktoScope/issues/378, and I'd be happy to work with you on ensuring that the refactoring proceeds smoothly (in terms of minimizing potential merge conflicts by merging multiple smaller PRs consisting of incremental changes, and by coordinating to prevent major merge conflicts with the FairScope interns working on other changes to the segmenter in https://github.com/PlanktoScope/device-backend/pull/40). If you're interested, we can start by discussing some potential initial approaches to splitting up the segmenter - we can discuss either here on GitHub or in any of the weekly PlanktoScope software development meetings.

AdrienWehrle commented 1 month ago

Hi @ethanjli !

Thanks a lot for taking the time to answer and for providing all this info and development context. For now, I forked the backend and commented out all the MQTT code in SegmenterProcess, and I could run the segmenter on my laptop just fine. The quickest stable version of this could be to add an optional offline argument (or a better name for it) to SegmenterProcess, make all MQTT code conditional on that argument, and set it to False by default. That should avoid breaking anything that relies on or is linked to that class, while making offline processing available fairly easily (though only to users willing to look into the backend).
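To make the idea concrete, here is a minimal sketch of the flag-based approach. It is not the real SegmenterProcess: `SegmenterSketch`, `FakeMqttClient`, the `offline` parameter name, the topic string, and the placeholder "processing" are all assumptions made up for illustration.

```python
from typing import List, Optional


class FakeMqttClient:
    """Hypothetical stand-in for the MQTT client; it only records what is published."""

    def __init__(self) -> None:
        self.published: List[str] = []

    def publish(self, topic: str, payload: str) -> None:
        self.published.append(f"{topic}: {payload}")


class SegmenterSketch:
    """Toy stand-in for SegmenterProcess (assumed structure, not the real class)."""

    def __init__(self, offline: bool = False) -> None:
        # offline defaults to False, so existing callers keep the MQTT behaviour.
        self.offline = offline
        # The client is only set up when running with MQTT.
        self.client: Optional[FakeMqttClient] = None if offline else FakeMqttClient()

    def _report(self, message: str) -> None:
        # Every piece of MQTT code is conditional on the flag.
        if not self.offline:
            self.client.publish("example/status", message)  # placeholder topic

    def segment(self, image_names: List[str]) -> List[str]:
        self._report("Started")
        # Placeholder for the actual image processing.
        results = [f"{name}: segmented" for name in image_names]
        self._report("Done")
        return results
```

With `SegmenterSketch(offline=True)`, `segment()` returns its results without touching any client, while the default constructor keeps reporting status as before.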

I read through https://github.com/PlanktoScope/PlanktoScope/issues/378 but couldn't find precise implementation requests, hence my proposal here. :)

Happy to work on that feature if you think it makes sense :+1:

ethanjli commented 1 month ago

Hi, Adrien! Your approach seems like a great plan for a first iteration of refactoring. Here's my proposed (slightly modified version of your) plan, to avoid having to figure out a good name for the "offline" optional argument:

1. In every SegmenterProcess method other than SegmenterProcess.run, any call to a method on self.segmenter_client (which is the MQTT client) should first check that self.segmenter_client is not None, and the call should be skipped if self.segmenter_client is None.
2. To run the segmenter as a background worker with the MQTT API (i.e. the current functionality), a SegmenterProcess instance should be launched via its start method, the same as before. So no changes are needed to https://github.com/PlanktoScope/device-backend/blob/main/processing/segmenter/main.py , and there will be no impact on existing segmenter functionality.
3. To process a dataset in Python without doing anything MQTT-related, you should instantiate a SegmenterProcess (the event parameter can be left as None) and then call the segment_list or segment_all method with arguments specifying the datasets you want to process.
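A rough sketch of how this plan could look, under some assumptions: `SegmenterSketch` and `RecordingClient` are hypothetical stand-ins rather than the real classes, the topic string is a placeholder, and the `segment_list` signature and its "processing" are invented for illustration. The only parts taken from the plan are the `self.segmenter_client is not None` guard and the idea that the client exists only on the background-worker path.

```python
from typing import List, Optional


class RecordingClient:
    """Hypothetical stand-in for the MQTT client; it only records payloads."""

    def __init__(self) -> None:
        self.messages: List[str] = []

    def publish(self, topic: str, payload: str) -> None:
        self.messages.append(payload)


class SegmenterSketch:
    """Toy stand-in for SegmenterProcess (assumed structure, not the real class)."""

    def __init__(self, event=None) -> None:
        self.event = event
        # Stays None unless run() is reached via the background-worker path.
        self.segmenter_client: Optional[RecordingClient] = None

    def run(self) -> None:
        # Background-worker path: the only place where the client is created.
        self.segmenter_client = RecordingClient()

    def _publish_status(self, payload: str) -> None:
        # Outside of run(), every client call is skipped when there is no client.
        if self.segmenter_client is not None:
            self.segmenter_client.publish("example/status", payload)

    def segment_list(self, paths: List[str]) -> List[str]:
        # Assumed signature; the real method may take different arguments.
        self._publish_status("Started")
        processed = [f"processed {path}" for path in paths]  # placeholder work
        self._publish_status("Done")
        return processed
```

Used as a library, `SegmenterSketch().segment_list([...])` never touches a client; once `run()` has set one up, the same method reports its status through it.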

Based on your testing so far in your forked version of the backend, can you think of any potential problems with this? If not, and if this plan seems clear and reasonable to you, I think this will be a simple and straightforward PR, and I can do the PR review+approval for the changes described above.

PlanktoScope / device-backend

Run `segmenter` offline #43