epfl-cs358 / 2024sp-helping-hand


compare edge detection with semantic segmentation for 3 remotes #52

Closed unglazedstamp closed 1 month ago

violoncelloCH commented 1 month ago

After sharing my progress on the segmentation using Facebook's Segment Anything model, the feedback I got was discouraging. The Segment Anything model does, as its name says, only segmentation: no labelling, no prompting by words. The only prompting it supports is giving it a coordinate or a box, for which the model then finds the matching element it recognizes around/inside it on the image. For reference, here is the original task I had: Research the semantic segmentation from facebook #43.

To double-check, I searched again for "semantic segmentation facebook", but at least for me the first five pages of search results don't yield anything other than the segment-anything.com model. So at this point I don't know if I should continue working with the segment-anything model or if I need to restart with another model.

For option A (continuing), I would extract the center of each mask as its coordinates, and we could probably filter them in the same (or a similar) way as @nourguermazi is already doing for the edge-detection method. For option B (restarting), I'd need a pointer (ideally a URL) to the model/framework I should work with.

@feeds do you have any input/opinion? Considering that I've got four other projects to work on, I'm currently blocked on this, since I can't afford to work on something I'm not sure is the right direction.

For reference, here is what the mask output gives on a set of pictures taken in the plotter (masks displayed as semi-transparent coloured overlays): [5 images]
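For option A, extracting the center of each mask could look roughly like the sketch below. It assumes the masks are 2-D boolean arrays, as in the `"segmentation"` field of segment-anything's automatic mask generator output; `mask_centers` is an illustrative helper name, not code from this repo.

```python
import numpy as np

def mask_centers(masks):
    """Return the (x, y) centroid of each boolean mask.

    `masks` is assumed to be a list of 2-D boolean arrays, e.g. the
    "segmentation" field of SAM's automatic mask generator output.
    """
    centers = []
    for m in masks:
        ys, xs = np.nonzero(m)   # pixel coordinates covered by the mask
        if xs.size == 0:         # skip empty masks
            continue
        centers.append((float(xs.mean()), float(ys.mean())))
    return centers
```

These centers could then go through the same filtering step as the edge-detection coordinates.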

gruvw commented 1 month ago

Hey, thank you for posting feedback and providing images on GitHub too!

Personally (as I already told you), I don't see this as discouraging at all. I think this model actually segments the image pretty well. We can clearly see the buttons being detected correctly, and having masks of that quality is better than the opencv algorithm, I think (for example, the volume-up and volume-down buttons).

I was just surprised to hear/learn only this week that the algorithm isn't about "semantics". When I heard "semantic segmentation", I thought that on top of providing boxes to segment the image it would also produce labels that go with them to extract where the buttons are, or that it could be prompted with words like "segment the remote's buttons on this image" (something like ChatGPT for image segmentation). This algorithm from Facebook does segmentation very well (which is not bad for our use case, as opencv does not provide much more anyway), but it does not do "semantic segmentation". I was surprised to learn this only now, as there have been 3 issues, so 3 weeks (about 50 hours of work), about researching/trying out alternatives to opencv (#10, #43 and this one, #52).

For now, I think we still need to extract a valid config out of this Facebook segmentation (like it was mentioned in #43), with a technique similar to the one Nour used, to have a real comparison between this and the opencv implementation.

I also think we could search online for other algorithms, maybe not from Facebook in particular, that do "semantic segmentation", either with textual prompts or with labels as output.

gruvw commented 1 month ago

With a quick Google search I found a few things about semantic segmentation; here are the links if you need them:

I don't know if those are exactly what we need, but they are for sure a good starting point.

gruvw commented 1 month ago

I also just tried the paid ChatGPT API (about 10-20 cents per query) out of curiosity, and it could also work; we could use their API. It would require a little more in-depth trial and error to find a good prompt, but here is the output for the electric cooler remote.
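For reference, a query like that could be built roughly as below. This is only a sketch: the model name, the prompt, and the `vision_request_payload` helper are illustrative assumptions, and the resulting payload would be sent with the official `openai` client, e.g. `client.chat.completions.create(**payload)`.

```python
import base64

def vision_request_payload(image_bytes, prompt):
    """Build a chat-completion request body asking a vision model
    for button coordinates on a remote (sketch, not the exact prompt used)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4o",  # assumption: any vision-capable model
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # images are passed inline as a base64 data URL
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
```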

Anyway, just to say there are many ways we could try to get coordinates for buttons on a remote. It just requires a bit of time to try out different techniques.

feeds commented 1 month ago

Hey, I think the segmentation from Facebook in the images looks great! It is much more precise than what you were able to get from opencv. Indeed, there are no semantics here, but this is the same for the edge detection, so from my point of view this is a great improvement.

And you are right, there needs to be some kind of post-processing that extracts the buttons. This can be done

image

image

nourguermazi commented 1 month ago

this is the progress so far with edge detection:

Image

There are a lot of parameters to play with; if I use a completely different remote it sometimes doesn't really work (but that could be fixed with a lot of patience).

violoncelloCH commented 1 month ago

Hi @feeds, we are currently working together at Spot, and there are differing opinions on whether we agreed to continue the work on segmentation using the segment-anything model (and potentially the semantics). What was the conclusion there? I can't find any written trace of the decision here in the issues, but this one is still open. Thanks and have a nice evening!

feeds commented 1 month ago

I recall that this was not the priority, as it was unclear how fast you could fix the other issues that will lead to working hardware altogether.

Of course it would be good to have it, as this would increase the reliability of using the helping hand, but working hardware is the bigger concern.

unglazedstamp commented 1 month ago

Semantic segmentation is more generalizable, so we abandoned edge detection.