Architecture for Human Segmentation?

InternetMaster1 commented 4 years ago

Thanks for the amazing library!

I am looking to implement high-quality semantic segmentation on a mobile device for human cutout (full body).

1) Architecture?

What architecture/encoder would be a good choice for the task at hand? MobileNetV2, MobileNetV3, DeepLabV3+, ShuffleNet, PortraitNet, SINet... There are so many that it's confusing. https://github.com/qubvel/segmentation_models.pytorch

I want the highest accuracy, rather than the smallest or fastest model.

2) Objects held by Person?

In the final output mask, how can I also include the objects that a person is holding, say a cup, a purse, a tennis racquet, a balloon, a toy, or a magazine? It could be just about anything.

I am very much perplexed by this problem.

To train human segmentation, I was planning to use the Supervisely Person dataset. If I am not mistaken, the Supervisely dataset doesn't contain masks for objects that the person might be holding. Does that make a dataset like Supervisely unfit for the job? Or do we need to train on a dataset with more labels than just "person"?

Ideally, if an object is lying off to the side, it is OK if it is left out of the mask. But if the person is holding the object, it should definitely be included in the final mask.

How can this be achieved?

Thanks!

anilsathyan7 commented 4 years ago

It would be a good idea to start with a DeepLab model for full-body person segmentation. Try out the sample 21-class model trained with pascal_coco on the TensorFlow website, which already contains the person class.

If you want to train on your own data, use a high-resolution 513x513 input and a depth multiplier of 1 (or more) for the highest accuracy. This may increase the overall inference time, but since you are striving for the highest accuracy, you need to trade away some speed.
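To get a feel for the speed side of that trade-off, here is a rough back-of-the-envelope sketch. It assumes that compute for a MobileNet-style backbone grows roughly with the square of the input side and the square of the depth multiplier, which is only an approximation; real on-device latency also depends on the decoder, the hardware, and the runtime. The function name `relative_cost` and the choice of 513x513 with multiplier 1.0 as the baseline are illustrative, not part of any library.

```python
def relative_cost(input_size, depth_multiplier, base_size=513, base_multiplier=1.0):
    """Estimate compute relative to a baseline configuration.

    Assumption: cost scales roughly with (input side)^2 and with
    (depth multiplier)^2. This is a coarse rule of thumb, not a benchmark.
    """
    resolution_factor = (input_size / base_size) ** 2
    width_factor = (depth_multiplier / base_multiplier) ** 2
    return resolution_factor * width_factor


# Compare a few candidate configurations against the 513x513, multiplier 1.0 baseline.
for size, mult in [(257, 0.5), (257, 1.0), (513, 0.5), (513, 1.0)]:
    cost = relative_cost(size, mult)
    print(f"{size}x{size}, depth multiplier {mult}: about {cost:.2f}x the baseline compute")
```

Treat the numbers as a first guess only; measuring the exported model on the target phone is the reliable way to compare configurations.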

The Supervisely dataset does not have proper masks for objects connected to the person. Try removing such data from Supervisely and using the remaining images in combination with the Pascal/COCO person datasets. Sometimes it is ambiguous whether a connected object should be included (i.e., is it a connected object, or an object in the background partially occluded by the person?). In any case, you need to ensure you have a sufficient number of images (with/without connected objects) for training, as per your specific use case. I have not tried any other techniques for including the connected objects.
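The curation step described above could be sketched roughly as follows. This assumes each image has already been tagged (for example by hand) with a hypothetical `has_held_object` flag; neither that flag nor the record layout comes from Supervisely, Pascal VOC, or COCO, and the file names are made up for illustration.

```python
from collections import Counter


def build_training_set(supervisely_records, extra_records):
    """Drop Supervisely images showing held objects, then merge with other data.

    Assumption: each record is a dict with a boolean 'has_held_object' flag
    that was assigned during a manual review pass.
    """
    kept = [r for r in supervisely_records if not r["has_held_object"]]
    return kept + list(extra_records)


def held_object_balance(records):
    """Count how many training images do and do not show a held object."""
    return Counter("with" if r["has_held_object"] else "without" for r in records)


# Hypothetical records, for illustration only.
supervisely = [
    {"image": "sv_001.png", "has_held_object": False},
    {"image": "sv_002.png", "has_held_object": True},   # held object not in the mask, so dropped
    {"image": "sv_003.png", "has_held_object": False},
]
pascal_coco = [
    {"image": "pc_001.png", "has_held_object": True},
    {"image": "pc_002.png", "has_held_object": False},
]

training_set = build_training_set(supervisely, pascal_coco)
print(held_object_balance(training_set))
```

Checking the with/without counts before training is a cheap way to spot whether one of the two cases is badly under-represented for your use case.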

InternetMaster1 commented 4 years ago

Many thanks for the detailed answer.

DeepLab model: Do you mean the DeepLabV3+ variant? And what about the model, say MobileNetV2, MobileNetV3, ResNet50, PortraitNet, etc.? Are you aware of any chart comparing the accuracy/speed of all these models?
Thanks for the helpful tip about input size.
Wow. That really sheds a lot of light on how to handle connected objects. Things are far clearer now.

I have a few questions:

A) What is the license of your amazing library?
B) If you were to recommend a model from your library, which would you say is best suited to my task? You have tried out a great many combinations.

Thanks!

anilsathyan7 / Portrait-Segmentation

Architecture for Human Segmentation? #10