cioppaanthony / rt-sbs

This repository contains the code for the paper: "Real-Time Semantic Background Subtraction", published at the ICIP 2020 conference.

16-bit segmentation mask #8

Open Rajrup opened 1 year ago

Rajrup commented 1 year ago

Amazing work! The results from the datasets that you shared are excellent. Thanks for sharing!

One question: I am trying to run RT-SBS on my collected videos containing a person. How can I generate the 16-bit PSPNet segmentation mask for these videos? I'm particularly interested in the person class.

Swayzzu commented 1 year ago

Same issue here

cioppaanthony commented 1 year ago

Hi @Swayzzu and @Rajrup,

Thank you for your nice comment and interest in our work!

These 16-bit images encode the "semantic foreground" probability of each pixel with 16-bit precision. You can extract them from any segmentation network by taking the output right before the argmax that selects the most probable class. Once you have the per-class, per-pixel probabilities of the network, sum the probabilities of the foreground classes (the ones that are of interest to you) so that you end up with a single channel.

To be more precise, the semantic segmentation network PSPNet trained on the ADE20K dataset outputs a vector containing 150 real numbers for each pixel, where each number is associated to a particular object class within a set of 150 mutually exclusive classes. The semantic probability estimate is computed by applying a softmax function to this vector and summing the values obtained for classes that belong to a subset of classes that are relevant for motion detection. We use the subset: person, car, cushion, box, boot, boat, bus, truck, bottle, van, bag and bicycle, whose elements correspond to moving objects of the CDNet 2014 dataset.
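The aggregation described above can be sketched as follows. This is a minimal illustration, not the exact code of this repository: the array shapes, the NumPy softmax, and the class indices are assumptions (look up the real indices in the ADE20K label list).

```python
import numpy as np

# Hypothetical ADE20K class indices for a subset of moving-object classes
# (illustrative values only, e.g. person, car, bicycle).
FOREGROUND_CLASSES = [12, 20, 127]

def semantic_probability(logits):
    """Turn raw network logits of shape (150, H, W) into a single-channel
    foreground probability map of shape (H, W)."""
    # Softmax over the class axis to get per-pixel class probabilities
    # (subtracting the max for numerical stability).
    exp = np.exp(logits - logits.max(axis=0, keepdims=True))
    probs = exp / exp.sum(axis=0, keepdims=True)
    # Sum the probabilities of the classes of interest into one channel.
    return probs[FOREGROUND_CLASSES].sum(axis=0)
```

Since the 150 classes are mutually exclusive, the resulting map is a valid probability in [0, 1] for "this pixel belongs to a moving-object class".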

Note that it is not mandatory to work at this exact precision. 64-bit, 32-bit, or even 8-bit precision would work similarly, and you could choose a different set of classes of interest.

Let me know if this is clearer now, and don't hesitate to ask if you have any further questions.

Swayzzu commented 1 year ago

Hi @cioppaanthony ,

Thank you so much for your reply!

I do have some extra questions:

  1. After the classes of interest are chosen and their probabilities are aggregated, what I get is a float (for example, 0.5), which is apparently not a 16-bit number. I tried the following: int(0.5 * 65535) = 32767. Is this the right way to do it? Since it is so simple, I'm worried that I did something wrong.
  2. From your code, I can see four thresholds: the FG and BG thresholds for segment_semantics, and the FG and BG thresholds for segment_no_semantics. Why did you choose these threshold values?
  3. If I use my custom dataset and the number of classes changes, do I need to modify these threshold values? If so, how do I choose them to fit my dataset?
cioppaanthony commented 1 year ago

Hi @Swayzzu,

  1. Yes, that is the correct way. As you can see in the argument parser, the threshold values are integers. This is one way to represent "16-bit" information (not optimal, I agree, but it was the one used in the original SBS code, so I kept it for consistency).
  2. The threshold values were optimized using a Bayesian optimization strategy on the CDNet 2014 dataset with the overall F1 score as optimization criterion.
  3. Since CDNet contains various video categories, the default thresholds should already provide a good baseline performance. Afterwards, you could optimize them for your own dataset through a grid search or Bayesian optimization strategy.
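To make point 1 concrete, here is a minimal sketch of the float-to-16-bit conversion. The helper name and the saving suggestion in the comment are my own illustration, not code from this repository:

```python
import numpy as np

def to_uint16(prob_map):
    """Map probabilities in [0, 1] to integers in [0, 65535]."""
    return np.round(prob_map * 65535.0).astype(np.uint16)

# Example: a tiny 2x2 probability map.
prob_map = np.array([[0.0, 0.5],
                     [0.25, 1.0]], dtype=np.float32)
mask16 = to_uint16(prob_map)
# To store the mask, write it as a 16-bit single-channel PNG, e.g. with
# OpenCV: cv2.imwrite("semantic_mask.png", mask16)  (filename is illustrative;
# PNG preserves the 16-bit depth, unlike JPEG).
```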
Swayzzu commented 1 year ago

Thank you! That's very helpful!