I will try to answer your questions below:
The stereo images make the depth training more accurate, are crucial for the mask module refinement, and are needed for generating the auxiliary masks.
For the depth module, yes. For the mask module, as I mentioned above, we need the stereo images both for refining the mask module and for generating the auxiliary masks.
I wanted to clarify the following: