What is the "mask" from the input image?

I am confused about the "mask" in detr.py. Could you explain what it is, please? Also, if the input images are all resized to the same size, then we don't need the "mask", right?

def downsample_masks(self, masks, x):
    masks = tf.cast(masks, tf.int32)
    masks = tf.expand_dims(masks, -1)
    masks = tf.compat.v1.image.resize_nearest_neighbor(masks, tf.shape(x)[1:3], align_corners=False, half_pixel_centers=False)
    masks = tf.squeeze(masks, -1)
    masks = tf.cast(masks, tf.bool)
    return masks

def call(self, inp, training=False, post_process=False):
    x, masks = inp
    x = self.backbone(x, training=training)
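
To make the question concrete, here is a small NumPy sketch of what I think is going on. It is my own guess, not code from the repository. The helper names `pad_batch` and `downsample_masks_np` are hypothetical, and I am assuming the convention that True marks padded pixels (as I understand the original DETR does); I have not confirmed that this repository uses the same convention. The downsampling mirrors nearest-neighbour resizing with `align_corners=False` and `half_pixel_centers=False`, where output pixel (y, x) samples input pixel (floor(y * in_h / out_h), floor(x * in_w / out_w)).

```python
import numpy as np


def pad_batch(images):
    """Zero-pad HxWxC images to a common size and return (batch, masks).

    Hypothetical helper. Assumed convention: True in the mask marks padding.
    """
    max_h = max(img.shape[0] for img in images)
    max_w = max(img.shape[1] for img in images)
    channels = images[0].shape[2]
    batch = np.zeros((len(images), max_h, max_w, channels), dtype=np.float32)
    masks = np.ones((len(images), max_h, max_w), dtype=bool)
    for i, img in enumerate(images):
        h, w = img.shape[:2]
        batch[i, :h, :w] = img       # real pixels go in the top-left corner
        masks[i, :h, :w] = False     # real pixels are not padding
    return batch, masks


def downsample_masks_np(masks, out_h, out_w):
    """Nearest-neighbour downsampling of a (N, H, W) boolean mask."""
    in_h, in_w = masks.shape[1:3]
    # Source index for each output row/column: floor(i * in / out).
    rows = np.minimum(np.arange(out_h) * in_h // out_h, in_h - 1)
    cols = np.minimum(np.arange(out_w) * in_w // out_w, in_w - 1)
    return masks[:, rows][:, :, cols]


# Two images with different shapes, padded to 8x8.
images = [np.ones((4, 8, 3)), np.ones((8, 4, 3))]
batch, masks = pad_batch(images)

# Pretend the backbone reduced the 8x8 input to a 2x2 feature map.
small = downsample_masks_np(masks, 2, 2)
print(small)

# If every image already has the same size, nothing is padded.
_, same_size_masks = pad_batch([np.ones((4, 4, 3)), np.ones((4, 4, 3))])
print(same_size_masks.any())
```

If this reading is right, the mask only carries information when images in a batch have different sizes, which is why I am asking whether it can be dropped when everything is resized to one size.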

Visual-Behavior / detr-tensorflow
