Open wendy0527 opened 10 months ago
I guess you could first run instance segmentation and then calculate the embedding for each segmented instance. This takes as many forward operations as there are instances in the image, so there may be a more optimal way. Also global context information is lost when embedding each segmented instance individually.
As far as I know, most of the current image retrieval is for a single object or several objects. But for the field of autonomous driving, a picture taken is multi-object, for example, a picture contains people, cars, fences, buildings, trees, etc. Only using [cls_token] will lose a lot of object information. Are there any suggestions for multi-object retrieval?