Localizing positions of objects in a scene

Hello, how are you? Thanks for contributing to this project. In general, a scene contains more than one object. When a scene has multiple objects (e.g., people) performing different actions, we need to localize each object's position. Is it possible to localize the positions of all the objects for one video caption? If that is not possible right now, do you know of any solution or method for this purpose?

X-PLUG / mPLUG-2

Localizing positions of objects in a scene #13