Hello
How are you?
Thanks for contributing to this project.
If there are three people in a room, each performing a different, independent action, does this method produce a separate caption for each of those actions?
Also, is it possible to localize the region (position) of the person performing each action?