JordanCheney closed this 7 years ago
Hi Jordan,
I'm editing to add that the below refers just to 1) in your comment. As for 2), we wouldn't currently make use of the frame number, since we assume the video frames are provided in the correct order, but we don't mind the extra info.
Thanks for proposing this change. If I'm understanding it correctly, one janus_media object with a single-element data member would be created for each video frame or standalone image, and the API implementation is expected to use the newly provided janus_media.id field to group items for processing as it deems fit.
That strategy would work for our processing chain, and the code looks fine, but it seems counter to the original intention of the API. The docstring for the janus_media.data member says: "A collection of image data of size N, where N is the number of frames in a video or 1 in the case of a still image." With the proposed change, it seems that N will always be 1, so there is no point in the data member being a vector.
Although either strategy would work for us, I would have a preference for the one described in the janus_media.data docstring, as it would avoid an intermediate data structure and some conversion logic in our code and thus slightly simplify our processing. We then wouldn't need the actual sighting_id value if video frames were grouped into the same janus_media object. If there is something that I've missed please let me know.
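To make the preferred layout concrete, here is a minimal sketch of grouping per-frame rows by their sighting id so each source video yields one media object. The struct shapes and the helper name are hypothetical simplifications, not the actual API types.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the API types (field names from the discussion;
// the pixel payload itself is elided).
struct Frame { int frame_number; };

struct JanusMedia {
    uint32_t id;               // SIGHTING_ID: shared by frames from one video
    std::vector<Frame> data;   // N frames for a video, 1 for a still image
};

// Hypothetical helper: group per-frame rows by sighting id so that each
// source video produces a single media object whose data member holds all
// of its selected frames (N > 1), rather than N single-frame objects.
std::vector<JanusMedia> group_by_sighting(
        const std::vector<std::pair<uint32_t, Frame>>& rows) {
    std::map<uint32_t, JanusMedia> grouped;
    for (const auto& row : rows) {
        JanusMedia& media = grouped[row.first];
        media.id = row.first;
        media.data.push_back(row.second);
    }
    std::vector<JanusMedia> out;
    for (auto& kv : grouped) out.push_back(std::move(kv.second));
    return out;
}
```

With this layout a still image is just the degenerate case of a group with one element, so no intermediate structure or conversion is needed downstream.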
Regards, Nate
Hi Nate,
I'm trying to address two use cases here: in the 1N video protocol we are processing whole videos, while in the 1N mixed protocol we are processing selected video frames.
For 1N video the process works like the docstring indicates: the entire video is loaded into a single media object and the image data contains N frames. The id in that case would be unique among all other loaded media.
In the 1N mixed protocol, however, we've selected and annotated only specific video frames. Currently, the harness creates an individual media object for each of those frames, and the id can be used to tell which objects were extracted from the same source video.
An alternative approach would be to group those extracted frames into a single media object and then indicate in the associated janus_track which frame numbers correspond to which frames. I originally thought this might be confusing because in the 1N video case you would have sequential frames in the image data, while in the 1N mixed you wouldn't.
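The sequential-vs-selected ambiguity described above could also be resolved on the consumer side with a small check over the frame numbers recorded in the track. This is a hypothetical helper, not part of the API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when the frame numbers form a contiguous run, as they would
// for a whole video in the 1N video protocol; selected frames from the
// 1N mixed protocol would generally have gaps.
bool frames_are_sequential(const std::vector<int>& frame_numbers) {
    for (std::size_t i = 1; i < frame_numbers.size(); ++i)
        if (frame_numbers[i] != frame_numbers[i - 1] + 1)
            return false;
    return true;
}
```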
If the alternative is cleaner and/or easier than this version, I'm happy to put an implementation together; just let me know.
Thanks! Jordan
Thanks Jordan, that makes sense. I had assumed (I know, dangerous!) that the final behavior would be as you describe in your 4th paragraph. That makes more sense to me, but I don't feel strongly about it. Perhaps this could be a short discussion topic at next week's meeting if others haven't weighed in by then.
Our system could infer whether we were processing full-framerate vs. downsampled video data: if we receive a janus_media.data with multiple elements but a matching janus_track.track with only a single element, we need to do our own detection and tracking from that seed observation; if there is a track element for every frame, we could process using either the ground-truth boxes or our own matching detections.
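The inference rule sketched above amounts to comparing two sizes. A hypothetical distillation (the mode names and function are illustrative, not API types):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative processing modes inferred from the shapes of the inputs.
enum class ProcessingMode { SingleImage, SeedAndTrack, PerFrameBoxes };

// Hypothetical decision rule: compare the number of frames in
// janus_media.data against the number of elements in the matching track.
ProcessingMode infer_mode(std::size_t media_frames, std::size_t track_elements) {
    if (media_frames <= 1)
        return ProcessingMode::SingleImage;   // still image
    if (track_elements == 1)
        return ProcessingMode::SeedAndTrack;  // one seed box: detect + track ourselves
    return ProcessingMode::PerFrameBoxes;     // a box per frame: use provided boxes
}
```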
Hello,
This is all very positive. Thank you for carefully looking at this.
The SIGHTING_ID I think is related to what in the first few versions of CS3 was the MEDIA_ID.
In our current implementation we've written code to handle fully structured janus_tracks, with each of its media items being one of the following:
Observe that a sequence of these janus_track items is a faithful representation of what a template is. I think this is what @nxwhite-str refers to as no need for intermediate structures or conversions.
If our API implementation is called like that (with a janus_association that fully mimics the template structure), we don't need the MEDIA_ID or the SIGHTING_ID, but the inputs have to be correctly structured. I think this is pretty much in line with what @nxwhite-str is saying.
I think this is the way to go.
Best,
Carlos
Hi Carlos,
You are correct: MEDIA_ID was renamed SIGHTING_ID. It sounds like you and @nxwhite-str are in agreement on the easiest path forward for both of your implementations. I will put together an implementation and update this PR.
While I do that @stephenrawls will this cause any issues for you?
Thanks, Jordan
@JordanCheney -- As long as we can tell the difference between 1-N Video and 1-N Mixed, I don't have a strong opinion on the exact mechanics. If I'm following your description correctly, then I think what you've described is fine with us.
Thanks, Stephen Rawls (ISI)
@nxwhite-str @carlosdcastillo @stephenrawls Hi all, this most recent update should make the evaluation harness pass extracted frames in the same media object. Could you please review when you have a chance?
@JordanCheney Hi Jordan, sorry for the delay in responding. This all looks good to me for the mixed and image protocols. Perhaps it is outside the scope of this update, or I missed it, but I don't see code in janus_io.cpp to handle the video protocol by passing the video filename instead of the image filename into janus_load_media.
@nxwhite-str I'm not sure I'm following; the parsing of the media file name is implementation specific (depending on how janus_load_media is implemented in the janus_io library). In the OpenCV implementation we provided, we try to load the file as an image first and, if that fails, load it as a video. I don't think there needs to be special processing outside of janus_load_media unless you have a use case I'm not seeing?
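For illustration, here is a minimal sketch of that fallback strategy with stub loaders standing in for OpenCV's cv::imread and cv::VideoCapture. All names here are hypothetical; only the try-image-then-video logic mirrors the described reference implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical frame type standing in for cv::Mat.
struct Frame { std::string source; };

// Hypothetical media type: N frames for a video, 1 for a still image.
struct Media { std::vector<Frame> data; };

static bool has_suffix(const std::string& s, const std::string& suf) {
    return s.size() >= suf.size() &&
           s.compare(s.size() - suf.size(), suf.size(), suf) == 0;
}

// Stub "image" loader: succeeds only for image-like extensions.
static bool load_as_image(const std::string& path, Media& out) {
    if (has_suffix(path, ".png") || has_suffix(path, ".jpg")) {
        out.data = { Frame{path} };  // N == 1 for a still image
        return true;
    }
    return false;
}

// Stub "video" loader: succeeds for .mp4 and yields several frames.
static bool load_as_video(const std::string& path, Media& out) {
    if (has_suffix(path, ".mp4")) {
        out.data = { Frame{path}, Frame{path}, Frame{path} };  // N frames
        return true;
    }
    return false;
}

// Mirrors the described strategy: try the file as an image first,
// and fall back to loading it as a video on failure.
bool load_media(const std::string& path, Media& out) {
    return load_as_image(path, out) || load_as_video(path, out);
}
```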
@JordanCheney As an example, I'm looking at the cs3_1N_probe_video.csv file (from the 080316 update), where the first four columns are TEMPLATE_ID, SUBJECT_ID, FILENAME, and VIDEO_FILENAME. For the first template in the file (id 21912), the FILENAME is frames/175312.png and the VIDEO_FILENAME is video/12120.mp4. With the current code and proposed changes, it seems the png will be successfully loaded and passed to janus_create_templates as a standalone image, and the mp4 file ignored.
@nxwhite-str You are right; I'm sorry I didn't understand what you were saying. I think that is out of scope for this change, and I will push the update for that as a separate PR.
Hello all,
A few performers had similar requests for more accessible video information in the recent data call. This PR aims to provide more information in two ways:

1) It adds an 'id' field to janus_media objects. The id is unique for media extracted from different sources and the same for media extracted from the same source (i.e., two video frames extracted from the same video would have the same id, while two standalone images would not). It is populated with the SIGHTING_ID field in the protocols.
2) It populates the 'frame_number' field in janus_attributes in the case where a rectangle is provided. This value is set to a frame number in the case of a video and NAN in the case of an image. It is populated with the FRAME_NUM field in the protocols.
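The frame_number convention in 2) can be distilled as follows. This is a hypothetical sketch of how the harness might map the FRAME_NUM metadata column, not the actual harness code:

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Simplified stand-in for janus_attributes: only the field this PR touches.
struct Attributes { double frame_number; };

// Hypothetical mapping from the FRAME_NUM metadata column: an empty field
// means the row describes a still image, so frame_number is set to NAN;
// otherwise the field holds the video frame number.
Attributes make_attributes(const std::string& frame_num_field) {
    Attributes a;
    a.frame_number = frame_num_field.empty() ? NAN
                                             : std::stod(frame_num_field);
    return a;
}
```

Using a double with NAN as the "no frame" sentinel is why the field can represent both the video and still-image cases without an extra flag.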
@carlosdcastillo, @nxwhite-str I believe you both have mentioned you could benefit from this so can you lead the review?