chaoyuaw / pytorch-coviar

Compressed Video Action Recognition
https://www.cs.utexas.edu/~cywu/projects/coviar/
GNU Lesser General Public License v2.1

Some questions about coviar data loader #7

Closed. Manolo1988 closed this issue 6 years ago.

Manolo1988 commented 6 years ago

Hi, I have some questions about coviar_data_loader.c. First, you initialize the variable accu_src_old as follows, but I wonder why:

                    for (size_t x = 0; x < w; ++x) {
                        for (size_t y = 0; y < h; ++y) {
                            accu_src_old[x * h * 2 + y * 2    ]  = x;
                            accu_src_old[x * h * 2 + y * 2 + 1]  = y;
                        }
                    }

Second, does the following code mean that every frame in the target GOP before the target frame is decoded, but only the I-frame and the target frame are converted to BGR format?

            if (cur_gop == gop_target && cur_pos <= pos_target) {
                ret = avcodec_decode_video2(pCodecCtx, pFrame, &got_picture, &packet);  
......
                if (got_picture) {

                    if ((cur_pos == 0              && accumulate  && representation == RESIDUAL) ||
                        (cur_pos == pos_target - 1 && !accumulate && representation == RESIDUAL) ||
                        cur_pos == pos_target) {
                        create_and_load_bgr(
                            pFrame, pFrameBGR, buffer, bgr_arr, cur_pos, pos_target);
                    }

Third, in dataset.py, I wonder why you process the img as follows:

    def clip_and_scale(img, size):
        return (img * (127.5 / size)).astype(np.int32)

Thanks for your excellent work and code. Looking forward to your reply :)
chaoyuaw commented 6 years ago

Hi @Manolo1988 ,

Thanks for your questions.

  1. To get accumulated MVs and residuals, we keep track of "where a pixel moves in the following frames". Initially, a pixel at (x, y) is located at (x, y), so we initialize the map to the identity (x, y). Through motion compensation in the following frames, the pixel may be copied to other locations. We compare the final location with the original location to get the accumulated motion vectors (and accumulated residuals); see the sketch after this list. Please feel free to let me know if this makes sense to you.

  2. I'm not sure I fully understand your question, but I'll try to answer based on my understanding; please let me know if this answers it. The reason we construct BGR is to compute accumulated residuals, i.e. the difference between the predicted frame (without adding residuals along the path) and the actual frame. The predicted frame is a function of the MVs and the I-frame, so for the frames between the I-frame and the target frame we only need their MVs and never have to decode them to BGR.

  3. This follows the convention of two-stream networks, where optical flow is clipped at a certain magnitude. Here we scale [-20, 20] to [0, 255], and values beyond that range are clipped.
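
To make points 1-3 concrete, here is a minimal NumPy sketch of the accumulation idea. It is illustrative only: the function name accumulate_gop, the dense per-pixel motion-vector input per_frame_mv, and the +128 offset in the last step are assumptions made for this example, not the layout or exact convention used in coviar_data_loader.c.

    import numpy as np

    def accumulate_gop(frames, per_frame_mv):
        """frames[0] is the decoded I-frame; per_frame_mv[t][y, x] is a
        hypothetical per-pixel backward motion vector (dx, dy) for the t-th
        P-frame, i.e. pixel (x, y) is predicted from (x - dx, y - dy) in the
        previous frame."""
        h, w = frames[0].shape[:2]
        ys, xs = np.mgrid[0:h, 0:w]
        # Identity map: initially every pixel "comes from" itself,
        # which is what the accu_src_old initialization expresses.
        src = np.stack([xs, ys], axis=-1)                  # (h, w, 2) as (x, y)

        for mv in per_frame_mv:                            # walk the GOP forward
            ref_x = np.clip(xs - mv[..., 0], 0, w - 1).astype(int)
            ref_y = np.clip(ys - mv[..., 1], 0, h - 1).astype(int)
            # Compose the maps: the I-frame origin of pixel (x, y) is the
            # origin of the reference pixel it was copied from.
            src = src[ref_y, ref_x]

        # Accumulated motion vector: final location vs. original location
        # (up to the sign convention used in the C loader).
        acc_mv = np.stack([xs, ys], axis=-1) - src

        # Accumulated residual: actual target frame minus the I-frame warped
        # along the accumulated motion (the "predicted frame" in point 2).
        predicted = frames[0][src[..., 1], src[..., 0]]
        acc_res = frames[-1].astype(np.int32) - predicted.astype(np.int32)

        # Point 3: scale [-20, 20] onto roughly [0, 255] and clip; the 128
        # offset here is only to illustrate that mapping.
        mv_img = np.clip(acc_mv * (127.5 / 20) + 128, 0, 255).astype(np.uint8)
        return acc_mv, acc_res, mv_img

The composition step in the loop is roughly what the accu_src_old buffer in coviar_data_loader.c keeps track of, carried out in C as the motion vectors of each frame are decoded.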

Manolo1988 commented 6 years ago

Thank you for your reply. I have finally figured out the meaning of the code above, and I really appreciate your ideas.

RyanCV commented 6 years ago

@chaoyuaw 1. Why do x_start and y_start start at negative values?

    for (int x_start = (-1 * mv->w / 2); x_start < mv->w / 2; ++x_start) {
        for (int y_start = (-1 * mv->h / 2); y_start < mv->h / 2; ++y_start) {
            ...
        }
    }

  2. In lines 288-296, why does bgr_arr have dims[4]? What does dims[0] mean?

    // Initialize arrays.
    if (! (*bgr_arr)) {
        npy_intp dims[4];
        dims[0] = 2;
        dims[1] = h;
        dims[2] = w;
        dims[3] = 3;
        *bgr_arr = PyArray_ZEROS(4, dims, NPY_UINT8, 0);
    }