HYPJUDY / Decouple-SSAD

Decoupling Localization and Classification in Single Shot Temporal Action Detection
https://arxiv.org/abs/1904.07442
MIT License

how to deal with short videos #19

Closed jalaxy33 closed 4 years ago

jalaxy33 commented 4 years ago

Hi, @HYPJUDY. I am trying to use this code on my own dataset, which contains a large number of short videos. After frame extraction, many videos yield very few frames; the most extreme case has only 8 frames, far fewer than the number the code requires. If I understand correctly, the required minimum frame count is 128. So my question is: how should I adjust the code for short videos with few frames? It would be extremely helpful if you could point out specifically which parts of the code need to be modified. Thanks!

HYPJUDY commented 4 years ago

Hi~

  1. You can try to modify the network; for example, use fewer layers so that the network can handle shorter videos overall. You may need to tune the network parameters accordingly.
  2. For the THUMOS14 dataset I did not resize the videos, but in your case you could try resizing the videos to a proper length first and then extracting the frames. Alternatively, you can resize the extracted video features. Below are some code snippets for both options.
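For the first option (resizing the raw videos), one possibility, which I have not tested, is to temporally stretch a short video with ffmpeg's setpts filter before extracting frames. The stretch factor below is only illustrative, and depending on your ffmpeg defaults you may need to pin the output frame rate with -r so that frames actually get duplicated:

import subprocess

def stretch_video(in_path, out_path, factor=4.0):
    # slow the video down to factor x its original duration;
    # "-an" drops the audio track, which would otherwise go out of sync
    subprocess.run(["ffmpeg", "-i", in_path, "-filter:v",
                    "setpts={}*PTS".format(factor), "-an", out_path],
                   check=True)

For the second option, resizing the extracted features: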
# Resize a video feature sequence to a specific temporal scale.
import numpy as np
from scipy.interpolate import interp1d

# a direct implementation
def resizePoolFeature(data, feature_save_file, feature_dim=200,
                      temporal_scale=512, num_sample=3):
    # first linearly interpolate the sequence to length num_sample * temporal_scale
    upsampled_len = temporal_scale * num_sample
    original_size = len(data)
    if original_size == 1:
        # a single feature vector: just repeat it temporal_scale times
        data = np.reshape(data, [-1])
        np.save(feature_save_file, np.stack([data] * temporal_scale))
        return
    x = np.arange(original_size)
    f = interp1d(x, data, axis=0)
    x_new = [i * float(original_size - 1) / (upsampled_len - 1)
             for i in range(upsampled_len)]
    y_new = f(x_new)
    # then mean-pool every num_sample consecutive vectors,
    # so the result has length temporal_scale
    result = np.zeros((temporal_scale, feature_dim))
    for i in range(temporal_scale):
        result[i] = np.mean(y_new[i * num_sample:(i + 1) * num_sample, :], axis=0)
    np.save(feature_save_file, result)

# a more complicated implementation
def poolData(data, video_frame, video_second, feature_save_file, sample_step=8,
             feature_dim=200, temporal_scale=512, num_sample=3, pool_type="mean"):
    # temporal_scale is the resized temporal scale
    # feature_dim is the dimension of the input feature
    # num_sample is the number of samples taken at each temporal location
    # pool_type is the pooling method, either "mean" or "max"

    # each feature vector corresponds to sample_step frames,
    # so the feature sequence covers sample_step * len(data) frames
    feature_frame = len(data) * sample_step

    # corrected_second is the duration (in seconds) covered by the feature sequence
    corrected_second = float(feature_frame) / video_frame * video_second
    fps = float(video_frame) / video_second
    st = sample_step / fps  # time step (in seconds) between feature vectors

    if len(data) == 1:
        # a single feature vector: just repeat it temporal_scale times
        video_feature = np.stack([data] * temporal_scale)
        video_feature = np.reshape(video_feature, [temporal_scale, feature_dim])
        np.save(feature_save_file, video_feature)
        return

    # x is the temporal location (in seconds) of each vector in the feature sequence
    x = [st / 2 + ii * st for ii in range(len(data))]
    f = interp1d(x, data, axis=0)

    video_feature = []
    zero_sample = np.zeros(feature_dim)
    tmp_anchor_xmin = [1.0 / temporal_scale * i for i in range(temporal_scale)]
    tmp_anchor_xmax = [1.0 / temporal_scale * i for i in range(1, temporal_scale + 1)]

    for idx in range(temporal_scale):
        # clamp each anchor interval to the valid interpolation range
        xmin = max(x[0] + 0.0001, tmp_anchor_xmin[idx] * corrected_second)
        xmax = min(x[-1] - 0.0001, tmp_anchor_xmax[idx] * corrected_second)

        # anchors entirely outside the feature range get zero features
        if xmax < x[0] or xmin > x[-1]:
            video_feature.append(zero_sample)
            continue

        # sample num_sample points evenly within the anchor interval
        plen = (xmax - xmin) / (num_sample - 1)
        x_new = [xmin + plen * ii for ii in range(num_sample)]
        y_new = f(x_new)

        if pool_type == "mean":
            y_new = np.mean(y_new, axis=0)
        elif pool_type == "max":
            y_new = np.max(y_new, axis=0)

        video_feature.append(y_new)
    video_feature = np.stack(video_feature)
    np.save(feature_save_file, video_feature)
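A minimal usage sketch (the file names and video metadata here are hypothetical; data is assumed to be a NumPy array of shape [length, feature_dim] loaded from a saved feature file):

feats = np.load("video_0001_feature.npy")  # hypothetical file, shape [length, 200]
# direct resize to temporal_scale=512
resizePoolFeature(feats, "video_0001_resized.npy")
# or the anchor-based version, given the source video's frame count and duration
poolData(feats, video_frame=240, video_second=8.0,
         feature_save_file="video_0001_pooled.npy", pool_type="mean")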
jalaxy33 commented 4 years ago

Thanks for your thoughtful help! I'll try them.

jalaxy33 commented 4 years ago

Thanks for your help. I successfully ran Decouple-SSAD on my own dataset last week. Sorry, I forgot to close this issue in time, so I am closing it now.

mrlihellohorld commented 3 years ago

> Thanks for your help. I successfully ran Decouple-SSAD on my own dataset last week. Sorry, I forgot to close this issue in time, so I am closing it now.

Hi, I have also run Decouple-SSAD on my own dataset. I want to recognize six categories, but the model only identified two of them. Those two action types last a long time (3s-20s), while the undetected actions generally last only about 1s. In the final evaluation, AP is computed for only those two classes. Have you ever had this problem?

HYPJUDY commented 3 years ago

Glad you solved it @jalaxy33! Hi @mrlihellohorld, that seems reasonable if the actions are too short to recognize. Could you try the two methods I mentioned above for dealing with short videos?