MCG-NJU / MOC-Detector

[ECCV 2020] Actions as Moving Points

Data processing #17

Closed nthhiep closed 3 years ago

nthhiep commented 4 years ago

I have some questions related to flip_test mode.

  1. In "normal_moc_det.py"/preprocess(), line 62, why do you convert the red channel of "flip_data". What does this mean? temp[:, :, 2] = 255 - temp[:, :, 2]

  2. In "normal_moc_det.py"/process() function, why don't you take the average of rgb_mov and rgb_mov_f (as well as flow_mov and flow_mov_f) like heatmap and wh output (lines 88,89, 100,101) ?

  3. rgb_output[1]['mov'], flow_output[1]['mov'] are computed for nothing?

It's the same for stream_moc_det.py. I hope to get your explanation. Thank you for your reply.

ArchiZX commented 4 years ago
  1. That is a channel specific to the Brox-flow images, not the red channel of an RGB frame (see the sketch after this list).

  2. I tried that, but it did not help.

  3. Yes.
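
To make points 1 and 2 concrete, here is a minimal numpy sketch of the idea. The helper names and the (C, H, W) output layout are assumptions of mine, as is the reading that channel 2 of a flow image stores the signed horizontal component; this is not the repository's exact code.

import numpy as np

def flip_flow_frame(flow_img):
    # Horizontally flip a 3-channel uint8 Brox-flow image whose channel 2
    # (assumed) holds the horizontal flow component. Mirroring the frame
    # reverses the horizontal motion, so that channel is inverted as well.
    flipped = flow_img[:, ::-1, :].copy()
    flipped[:, :, 2] = 255 - flipped[:, :, 2]
    return flipped

def fuse_flip_outputs(out, out_flipped):
    # Average the center heatmap ('hm') and box size ('wh') of the normal and
    # flipped passes; keep only the normal pass for 'mov', mirroring the
    # behaviour discussed above (a flipped 'mov' would also need its x offsets
    # negated before it could be averaged).
    return {
        'hm':  0.5 * (out['hm'] + np.flip(out_flipped['hm'], axis=-1)),
        'wh':  0.5 * (out['wh'] + np.flip(out_flipped['wh'], axis=-1)),
        'mov': out['mov'],
    }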

nthhiep commented 4 years ago

Thank you for your response. I have another question. In fact, the format of ground-truth tubes in "UCF101v2-GT.pkl" is as follows:


gttubes  = { 
         'parentfolder/videoname': {class: [
                  array([[frame,x1,y1,x2,y2],...,[frame,x1,y1,x2,y2]])
                  ...
                  array([[frame,x1,y1,x2,y2],...,[frame,x1,y1,x2,y2]])      ]}

         ...

         'parentfolder/videoname': {class: [
                  array([[frame,x1,y1,x2,y2],...,[frame,x1,y1,x2,y2]])
                  ...
                  array([[frame,x1,y1,x2,y2],...,[frame,x1,y1,x2,y2]])      ]}
}

Does this mean the datasets are single-object, i.e. each video contains only one action? And the class/identification of the tubes is the class of the video (the index of the parent folder's name)? In theory your model does multi-object tracking, but it is trained on single-object data?

What about the general case where a video contains multiple objects, or multiple actions of different types? For example: 1) a video with two people jumping, where the tube boxes of each person need to be identified and kept separate; 2) a video with one person jumping and one walking, where each needs to be classified correctly.

In that case, the class/identification exists only for the tubes, not for the videos? And each element of _gttubes would have to be a dictionary with multiple class entries, as follows?

gttubes  = { 
         'parentfolder/videoname': {  class: [
                  array([[frame,x1,y1,x2,y2],...,[frame,x1,y1,x2,y2]])
                  ...
                  array([[frame,x1,y1,x2,y2],...,[frame,x1,y1,x2,y2]]) ]

                                    class: [
                  array([[frame,x1,y1,x2,y2],...,[frame,x1,y1,x2,y2]])
                  ...
                  array([[frame,x1,y1,x2,y2],...,[frame,x1,y1,x2,y2]])]      
                                    ...}
         ...

         'parentfolder/videoname': {  class: [
                  array([[frame,x1,y1,x2,y2],...,[frame,x1,y1,x2,y2]])
                  ...
                  array([[frame,x1,y1,x2,y2],...,[frame,x1,y1,x2,y2]]) ]

                                    class: [
                  array([[frame,x1,y1,x2,y2],...,[frame,x1,y1,x2,y2]])
                  ...
                  array([[frame,x1,y1,x2,y2],...,[frame,x1,y1,x2,y2]])]      
                                    ...}
}

Many thanks,

nthhiep commented 4 years ago

Oh, so flow images are represented in an HSV-like format, where channel 0 encodes the direction and channel 2 the magnitude of the movement? Then, when we flip the images, we also have to flip the direction of the object's movement. Thanks for the information; I had forgotten that.
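
A quick numeric check of why an 8-bit flow channel is inverted with 255 - value when the frame is mirrored, assuming (my assumption, not confirmed in this thread) that the channel stores a signed flow component quantised around a neutral value of 128:

import numpy as np

def encode(v, scale=8.0):
    # hypothetical 8-bit encoding of a signed flow component, centred at 128
    return np.clip(np.round(v * scale + 128), 0, 255).astype(np.uint8)

v = np.array([-3.0, 0.0, 3.0])
p = encode(v)              # [104 128 152]
p_mirrored = encode(-v)    # [152 128 104]
print(255 - p)             # [151 127 103] -> within 1 of p_mirrored

So 255 - p reverses the sign of the encoded motion up to quantisation, which is exactly what a horizontal flip requires.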

ArchiZX commented 4 years ago

UCF101-24 is a multi-object dataset, but JHMDB-21 is a single-object dataset (see our GIFs).

From my observation, both datasets are single-action, as you say.

I don't know the generalization performance for multiple actions. And indeed, the community needs a new large-scale, non-atomic, multi-action/multi-object action detection dataset.

nthhiep commented 4 years ago

I checked UCF101v2-GT.pkl and found that UCF101-24 is not only a single-action but also a single-object dataset. In every video, only one object is annotated with boxes throughout the video (even though the video may contain many objects). So UCF101-24 is a single-object tracking dataset.

We have
len(self._gttubes[v]) = 1 for every video v in self._gttubes

The action tube can be interrupted, i.e. divided into several segments. For example:


'Basketball/v_Basketball_g18_c02': {0: [array([
       [  1., 161., 137., 222., 235.],
       [  2., 161., 137., 222., 235.],
       [  3., 161., 137., 222., 235.],
       [  4., 161., 137., 222., 235.],
       [  5., 161., 137., 222., 235.],
       [  6., 161., 137., 222., 235.],
       [  7., 161., 137., 222., 235.],
       [  8., 161., 137., 222., 235.],
       [  9., 161., 137., 222., 235.],
       [ 10., 162., 137., 223., 235.],
       [ 11., 162., 137., 223., 235.],
       [ 12., 163., 137., 224., 235.],
       [ 13., 163., 137., 224., 235.],
       [ 14., 163., 137., 224., 235.],
       [ 15., 163., 137., 224., 235.],
       [ 16., 163., 137., 224., 235.],
       [ 17., 163., 137., 224., 235.],
       [ 18., 163., 137., 224., 235.],
       [ 19., 163., 137., 224., 235.],
       [ 20., 163., 137., 224., 235.]], dtype=float32), array([[ 72., 163., 146., 219., 238.],
       [ 73., 163., 146., 219., 238.],
       [ 74., 163., 146., 219., 238.],
       [ 75., 163., 146., 219., 238.],
       [ 76., 163., 146., 219., 238.],
       [ 77., 163., 146., 219., 238.],
       [ 78., 163., 146., 219., 238.],
       [ 79., 163., 146., 219., 238.],
       [ 80., 163., 146., 219., 238.],
       [ 81., 163., 146., 219., 238.],
       [ 82., 163., 146., 219., 238.],
       [ 83., 163., 146., 219., 238.],
       [ 84., 163., 146., 219., 238.],
       [ 85., 163., 146., 219., 238.],
       [ 86., 163., 146., 219., 238.],
       [ 87., 163., 146., 219., 238.],
       [ 88., 163., 146., 219., 238.],
       [ 89., 163., 146., 219., 238.],
       [ 90., 163., 146., 219., 238.],
       [ 91., 163., 146., 219., 238.],
       [ 92., 163., 146., 219., 238.],
       [ 93., 163., 146., 219., 238.],
       [ 94., 163., 146., 219., 238.],
       [ 95., 163., 146., 219., 238.],
       [ 96., 163., 146., 219., 238.],
       [ 97., 163., 146., 219., 238.],
       [ 98., 163., 146., 219., 238.],
       [ 99., 163., 146., 219., 238.],
       [100., 163., 146., 219., 238.],
       [101., 163., 146., 219., 238.],
       [102., 163., 146., 219., 238.]], dtype=float32)]}

There are two tube segments in the "Basketball/v_Basketball_g18_c02" video, but the object in both tubes is the same. So UCF101-24 is a single-object tracking dataset.

ArchiZX commented 4 years ago

gttubes: a dictionary that contains the GT tubes for each video. Each entry is itself a dictionary that maps a label index to a list of tubes. A tube is a numpy array with nframes rows and 5 columns.

len(self._gttubes[v]) = 1 means single-action rather than single-object.

Also try checking len(self._gttubes[v][class_index]).

For example, len(pkl['gttubes']['Fencing/v_Fencing_g04_c03'][6]) ---> 4
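
For anyone who wants to reproduce this check, a small inspection snippet along these lines works (the file path and the pickle encoding are assumptions):

import pickle

with open('UCF101v2-GT.pkl', 'rb') as f:
    gt = pickle.load(f, encoding='latin1')

gttubes = gt['gttubes']
video = 'Fencing/v_Fencing_g04_c03'
print(len(gttubes[video]))  # number of action classes in this video (1 -> single-action)
for label, tubes in gttubes[video].items():
    # each tube is an (nframes, 5) array of [frame, x1, y1, x2, y2]
    print(label, len(tubes), [t.shape for t in tubes])  # len(tubes) > 1 -> several actors or tube segments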


nthhiep commented 4 years ago

Thank you very much for this example. I was wrong: I drew the boxes for Basketball/v_Basketball_g18_c02 and assumed it was the same for the other videos. Thanks again.

xjsxujingsong commented 2 years ago

Hi, I just found this issue. Can the proposed method handle multiple persons performing multiple actions in one frame, as in the AVA dataset?

yixuanli98 commented 2 years ago

Yes, the Center Branch uses the focal loss and can handle multi-label classification.
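
For context, the loss referred to here is typically the CenterNet-style focal loss applied to a per-class center heatmap: each class gets its own sigmoid heatmap channel, so several actors and several classes can produce peaks independently. The sketch below is the standard formulation written for illustration, not code copied from this repository.

import torch

def center_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    # pred: (B, K, H, W) sigmoid heatmaps, one channel per action class
    # gt:   (B, K, H, W) Gaussian-splatted ground truth, equal to 1.0 exactly at object centers
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = torch.log(pred + eps) * (1 - pred) ** alpha * pos
    neg_loss = torch.log(1 - pred + eps) * pred ** alpha * (1 - gt) ** beta * neg
    num_pos = pos.sum().clamp(min=1)
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos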