jamessmith90 opened this issue 4 years ago
maybe you can use some data from here https://research.google.com/ava/
@LukeAI I already have the dataset prepared. I just need the changes in the repo and training format.
Show some examples of your data. Are you classifying whole images, or detecting the locations of running/walking persons (possibly more than one per image)?
I have coordinates and frame numbers tagged with a category: walking, running, or standing.
@jamessmith90
You can try to use LSTM models, e.g.: https://github.com/AlexeyAB/darknet/files/3199770/yolo_v3_tiny_lstm.cfg.txt
Since yolo_v3_tiny_lstm.cfg.txt uses time_steps=16, the training images in train.txt should be placed in groups of 16 consecutive frames from the video showing the action (a person walking).
Also change this line:
start_time_indexes[i] = ((random_gen() % m) / 16) * 16;
and recompile: https://github.com/AlexeyAB/darknet/blob/5d13aad8879e1630145bb90208db518037d707a3/src/data.c#L55
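The modified expression snaps each randomly sampled start index down to a multiple of 16, so every training sequence begins on a 16-frame boundary and never straddles two videos in train.txt. A minimal Python sketch of that index math (the function name is mine; the integer division mirrors the C expression):

```python
import random

def aligned_start_indexes(num_train_images, batch, seq_len=16):
    """Pick one start index per batch item, snapped down to a
    multiple of seq_len -- the Python analogue of the C line
    ((random_gen() % m) / 16) * 16 with integer division."""
    return [(random.randrange(num_train_images) // seq_len) * seq_len
            for _ in range(batch)]

starts = aligned_start_indexes(num_train_images=1000, batch=8)
# Every start lands on a 16-frame boundary, so a sequence stays
# inside one 16-frame group (assuming each video contributes
# exactly 16 consecutive lines to train.txt).
assert all(s % 16 == 0 for s in starts)
```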
Maybe you should leave the first 8 images unmarked and mark only the last 8. This is only necessary if you want to distinguish whether a person is actually moving or has just frozen in a pose that resembles movement; in theory, this should become clear after 8 frames.
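Leaving the first 8 frames of each sequence unmarked can be scripted. A sketch, assuming train.txt lists the frames in order with 16 per sequence and the YOLO label .txt files sit next to the images (the standard darknet convention); blank_leading_labels is my name, not a darknet tool:

```python
from pathlib import Path

def blank_leading_labels(train_txt, seq_len=16, keep_last=8):
    """Empty the label files of the first (seq_len - keep_last)
    frames of every seq_len-frame group listed in train.txt,
    so only the last keep_last frames carry bounding boxes."""
    frames = Path(train_txt).read_text().split()
    for i, frame in enumerate(frames):
        if i % seq_len < seq_len - keep_last:
            # unmarked frame: empty label file, no boxes
            Path(frame).with_suffix(".txt").write_text("")
```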
More: https://github.com/AlexeyAB/darknet/issues/3114#issuecomment-494154586
For training:
Download https://pjreddie.com/media/files/yolov3-tiny.weights then run:
./darknet partial cfg/yolov3-tiny.cfg yolov3-tiny.weights yolov3-tiny.conv.14 14
You will get the pre-trained file yolov3-tiny.conv.14.
Then train as usual:
./darknet detector train data/obj.data yolo_v3_tiny_lstm.cfg.txt yolov3-tiny.conv.14 -map
Then train both models on the same train/valid dataset and compare mAP: is mAP higher with the LSTM model in your case?
If the LSTM model turns out to be better, I will add a param train_seq_frames_num=16 to the cfg file, so you will not need to change the source code.
What is the format of train.txt for 16 consecutive frames, and what is the format of the .txt label file for each image?
Everything is the same as usual.
train.txt
video_1_frame_1.jpg
video_1_frame_2.jpg
video_1_frame_3.jpg
....
video_1_frame_16.jpg
video_2_frame_1.jpg
video_2_frame_2.jpg
...
video_2_frame_16.jpg
....
Label files such as video_1_frame_1.txt are the same as usual: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
0 0.5 0.5 0.2 0.2
1 0.3 0.3 0.1 0.1
....
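The train.txt layout above can be generated automatically. A sketch (function and file names are illustrative) that writes each video's frames in whole groups of 16 consecutive frames and drops any trailing remainder, so no sequence straddles two videos:

```python
from pathlib import Path

def write_train_txt(video_frame_lists, out_path="train.txt", seq_len=16):
    """video_frame_lists: one ordered list of frame image paths
    per video. Keeps only whole seq_len-frame groups per video;
    a remainder shorter than seq_len is discarded."""
    lines = []
    for frames in video_frame_lists:
        usable = len(frames) - len(frames) % seq_len
        lines.extend(frames[:usable])
    Path(out_path).write_text("\n".join(lines) + "\n")
```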
I have added the model for training.
I have another doubt: UCF-101 (https://www.crcv.ucf.edu/data/UCF101.php) has 101 categories for action recognition. Can the same logic be used for video classification?
@AlexeyAB Is it necessary to use 16 frames? Can I use 12?
@jamessmith90 Yes, you can. The more frames, the better.
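If you do switch to 12 frames, the sequence length presumably has to be changed consistently in both places mentioned earlier in this thread: the time_steps value in the cfg, and the hard-coded 16 in the data.c line, plus the 16-frame grouping in train.txt. A sketch of the cfg side (the time_steps field name is taken from yolo_v3_tiny_lstm.cfg.txt; the placement under [net] is my assumption):

```
[net]
# number of consecutive frames per training sequence;
# keep this equal to the group size used in train.txt
# and to the constant in the modified data.c line
time_steps=12
```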
@AlexeyAB @jamessmith90 How do you run inference with a Yolo LSTM model on a video? Is it normal frame-by-frame processing, or should I feed 16 frames at a time?
Looking to recognize whether a person is walking or running. Can this be done using darknet? If yes, can you tell me what changes I need to make in the build, and what the format of the training dataset would be?