jamessmith90 opened this issue 4 years ago
maybe you can use some data from here https://research.google.com/ava/
@LukeAI I already have the dataset prepared. I just need the changes in the repo and training format.
Show some examples of your data. Are you classifying whole images, or detecting the locations of running/walking persons (possibly more than one per image)?
I have coordinates and frame numbers tagged with a category: walking, running, or standing.
@jamessmith90
You can try to use LSTM models, e.g.: https://github.com/AlexeyAB/darknet/files/3199770/yolo_v3_tiny_lstm.cfg.txt
Since yolo_v3_tiny_lstm.cfg.txt uses time_steps=16, the training images in train.txt should be placed in groups of 16 consecutive frames from the video showing the action (a person walking).
Also change this line:
start_time_indexes[i] = ((random_gen() % m) / 16) * 16;
and recompile: https://github.com/AlexeyAB/darknet/blob/5d13aad8879e1630145bb90208db518037d707a3/src/data.c#L55
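The modified expression snaps each randomly sampled start index down to a multiple of 16, so every training sequence begins on a 16-frame boundary and never straddles two videos in train.txt. A minimal Python sketch of that index math (the function name is mine; the integer division mirrors the C expression):

```python
import random

def aligned_start_indexes(num_train_images, batch, seq_len=16):
    """Pick one start index per batch item, snapped down to a
    multiple of seq_len -- the Python analogue of the C line
    ((random_gen() % m) / 16) * 16 with integer division."""
    return [(random.randrange(num_train_images) // seq_len) * seq_len
            for _ in range(batch)]

starts = aligned_start_indexes(num_train_images=1000, batch=8)
# Every start lands on a 16-frame boundary, so a sequence stays
# inside one 16-frame group (assuming each video contributes
# exactly 16 consecutive lines to train.txt).
assert all(s % 16 == 0 for s in starts)
```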
Maybe you should leave the first 8 images unmarked and mark only the last 8. This is only necessary if you want to distinguish whether a person is actually moving or has just frozen in a pose that resembles movement; in theory, this should become clear after 8 frames.
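Leaving the first 8 frames of each sequence unmarked can be scripted. A sketch, assuming train.txt lists the frames in order with 16 per sequence and the YOLO label .txt files sit next to the images (the standard darknet convention); blank_leading_labels is my name, not a darknet tool:

```python
from pathlib import Path

def blank_leading_labels(train_txt, seq_len=16, keep_last=8):
    """Empty the label files of the first (seq_len - keep_last)
    frames of every seq_len-frame group listed in train.txt,
    so only the last keep_last frames carry bounding boxes."""
    frames = Path(train_txt).read_text().split()
    for i, frame in enumerate(frames):
        if i % seq_len < seq_len - keep_last:
            # unmarked frame: empty label file, no boxes
            Path(frame).with_suffix(".txt").write_text("")
```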
More: https://github.com/AlexeyAB/darknet/issues/3114#issuecomment-494154586
For training:
Download https://pjreddie.com/media/files/yolov3-tiny.weights then run:
./darknet partial cfg/yolov3-tiny.cfg yolov3-tiny.weights yolov3-tiny.conv.14 14
You will get the pre-trained file yolov3-tiny.conv.14.
Then train as usual:
./darknet detector train data/obj.data yolo_v3_tiny_lstm.cfg.txt yolov3-tiny.conv.14 -map
Then train both models on the same train/valid dataset and compare mAP: is mAP higher with the LSTM model in your case?
If the LSTM model turns out to be better, I will add a param train_seq_frames_num=16 to the cfg file, so you will not need to change the source code.
What is the format of train.txt for 16 consecutive frames, and what is the format of the .txt label file for each image?
Everything is the same as usual.
train.txt
video_1_frame_1.jpg
video_1_frame_2.jpg
video_1_frame_3.jpg
....
video_1_frame_16.jpg
video_2_frame_1.jpg
video_2_frame_2.jpg
...
video_2_frame_16.jpg
....
Label files such as video_1_frame_1.txt are the same as usual: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
0 0.5 0.5 0.2 0.2
1 0.3 0.3 0.1 0.1
....
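The train.txt layout above can be generated automatically. A sketch (function and file names are illustrative) that writes each video's frames in whole groups of 16 consecutive frames and drops any trailing remainder, so no sequence straddles two videos:

```python
from pathlib import Path

def write_train_txt(video_frame_lists, out_path="train.txt", seq_len=16):
    """video_frame_lists: one ordered list of frame image paths
    per video. Keeps only whole seq_len-frame groups per video;
    a remainder shorter than seq_len is discarded."""
    lines = []
    for frames in video_frame_lists:
        usable = len(frames) - len(frames) % seq_len
        lines.extend(frames[:usable])
    Path(out_path).write_text("\n".join(lines) + "\n")
```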
I have added the model for training.
I have another doubt: UCF-101 (https://www.crcv.ucf.edu/data/UCF101.php) has 101 categories for action recognition. Can the same logic be used for video classification?
@AlexeyAB Is it necessary to use 16 frames? Can I use 12?
@jamessmith90 Yes, you can. The more frames, the better.
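If you do switch to 12 frames, the sequence length presumably has to be changed consistently in both places mentioned earlier in this thread: the time_steps value in the cfg, and the hard-coded 16 in the data.c line, plus the 16-frame grouping in train.txt. A sketch of the cfg side (the time_steps field name is taken from yolo_v3_tiny_lstm.cfg.txt; the placement under [net] is my assumption):

```
[net]
# number of consecutive frames per training sequence;
# keep this equal to the group size used in train.txt
# and to the constant in the modified data.c line
time_steps=12
```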
@AlexeyAB @jamessmith90 How do you run inference with a Yolo LSTM model on a video? Is it normal frame-by-frame processing, or should I feed 16 frames at a time?
Looking to recognize whether a person is walking or running. Can this be done using darknet? If yes, can you tell me what changes I need to make in the build, and what the format of the training dataset would be?