AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

ELI5: How does YOLO detect objects in the input image? #2389

Open OndoyManing opened 5 years ago

OndoyManing commented 5 years ago

For those of you who don't know, Explain Like I'm 5 (ELI5) is a famous subreddit where people will ask questions and the answers should be "understandable" by 5-year-olds.

How would you explain how YOLO detects the object(s) in an input image to someone who has no idea about YOLO or object detection in general?

PeterQuinn925 commented 5 years ago

It's a magic box that uses math. There are a bunch of great articles on neural nets, but you have to understand some math for it to be more than magic.

aniketvartak commented 5 years ago

Yes, it is a lot of math. The closest I have seen it explained (probably still not at a 5-year-old level, but close) is Prof. Ng's lecture on it: https://www.youtube.com/watch?v=3Pv66biqc1E

AlexeyAB commented 5 years ago

To put it in quite childish terms:

In general, a Deep Neural Network (DNN) just remembers all the images it saw during training, together with the coordinates and class_id of the objects, using optimal lossy compression, so it only remembers the most important details. That lets it predict objects that are within roughly ±15% (size/color/...) of the ones it saw in the training and augmented images.

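Trained-weights: YOLO uses convolutional layers, so you can think of the trained weights as:

  • 1st convolutional layer (with params size=3 filters=16) collects 16 puzzles, each consisting of 3x3 black or white points
  • 2nd convolutional layer (with params size=3 filters=32) collects 32 puzzles, each consisting of 3x3 collected puzzles from the 1st layer (filters from the 1st layer) ....
  • Last convolutional layer collects several puzzles (which look like the desired objects), each consisting of 3x3 puzzles from the previous layer, where each puzzle consists of 3x3 puzzles from the layer before that ....

Forward-inference: During detection, each convolutional layer just compares these puzzles with each place on the image and outputs the degree of coincidence (the predicted probability that this object is present).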

More about it: https://github.com/AlexeyAB/darknet/issues/796#issuecomment-388553709
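
To make the "compare the puzzle with each place on the image" step a bit more concrete, here is a minimal sketch in plain Python. It is not Darknet's code. The 5x5 image, the plus-shaped 3x3 puzzle, and the function name `match_scores` are all made up for illustration, and the score is a bare sum of products with no padding, bias, or activation.

```python
# Toy "puzzle matching": slide one 3x3 filter over a small image.
# Everything here (image, puzzle, function name) is hypothetical.

def match_scores(image, puzzle):
    """Return one score per window of `image` that the puzzle fits into.

    Each score is the sum of element-wise products between the window
    and the puzzle (plain cross-correlation, stride 1, no padding).
    """
    k = len(puzzle)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    scores = []
    for r in range(rows):
        row_scores = []
        for c in range(cols):
            s = 0
            for i in range(k):
                for j in range(k):
                    s += image[r + i][c + j] * puzzle[i][j]
            row_scores.append(s)
        scores.append(row_scores)
    return scores

# 5x5 black-and-white image (1 = white, 0 = black) containing a plus sign.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

# A "puzzle" that likes white on the plus shape and black in the corners.
puzzle = [
    [-1, 1, -1],
    [ 1, 1,  1],
    [-1, 1, -1],
]

scores = match_scores(image, puzzle)
best = max(max(row) for row in scores)
best_pos = [(r, c) for r in range(len(scores))
            for c in range(len(scores[0])) if scores[r][c] == best]

for row in scores:
    print(row)
print("best match at window", best_pos)
```

The highest score appears at the window that sits exactly on the plus sign, which is the "degree of coincidence" idea. In a real network this raw score still passes through further layers and activation functions before it is read as a probability.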

OndoyManing commented 5 years ago

It's a magic box that uses math. There are a bunch of great articles on neural nets, but you have to understand some math for it to be more than magic.

Love that! Hahaha I usually say "An input image goes through a network and poof! objects will be detected."

OndoyManing commented 5 years ago

To put it in quite childish terms:

In general, a Deep Neural Network (DNN) just remembers all the images it saw during training, together with the coordinates and class_id of the objects, using optimal lossy compression. That lets it predict objects that are within roughly ±15% (size/color/...) of the ones it saw in the training and augmented images.

Trained-weights: YOLO uses convolutional layers, so you can think of the trained weights as:

  • 1st convolutional layer (with params size=3 filters=16) collects 16 puzzles, each consisting of 3x3 black or white points
  • 2nd convolutional layer (with params size=3 filters=32) collects 32 puzzles, each consisting of 3x3 collected puzzles from the 1st layer (filters from the 1st layer) ....
  • Last convolutional layer collects several puzzles (which look like the desired objects), each consisting of 3x3 puzzles from the previous layer, where each puzzle consists of 3x3 puzzles from the layer before that ....
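
One way to see why "puzzles made of puzzles" end up looking like whole objects is to count how many input pixels a single puzzle can see after each layer (its receptive field). The sketch below uses the standard receptive-field recurrence. The function name `receptive_fields` and the layer lists are hypothetical examples, not a real YOLO cfg.

```python
# How much of the input image one "puzzle" covers after each layer.
# Layer lists below are illustrative assumptions, not an actual YOLO config.

def receptive_fields(layers):
    """layers: list of (kernel_size, stride) pairs, first layer first.

    Returns the receptive field (in input pixels, along one axis)
    after each layer.
    """
    rf = 1      # one input pixel sees itself
    jump = 1    # distance in input pixels between neighbouring outputs
    result = []
    for size, stride in layers:
        rf += (size - 1) * jump
        jump *= stride
        result.append(rf)
    return result

# Three stacked 3x3 convolutions with stride 1: 3x3 -> 5x5 -> 7x7 pixels.
stride1 = receptive_fields([(3, 1), (3, 1), (3, 1)])

# 3x3 conv, then a 2x2 stride-2 downsampling step, then another 3x3 conv.
with_downsampling = receptive_fields([(3, 1), (2, 2), (3, 1)])

print(stride1)
print(with_downsampling)
```

Each extra layer widens the patch of the image that one puzzle summarizes, and downsampling steps make it grow faster, so puzzles in the last layers can cover an entire object.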

Forward-inference: During detection, each convolutional layer just compares these puzzles with each place on the image and outputs the degree of coincidence (the predicted probability that this object is present).

More about it: #796 (comment)

Awesome! I think my nephew will be able to get the concept with that kind of explanation.