WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

How to interpret the yolo-tiny outputs? #1093

Open hugozanini opened 1 year ago

hugozanini commented 1 year ago

I converted the YOLOv7-tiny model to tensorflow.js, but I'm not able to interpret the outputs.

When I run an image through the model, I get a response of shape [1, 25200, 85]. Iterating over the 25200 rows, my understanding is that the first 4 items are the bounding-box coordinates of the detection, the fifth item is the detection confidence, and the next 80 items are the confidences for every class, as in the example below:


I read the code of export.py and utils/general.py to try to understand the non_max_suppression logic and how to interpret the predictions, but I didn't get it. I tried denormalizing the values using the original image shape and using the coordinates to get the center of the bounding box, but none of these approaches worked.

Is there any documentation I can refer to in order to interpret the predictions correctly?

Output example:

Array 0/25200: [ 3.660677909851074, 3.8960976600646973, 7.1445159912109375, 8.558195114135742, 0.000002291132886966807, 0.2992437183856964, 0.0032765434589236975, 0.02299974113702774, 0.002288553863763809, 0.0061205169185996056, 0.000405748636694625, 0.0007168060401454568, 0.006684356834739447, 0.010973624885082245, 0.010580179281532764, 0.0013355360133573413, 0.0024683668743819, 0.0015576096484437585, 0.016338417306542397, 0.06432975828647614, 0.002155845519155264, 0.002606399590149522, 0.008280608803033829, 0.024560092017054558, 0.011779602617025375, 0.008507341146469116, 0.0006727887666784227, 0.010439596138894558, 0.009805492125451565, 0.014551358297467232, 0.00901725422590971, 0.010406507179141045, 0.006617129780352116, 0.0035439676139503717, 0.005152086261659861, 0.020896468311548233, 0.006204261444509029, 0.04126130789518356, 0.027140766382217407, 0.003251225920394063, 0.0019718394614756107, 0.007059866562485695, 0.028940090909600258, 0.005898833740502596, 0.01423275750130415, 0.007057651877403259, 0.03938567265868187, 0.01166496705263853, 0.010686900466680527, 0.005906108301132917, 0.005354586057364941, 0.003930031321942806, 0.005226451903581619, 0.0004987830179743469, 0.007237072102725506, 0.01963111199438572, 0.006294747814536095, 0.0008835819317027926, 0.0004639460239559412, 0.0038057370111346245, 0.0016457928577437997, 0.0632367804646492, 0.0031223613768815994, 0.012071071192622185, 0.0007920170319266617, 0.0067767915315926075, 0.007115103304386139, 0.002724584424868226, 0.0012104857014492154, 0.001585118006914854, 0.0028675436042249203, 0.001451255171559751, 0.0055689564906060696, 0.0007458814070560038, 0.0007105154800228775, 0.000056244785810122266, 0.010288779623806477, 0.002680464880540967, 0.013829641975462437, 0.007938055321574211, 0.007399112917482853, 0.0017575552919879556, 0.0013826033100485802, 0.0002145568432752043, 0.0031385323964059353 ]
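
To make the layout above concrete, this is roughly how I expected a single row to be decoded (a minimal NumPy sketch; `output` stands for the [1, 25200, 85] prediction array, converted to NumPy just for illustration):

```python
import numpy as np

row = np.asarray(output)[0, 0]        # one of the 25200 rows, 85 values
cx, cy, w, h = row[0:4]               # box centre and size, in input-image pixels (e.g. 640x640)
objectness = row[4]                   # confidence that the box contains any object
class_probs = row[5:85]               # per-class probabilities (80 COCO classes)

class_id = int(np.argmax(class_probs))
score = objectness * class_probs[class_id]   # effective confidence of this detection

# centre/size -> corner coordinates, still in input-image pixels
x1, y1 = cx - w / 2, cy - h / 2
x2, y2 = cx + w / 2, cy + h / 2
```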

guillermoecn commented 1 year ago

The non_max_suppression function should return far fewer values; after that, use the scale_coords function to translate the coordinates from the model input shape (640x640, 416x416, etc.) back to your original image size.
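
In the PyTorch code that looks roughly like this (a sketch based on detect.py; the loaded `model`, the letterboxed input tensor `img`, and the original image `im0` are assumed to already exist):

```python
from utils.general import non_max_suppression, scale_coords

pred = model(img)[0]                                  # raw output, shape [1, 25200, 85]
pred = non_max_suppression(pred, conf_thres=0.25, iou_thres=0.45)

for det in pred:                                      # one tensor per image in the batch
    if len(det):
        # map boxes from the model input size back to the original image
        det[:, :4] = scale_coords(img.shape[2:], det[:, :4], im0.shape).round()
        for *xyxy, conf, cls in det:
            print(xyxy, float(conf), int(cls))        # x1, y1, x2, y2, score, class id
```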

You can read the predict.py or detect.py scripts and adapt the code to your needs, since they contain the full prediction pipeline.
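
If you can't call those helpers from tensorflow.js, the same post-processing can be rewritten from scratch and then ported. A minimal NumPy sketch of the idea (assuming the raw [25200, 85] output, a plain square resize without letterbox padding, and class-agnostic NMS for brevity; the function name and arguments are just illustrative):

```python
import numpy as np

def postprocess(pred, input_size, orig_w, orig_h, conf_thres=0.25, iou_thres=0.45):
    """pred: [25200, 85] array -> rows of [x1, y1, x2, y2, score, class_id] in original-image pixels."""
    scores = pred[:, 4:5] * pred[:, 5:]                # objectness * class probability
    class_ids = scores.argmax(axis=1)
    conf = scores.max(axis=1)
    keep = conf > conf_thres
    boxes, conf, class_ids = pred[keep, :4], conf[keep], class_ids[keep]

    # centre/size -> corner coordinates, still in input-image pixels
    xyxy = np.empty_like(boxes)
    xyxy[:, 0] = boxes[:, 0] - boxes[:, 2] / 2
    xyxy[:, 1] = boxes[:, 1] - boxes[:, 3] / 2
    xyxy[:, 2] = boxes[:, 0] + boxes[:, 2] / 2
    xyxy[:, 3] = boxes[:, 1] + boxes[:, 3] / 2

    # scale back to the original image size
    xyxy[:, [0, 2]] *= orig_w / input_size
    xyxy[:, [1, 3]] *= orig_h / input_size

    # greedy class-agnostic NMS (the repo's non_max_suppression does this per class)
    order = conf.argsort()[::-1]
    picked = []
    while order.size:
        i = order[0]
        picked.append(i)
        rest = order[1:]
        xx1 = np.maximum(xyxy[i, 0], xyxy[rest, 0])
        yy1 = np.maximum(xyxy[i, 1], xyxy[rest, 1])
        xx2 = np.minimum(xyxy[i, 2], xyxy[rest, 2])
        yy2 = np.minimum(xyxy[i, 3], xyxy[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (xyxy[i, 2] - xyxy[i, 0]) * (xyxy[i, 3] - xyxy[i, 1])
        area_r = (xyxy[rest, 2] - xyxy[rest, 0]) * (xyxy[rest, 3] - xyxy[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-7)
        order = rest[iou < iou_thres]

    return np.concatenate(
        [xyxy[picked], conf[picked, None], class_ids[picked, None]], axis=1)
```

Usage would be something like `detections = postprocess(np.asarray(output)[0], 640, original_width, original_height)`, and the same steps translate directly to tensor ops in tensorflow.js.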