WongKinYiu / yolor

Implementation of the paper "You Only Learn One Representation: Unified Network for Multiple Tasks" (https://arxiv.org/abs/2105.04206)

Error just after training is completed #136

Open neel04 opened 2 years ago

neel04 commented 2 years ago

Hi, thanks for such a great repo! 🤗 I wanted to train YOLOR on my own custom data. This is the command I am using in Colab:

#Set Epochs
EPOCHS = 30
BATCH_SIZE = 32
!python /content/yolor/train.py --batch-size $BATCH_SIZE --img 512 512 --data /content/yolor/data/coco.yaml --cfg /content/yolor/cfg/yolor_p6.cfg --weights '' --device 0 --name yolor_p6 --epochs $EPOCHS --adam --cache-images

However, just after training finishes, I get this error:

Using torch 1.7.0 CUDA:0 (Tesla P100-PCIE-16GB, 16280MB)

Namespace(adam=True, batch_size=16, bucket='', cache_images=False, cfg='/content/yolor/cfg/yolor_p6.cfg', data='/content/yolor/data/coco.yaml', device='0', epochs=50, evolve=False, exist_ok=False, global_rank=-1, hyp='./yolor/data/hyp.scratch.1280.yaml', image_weights=False, img_size=[512, 512], local_rank=-1, log_imgs=16, multi_scale=False, name='yolor_p6', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/yolor_p6', single_cls=False, sync_bn=False, total_batch_size=16, weights='', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.5, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Model Summary: 665 layers, 37265016 parameters, 37265016 gradients, 81.564040600 GFLOPS
Optimizer groups: 145 .bias, 145 conv.weight, 149 other

Scanning images: 100% 5400/5400 [00:00<00:00, 5509.28it/s]
Scanning labels /content/train_yolo/labels/val.cache3 (5400 found, 0 missing, 0 empty, 2 duplicate, for 5400 images): 5400it [00:00, 7996.85it/s]
Scanning images: 100% 301/301 [00:00<00:00, 3736.10it/s]
Scanning labels /content/val_yolo/labels/val.cache3 (301 found, 0 missing, 0 empty, 0 duplicate, for 301 images): 301it [00:00, 3673.14it/s]
NumExpr defaulting to 4 threads.
Images sizes do not match. This will causes images to be display incorrectly in the UI.
Image sizes 512 train, 512 test
Using 4 dataloader workers
Logging results to runs/train/yolor_p6
Starting training for 50 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
      0/49     9.66G   0.05845   0.05395  0.001509    0.1139        16       512: 100% 338/338 [02:54<00:00,  1.94it/s]

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
      1/49     9.79G   0.04292   0.04701  0.001316   0.09124        25       512: 100% 338/338 [02:36<00:00,  2.16it/s]

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
      2/49     9.79G   0.03832   0.04282   0.00132   0.08246        16       512: 100% 338/338 [02:30<00:00,  2.24it/s]

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
      3/49     9.79G    0.0347   0.03967  0.001317   0.07569        21       512: 100% 338/338 [02:28<00:00,  2.27it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 10/10 [00:05<00:00,  1.83it/s]
                 all         301         392       0.201       0.794       0.317        0.19

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
      4/49     6.29G   0.03113   0.03582  0.001313   0.06826        30       512: 100% 338/338 [02:28<00:00,  2.27it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 10/10 [00:03<00:00,  2.88it/s]
                 all         301         392       0.212       0.887       0.381       0.233

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
      5/49     6.29G   0.02992   0.03439  0.001319   0.06563        19       512: 100% 338/338 [02:28<00:00,  2.28it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 10/10 [00:03<00:00,  2.94it/s]
                 all         301         392       0.219       0.897       0.402       0.278

...........

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
     49/49     6.29G   0.01612   0.02269  0.001305   0.04012        18       512: 100% 338/338 [02:28<00:00,  2.27it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95:   0% 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/yolor/train.py", line 537, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "/content/yolor/train.py", line 344, in train
    log_imgs=opt.log_imgs if wandb else 0)
  File "/content/yolor/test.py", line 167, in test
    "domain": "pixel"} for *xyxy, conf, cls in pred.tolist()]
  File "/content/yolor/test.py", line 167, in <listcomp>
    "domain": "pixel"} for *xyxy, conf, cls in pred.tolist()]
TypeError: list indices must be integers or slices, not float

This is a sample of what the bounding-box labels in the dataset look like:

1 0.547852 0.704590 0.484375 0.590820

where 1 is presumably the class index.

Does anyone know what might be causing this error?

neel04 commented 2 years ago

I'm also getting this, strangely enough, after some modifications 🤔

FileNotFoundError: [Errno 2] No such file or directory: '/content/runs/train/yolor_p6/precision-recall_curve.png'

dripdropdr commented 2 years ago

Did you solve this? I have the same problem, too. :(

zggg1p commented 2 years ago

I have the same problem, too. :(

Timmimim commented 2 years ago

In case anybody still encounters the same issue:

I fixed the initial issue of float indices by casting cls to int where it is used as a list index (pred.tolist() returns plain Python floats, so cls arrives as e.g. 1.0 rather than 1):

[...]
"box_caption": "%s %.3f" % (names[int(cls)], conf),
[...]

Before moving on, while we are in the W&B logging part of test.py (lines 161-169): the names variable may need to be a dict for some versions of wandb (at least for me it raised an error otherwise). To head that off, edit the code to something like this:

# W&B logging
if plots and len(wandb_images) < log_imgs:
    box_data = [{"position": {"minX": xyxy[0], "minY": xyxy[1], "maxX": xyxy[2], "maxY": xyxy[3]},
                 "class_id": int(cls),
                 "box_caption": "%s %.3f" % (names[int(cls)], conf),
                 "scores": {"class_score": conf},
                 "domain": "pixel"} for *xyxy, conf, cls in pred.tolist()]
    # if necessary, build a dict keyed by list index, so it can be queried just like the list
    if isinstance(names, list):
        names_dict = {idx: val for idx, val in enumerate(names)}
        boxes = {"predictions": {"box_data": box_data, "class_labels": names_dict}}
    else:
        boxes = {"predictions": {"box_data": box_data, "class_labels": names}}
    wandb_images.append(wandb.Image(img[si], boxes=boxes, caption=path.name))

The second issue (FileNotFoundError: [Errno 2] No such file or directory: 'runs/train/<run_dir>/precision-recall_curve.png') is the result of no validation being performed when training runs for only a few epochs. This is controlled in train.py, line 336 (if epoch >= 3:). If the test() method from test.py has never been called, the relevant images are never created and thus do not exist. At least that was the case for me. I assume you reduced your number of epochs for test runs?
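
If you want to keep such short test runs anyway, one workaround is to make the final W&B image logging tolerant of missing plot files. This is only a sketch, not the repo's code: it assumes an active wandb run (train.py calls wandb.init for you) and uses the run directory and file names from the logs above.

from pathlib import Path
import wandb  # assumes a wandb run is already active

save_dir = Path("runs/train/yolor_p6")  # run directory from the log above
files = ["precision-recall_curve.png", "results.png"]
existing = [f for f in files if (save_dir / f).exists()]  # skip plots that were never created
if existing:
    wandb.log({"Results": [wandb.Image(str(save_dir / f), caption=f) for f in existing]})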

I encountered several more issues down the road. I had compatibility issues with PyTorch v1.12, which were easily resolved thanks to the code provided in #270.

I also had to adjust the number of classes for my custom data and, consequently, the number of filters in several layers of the architecture, as described in the respective <architecture>.cfg files. Examples can be found in #16 and #251; a quick sanity check for the new filter count is sketched right below.
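
Purely for orientation (the exact .cfg entries to touch are spelled out in those issues): Darknet-style configs follow the convention filters = (classes + 5) * anchors_per_head for the detection-head layers, with 3 anchors per [yolo] head in the stock files. A tiny, illustrative check:

# Illustrative only: compute the detection-head filter count for a custom class count,
# following the usual Darknet convention filters = (classes + 5) * anchors_per_head.
num_classes = 4          # example value; replace with your dataset's class count
anchors_per_head = 3     # the stock yolor configs use 3 anchors per [yolo] head
filters = (num_classes + 5) * anchors_per_head
print(f"set classes={num_classes} and filters={filters} in the detection head")  # -> filters=27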

Finally, there was another issue in utils/plots.py that kept me busy, which might also be a compatibility problem with PyTorch v1.12. I kept getting (seemingly illogical) errors about a list-type object that was somehow a CUDA tensor but should not have been. Somewhere under the hood, some data is not properly converted: in the output_to_target() method (lines 89-108), the target variable is not a simple list but a CUDA tensor (or contains CUDA tensors), and these MUST be moved to CPU memory. So I ended up editing the _tensor.py file of my PyTorch installation. The Tensor class has an __array__() method used for implicit type casting (lines 753-761 in my installation); I added the following ahead of the if-clauses handling the two possible return statements:

if self.is_cuda:
    self = self.cpu()

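A less invasive alternative to patching PyTorch itself, untested here and assuming output_to_target() receives a list of per-image prediction tensors (which is what the symptom suggests), would be to move those tensors to the CPU at the top of that function instead:

import torch

def _ensure_cpu(o):
    # Per-image predictions may be CUDA tensors; NumPy cannot convert those directly.
    return o.detach().cpu() if isinstance(o, torch.Tensor) else o

# Hypothetical placement, first line inside output_to_target(output, width, height):
# output = [_ensure_cpu(o) for o in output]
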
I hope that covers all issues you might have. I thought it would be good to write a small summary of my problems today, so others won't have to waste half a day. Have a good one! :)