JDSobek / MedYOLO

A 3D bounding box detection model for medical data.
GNU Affero General Public License v3.0

Training process issue #7

Closed Jerry7j closed 8 months ago

Jerry7j commented 9 months ago

Dear JDSobek,

First, thanks for your help with my data processing.

I am running MedYOLO now, but something may be wrong. I have run 50 epochs, but P, R, and mAP are always zero.

```
 Epoch   gpu_mem       box       obj       cls    labels  img_size
51/499     17.9G     0.204 0.0006677         0         2       350: 100%|███████████████████████████████████████████████████████| 54/54 [04:43<00:00,  5.25s/it]
           Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|███████████████████████████████████████| 58/58 [01:06<00:00,  1.15s/it]
             all        230          0          0          0          0          0
```

Here are my example.yaml settings and the preset constants in train.py.

```yaml
# example.yaml
nc: 1  # number of classes
names: ['L']  # class names
```

```python
# train.py
LOCAL_RANK = int(os.getenv('LOCAL_RANK', 0))
RANK = int(os.getenv('RANK', 0))
WORLD_SIZE = 4
default_size = 350
default_epochs = 500
default_batch = 4
```

One note: my dataset's maskvalue is 3, but there is only one class in my data. Should I set the maskvalue to 1? Also, when I first ran train.py I hit a problem at: https://github.com/JDSobek/MedYOLO/blob/e734b9afefd0341a490e2179eba98221d40bce64/train.py#L169

After commenting out that line I was able to run train.py.

And when I used the temporary weights with detect, no result .txt was produced. Is this related to the original data's maskvalue? Or should I remake the dataset so it contains all three classes 1, 2, 3?

Best, Jerry

JDSobek commented 9 months ago

Following YOLOv5, the first class in every dataset is class 0, not class 1.

What I did for my single-category tasks was set the class value to 1 in the labels (change mask_value to 1 on this line), and in the data.yaml file I would set nc=2 with a dud for class 0, with class 1 set to whatever class I was interested in.

E.g. for BraTS whole-tumor detection I would have:

```yaml
nc: 2
names: ['NA', 'Tumor']
```

You could instead set the class value as 0 in the txt labels and everything from there should behave how you'd expect. You'll just have to remember to set the value appropriately if/when you use the predictions for downstream tasks.

Setting nc=4 and having dud classes for 0, 1, 2 should also work, but it's going to be generally less bothersome for you if you set the class values in your label to 0 or 1 for single category tasks.
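If you go the remapping route, a small script along these lines can rewrite the class field in existing label files (a sketch: it assumes YOLO-style plain-text labels where each line starts with an integer class ID followed by the normalized box coordinates; the directory path is a placeholder):

```python
from pathlib import Path

def remap_class(label_dir: str, old_id: int, new_id: int) -> None:
    """Rewrite the leading class ID on each line of every label .txt file."""
    for txt in Path(label_dir).glob("*.txt"):
        lines = []
        for line in txt.read_text().splitlines():
            parts = line.split()
            if parts and int(parts[0]) == old_id:
                parts[0] = str(new_id)  # only the class field changes
            lines.append(" ".join(parts))
        txt.write_text("\n".join(lines) + "\n")

# e.g. remap_class("labels/train", old_id=3, new_id=1)
```

Running it once over the train and val label folders is much faster than regenerating the dataset from the masks.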

Jerry7j commented 9 months ago

I really appreciate your suggestions; they cover all the situations I'm running into. I will try setting the class value to 0 instead, and also try nc=4 with dud classes for 0, 1, 2.

Merry Christmas!

Jerry7j commented 9 months ago

I have tried both setups: the class value set to 1 with nc=2 and a dud for class 0, and the class value set to 0 with nc=4 and dud classes for 0, 1, 2. But P, R, and mAP are still always 0.

```
  Epoch   gpu_mem       box       obj       cls    labels  img_size
186/199     5.11G    0.2674   0.08249    0.0259         1       350: 100%|██████████| 1/1 [00:02<00:00,  2.54s/it]
             Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████| 1/1 [00:00<00:00,  1.06it/s]
               all          1          0          0          0          0          0
```

I'm trying other data.

JDSobek commented 9 months ago

Hmm. How many training/validation examples do you have? Above it looked like there were around 200, now it looks like there's only one.

Also, can you give me more details on what your dataset is? The spinal column from earlier or something else?

My suspicion is there either aren't enough examples or something is still wrong with the labels.

Jerry7j commented 9 months ago

Yeah, they are the spinal column. The earlier dataset has about 2000 examples; for this run I only used 10 for training and validation. I will go back to the 2000 tomorrow, and if it still doesn't work I will check the labels. Thanks!

JDSobek commented 9 months ago

Ok. One of my trials included vertebral disc detection and it did OK but not great, so it should be able to find a signal for the spinal column.

You should check your labels first, since it's a lot faster to remake 10 labels than 2000. If those are fine, the problem is most likely that you just aren't using enough data. For scratch training it really needs a few hundred examples (~300 was the smallest number I used for training) over several hundred epochs (the fastest run ended around epoch 700, IIRC).

The framework uses a one-cycle training policy, so the relationship between epochs, batches, and learning rate isn't as straightforward as with some other LR schedules. Sometimes it takes a while during the warm-up period before the model starts figuring out what it's looking for, so it's important that it has enough opportunities to update the weights early on, while the LR is high.
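For illustration, the YOLOv5-style cosine "one-cycle" decay has roughly this shape (a sketch, not MedYOLO's exact code, and warm-up is omitted; `lr0` is the initial learning rate and `lrf` the final LR as a fraction of `lr0` — names assumed from YOLOv5's hyperparameters):

```python
import math

def one_cycle_lr(epoch: int, epochs: int, lr0: float = 0.01, lrf: float = 0.1) -> float:
    """Cosine one-cycle decay from lr0 down to lr0 * lrf over the full run."""
    frac = (1 - math.cos(epoch * math.pi / epochs)) / 2  # goes 0 -> 1 over the run
    return lr0 * (frac * (lrf - 1) + 1)
```

The decay is slow at the start and end and fastest in the middle, which is why cutting a run short (e.g. 200 epochs instead of 1000) leaves the model training at a much higher LR than the schedule intends.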

Jerry7j commented 9 months ago

Thank you for such detailed suggestions! I'm checking the labels first, then I'll use a few hundred examples for training.

Jerry7j commented 9 months ago

Hi Sobek, thanks for your guidance. I got a result from a dataset of about 2000 examples using the yolom model, and it seems OK. Of the 4 test volumes I used, one result .txt was returned (with a confidence of 0.469739). Here is the output from the mask_maker.

I'm going to use the yolol model next, but my GPUs (4x RTX 4090) may not be enough: with the yolom model and batch size 4, GPU memory is already full. I'll set the batch size to 2 or 1 for yolol.

And I want to ask: will changing the spacing of the original data (e.g. 1.5mm -> 3mm or 6mm) for training affect the result?

JDSobek commented 9 months ago

> And I want to ask: will changing the spacing of the original data (e.g. 1.5mm -> 3mm or 6mm) for training affect the result?

I didn't have much opportunity to test different slice spacings. If I had to guess: if the overall image "looks" mostly the same to you, it shouldn't change the result much, because the volume is reshaped into a fixed-size cube before being fed to the model. If the quality or other image characteristics are very different, that might affect the result.
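To illustrate why spacing may matter less than appearance: the resize to a fixed cube can be sketched like this (a nearest-neighbor NumPy sketch, not MedYOLO's actual resampling code; `size` stands in for the model's input edge length):

```python
import numpy as np

def resize_to_cube(vol: np.ndarray, size: int = 350) -> np.ndarray:
    """Nearest-neighbor resample of a 3D volume to a size^3 cube."""
    # For each axis, pick the nearest source index for each output position.
    idx = [np.clip((np.arange(size) * dim / size).astype(int), 0, dim - 1)
           for dim in vol.shape]
    return vol[np.ix_(idx[0], idx[1], idx[2])]
```

Halving or doubling the slice spacing mostly changes how many source slices map to each output slice; as long as the anatomy still looks the same after this resize, the model sees a similar input.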

I think the main thing that will give you better results right now is letting the model train for longer. My training runs were for 1000 epochs (the default setup) and the best results came from fairly late in the training process (epochs 700+). I'd recommend you try yolo3Ds and let it train for 1000 epochs (or until early stopping activates) with whatever batch size you can fit.

Jerry7j commented 9 months ago

Yeah, the cube reshaping should give a similar result. I have set the training to 1000 epochs.

The batch size is 4, but that only works with the yolo3Ds and yolo3Dm models; yolo3Dl gives me a CUDA out-of-memory error. And if I set the batch size smaller than 4, it isn't valid for my GPUs (4x RTX 4090): https://github.com/JDSobek/MedYOLO/blob/e734b9afefd0341a490e2179eba98221d40bce64/train.py#L419

JDSobek commented 9 months ago

How was the result?

Jerry7j commented 8 months ago

Because of my GPU memory, I can only use the yolom model. The result is nice! The first time, I built the data from a small region that contained only the vertebral column; that was my mistake. I have since rebuilt the data from the whole-body CTs.

This is a great project. Thank you so much for the guidance!

JDSobek commented 8 months ago

Thank you. I'll close the issue then.

There is one last thing you can try. The model3D yaml files let you customize the size of the models using the depth_multiple and width_multiple parameters. YOLOv5 has a "nano" model with width_multiple: 0.25, as opposed to the small model with width_multiple: 0.5. You could create a MedYOLO nano model just by copying the small model yaml and changing that parameter's value, or create a size between the medium and large models by modifying those two values (although the large model refuses to converge for some datasets so your mileage may vary).
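Concretely, a hypothetical "nano" variant would just copy the small model's yaml and change the width multiplier (the values below follow YOLOv5's nano settings; the rest of the file stays identical to the small model's):

```yaml
# yolo3Dn.yaml (hypothetical) -- copied from the small model yaml
depth_multiple: 0.33  # model depth multiple (same as the small model)
width_multiple: 0.25  # layer channel multiple (the small model uses 0.50)
```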

I haven't tested anything smaller than the small model, so I can't attest to performance or how much VRAM it'll actually save. I'm pretty sure the code will still run though.