JDSobek / MedYOLO

A 3D bounding box detection model for medical data.
GNU Affero General Public License v3.0

How to set epochs, batch size, and models #17

Open smje opened 1 month ago

smje commented 1 month ago

I am planning to create a model to detect eyes, nose, mouth, and ears in 3D fMRI data for the purpose of defacing. I intend to train this model using MedYOLO. My dataset consists of 250 nii.gz files per cohort (a total of 1000 files), and I do not have label data. I created label data for only one image per cohort using ITK-SNAP, resulting in a total of 4 label data files, which I then copied to match each cohort. However, the training is not going well with this approach. Due to memory issues, I reduced the dataset to 500 training files, used 100 epochs, and set the batch size to 4. How can I improve this process? Additionally, the 3D data sizes vary across cohorts (there are 4 different sizes).

[Screenshot attached: 2024-07-11, 2:42 PM]
JDSobek commented 1 month ago

I would use all the available data you have that isn't reserved for a test set. 100 epochs is almost certainly too few; even the small model typically needs several hundred to optimize, so the 1000-epoch default is a good starting point. I would also definitely create more labels: reusing the same label for different examples isn't likely to give you good results unless you are extremely lucky.

Make sure you change the normalization to use the MR normalization (which normalizes each volume with its mean and standard deviation) instead of the CT normalization (which just standardizes a wide window around fixed Hounsfield unit values). If another strategy is better for normalizing fMRI data, you may want to implement that.
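Roughly, the two strategies differ like this (a minimal sketch for illustration; the function names are mine, not the repo's, and the repo's exact window values may differ):

```python
import numpy as np

def normalize_ct(volume: np.ndarray, window=(-1024.0, 1024.0)) -> np.ndarray:
    """CT-style: clip to a wide, fixed Hounsfield-unit window and rescale to [0, 1]."""
    lo, hi = window
    vol = np.clip(volume.astype(np.float32), lo, hi)
    return (vol - lo) / (hi - lo)

def normalize_mr(volume: np.ndarray) -> np.ndarray:
    """MR-style: z-score each volume using its own mean and standard deviation."""
    vol = volume.astype(np.float32)
    return (vol - vol.mean()) / (vol.std() + 1e-8)
```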

How much does the data size vary? Volumes with very few slices tend to give poor results (false negatives) when most of the dataset has many more slices. All my data has had a fixed axial resolution, though, so I'm not sure how well the model tolerates discrepancies there.

smje commented 1 month ago
[screenshot attached]

Thank you so much for your response. This shows my data sizes and labeling. Is there any part of your model that I could modify to improve training?

JDSobek commented 1 month ago

When you run the model you can add --norm MR to the command line, which will change the normalization. You can also use --norm other which will raise an error that will point to the line in the code where you can implement your own normalization if you need to do something more sophisticated. See: Command line parameter and Section of code to modify

I think your data looks fine size-wise.

Overall the problems are probably in part from not letting the model train long enough, since it is training from scratch, and the labeling process not providing accurate enough labels. Training longer should help, but generating high-quality labels for more of your datasets is probably the most impactful thing you can do to improve performance. You may also want to double-check your existing labels by converting some of them to masks and making sure they still line up with your ROIs in the original nifti files.
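As a rough sketch of that check (assuming a label line of class followed by three normalized center coordinates and three normalized extents, in the same axis order as the NIfTI array; verify the exact ordering against the repo's label spec):

```python
import numpy as np
import nibabel as nib

def label_to_mask(label_path, nifti_path, out_path):
    """Convert a MedYOLO-style text label back into a binary NIfTI mask for visual QC."""
    img = nib.load(nifti_path)
    shape = np.array(img.shape[:3])
    mask = np.zeros(shape, dtype=np.uint8)
    for line in open(label_path):
        vals = [float(v) for v in line.split()]
        cls = int(vals[0])
        center, size = np.array(vals[1:4]), np.array(vals[4:7])
        lo = np.clip(((center - size / 2) * shape).astype(int), 0, shape - 1)
        hi = np.clip(((center + size / 2) * shape).astype(int), 0, shape - 1)
        mask[lo[0]:hi[0] + 1, lo[1]:hi[1] + 1, lo[2]:hi[2] + 1] = cls + 1
    nib.save(nib.Nifti1Image(mask, img.affine), out_path)

# Overlay the output on the original scan (e.g. in ITK-SNAP) and check the boxes
# still cover the ROIs you intended.
```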

smje commented 1 month ago

I used 1500 images from 4 cohorts as training data, but since there were no labels in the training data, I manually labeled one image from each cohort and then unified the labels across all images in each cohort for training. Is this approach not valid?

Here's the error I'm encountering:

```
autoanchor: Analyzing anchors...
Metric calculation: k shape torch.Size([18, 3]), dwh shape torch.Size([4590, 3])
anchors/target = 0.58, Best Possible Recall (BPR) = 0.1532. Attempting to improve anchors, please wait...
niftianchors: Running kmeans for 18 anchors on 4590 points...
autoanchor: ERROR: niftianchors: ERROR: scipy.cluster.vq.kmeans requested 18 points but returned only 9
Metric calculation: k shape torch.Size([3, 6, 3]), dwh shape torch.Size([4590, 3])
autoanchor: Original anchors better than new anchors. Proceeding with original anchors.
```

The training proceeds, but the precision and recall values are both 0.

I think the clustering isn't working well due to a lack of diverse labeling. What can I do?

JDSobek commented 1 month ago

I used 1500 images from 4 cohorts as training data, but since there were no labels in the training data, I manually labeled one image from each cohort and then unified the labels across all images in each cohort for training. Is this approach not valid?

No, I don't think this approach will give you good results. You're feeding the model very inaccurate labels for roughly 99.9% of your dataset, which won't optimize it for predicting accurate labels on new images.

You don't need super-precise labels for MedYOLO; an accurate bounding box is described by the outer edges of each ROI in each direction. To make it fast, I just mark the outermost X and Y values on the top slice, the bottom slice, and (depending on how many slices the nifti has) every 3-10 slices in between. You could use even fewer points with a little more care. Being a few voxels off with these marks doesn't really matter, so you can generate bounding box masks very quickly. Then I use a simple script to find the most extreme x/y/z values for each class and convert those into my labels.
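That kind of script can be quite short. A sketch of the idea (assuming class IDs 1, 2, 3, ... in the rough mask and a label format of class followed by normalized center and extent; check the repo's docs for the exact ordering):

```python
import numpy as np
import nibabel as nib

def rough_mask_to_label(mask_path, out_txt):
    """Turn a rough ITK-SNAP mask (a few marked voxels per class is enough)
    into a MedYOLO-style label file: one line per class."""
    mask = nib.load(mask_path).get_fdata()
    shape = np.array(mask.shape[:3], dtype=float)
    lines = []
    for cls in sorted(int(c) for c in np.unique(mask) if c > 0):
        idx = np.argwhere(mask == cls)                 # every marked voxel for this class
        lo, hi = idx.min(axis=0), idx.max(axis=0)      # most extreme x/y/z values
        center = (lo + hi + 1) / 2.0 / shape           # normalized box center
        size = (hi - lo + 1) / shape                   # normalized box extent
        lines.append(f"{cls - 1} " + " ".join(f"{v:.6f}" for v in [*center, *size]))
    with open(out_txt, "w") as f:
        f.write("\n".join(lines) + "\n")
```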

You'll be much better off quickly generating 100 manual labels for each cohort and training the model on just the labeled images than you will be trying to use 1 label per cohort on the whole dataset.

Start with that, and see if the clustering problem still exists. The lack of diverse labeling might be why clustering isn't working, but it is probably also going to cause many other issues.

smje commented 1 month ago

I am planning to introduce and present a MedYOLO paper. In the yolo3Ds.yaml file, the model uses 2D convolutional layers by default. I am curious about how this was modified to a 3D architecture. I am having trouble understanding this from the code and would like to know more about the network structure of the model.

JDSobek commented 1 month ago

See the definition for Conv. The nomenclature follows YOLOv5, and in YOLOv5 Conv uses 2D layers, but for MedYOLO Conv has been redefined to use 3D layers. Off the top of my head I don't think there are any 2D layers in this model, only 3D or dimension-free layers.
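The gist of that redefinition, as a minimal sketch (not the repo's exact code), is that YOLOv5's Conv block of Conv2d + BatchNorm2d + activation becomes the 3D equivalents:

```python
import torch.nn as nn

class Conv(nn.Module):
    """YOLOv5-style Conv block re-expressed with 3D layers (illustrative sketch)."""
    def __init__(self, c_in, c_out, k=1, s=1, p=None):
        super().__init__()
        p = k // 2 if p is None else p            # 'same'-style padding for odd kernels
        self.conv = nn.Conv3d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm3d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):                         # x: (batch, channels, D, H, W)
        return self.act(self.bn(self.conv(x)))
```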

The publication (both the article and the pre-print) linked in the readme shows some architectural diagrams for the model and most of the components. I'm pretty sure these don't really change regardless of which version/size of the model you use. I don't have a diagram for the Detect module though. The code for that module does a lot of math and is obtuse in a way that doesn't nicely translate into a diagram.

How the code was modified to be 3D... I basically started from scratch writing train.py while walking through YOLOv5's train.py, looked at each step of the process, figured out what it did, what it needed, and what it returned, checked whether it required a change in dimensionality, and then made any changes required, filling out the supporting code as it was called. Then I went through val.py and detect.py and did the same thing. I skipped some features though, like W&B and the logging that ties into it, and obviously the code to download weights on account of not having a good dataset for pretrained weights nor a good way to host any weights for download.

I wrote some code that would let me use YOLOv5 with 3D nifti datasets before starting this project, so I already had a bit of an understanding of the codebase. Overall if you understand how YOLOv5 works (which should be easier, there should be a lot more documentation out there about it), then you essentially understand how this model works. I (with some amount of effort) tried to keep them as close to exactly the same as possible despite the transition from 2D to 3D.

smje commented 1 month ago

[4 screenshots attached]

Are my yaml file and labeling wrong?

JDSobek commented 1 month ago

From what I can see the label and file organization for the test set looks correct. Is your training set organized the same way? How many files are within your validation set?

I don't typically use - characters in my filenames. I don't think MedYOLO splits filenames on any characters other than . characters, but I remember encountering issues with - characters long ago, long enough ago that I can't remember what the problem was. And if all of your filenames contain -, it's strange that it still seems to have found at least 3 files.

Can you show me the filenames for the other 2 sets? Can you try renaming the files to have _ instead of - characters? Also have you tried deleting train.cache and val.cache to force it to search for the files again? The output from the file search process would be helpful.
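If renaming by hand is tedious, something like this would do it (directory names here are just placeholders for wherever your images and labels live):

```python
from pathlib import Path

# Replace '-' with '_' in every filename under the listed folders.
for d in [Path("images/train"), Path("images/val"), Path("labels/train"), Path("labels/val")]:
    for f in d.glob("*"):
        if "-" in f.name:
            f.rename(f.with_name(f.name.replace("-", "_")))

# Remember to delete train.cache and val.cache afterwards so the file search reruns.
```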

smje commented 1 month ago

[3 screenshots attached]

The issue with auto anchor remains unresolved. Even after modifying the code and successfully training, the error persists, and the P and R values continue to be zero.

smje commented 1 month ago

[screenshot attached]

I was able to proceed with training after normalizing the labeled values that exceeded 1. Thank you.

JDSobek commented 1 month ago

The issue with autoanchor looks like it's because it's not seeing enough labeled examples. The code will by default calculate 18 anchors, but it looks like it's only seeing 15 distinct bounding boxes in your dataset. There are a few lines of output above the screenshot you posted that would be informative for understanding why that is.

It's a little weird that label values exceeding 1 create an issue. The code should be able to generate predictions that are centered from -0.5 to 1.5, but I didn't have any examples that extend outside the limits of my NIfTIs, so maybe the predictions can go that far but the repo doesn't like training labels that go that far. I don't think I know of an appropriate dataset to test that unfortunately.

smje commented 1 month ago

Thank you for your reply.

There was an issue with the anchor grid, so I tried setting weights_only to True, but that raised an error, so I left it as it was. Is it okay to leave it that way?

Also, I split the data into train, val, and test sets with a ratio of 0.7:0.15:0.15 and ran a test, but when I opened the labels folder it was empty. I plan to do a bit more labeling and retrain; what do you think the outcome will be?

[2 screenshots attached]

smje commented 1 month ago

I am curious about which tools or modules you used to train the model, apply it to the test dataset to obtain predictions in MedYOLO format, and then reapply these prediction labels to the 3D data.

smje commented 1 month ago

[screenshot attached]

Thank you so much for your kind responses to my continuous questions. After applying the model to the test data, I found that multi-object detection often fails and there are many cases where the eye area is detected twice. What should I improve? Is increasing the amount of data through augmentation and additional labeling the best solution?

JDSobek commented 1 month ago

I have never seen that error with the anchor grid. Have you made any changes to the anchor code? I thought you said earlier that you made some changes; if so, perhaps those contributed to this error. Otherwise, if you could post the full stack trace, that might help diagnose it.

If MedYOLO detects objects when you run detect.py, it will print the detected bounding boxes. The run you show looks like it's not detecting anything. This could be because the training set is too small, the train and test datasets are not representative of each other, or the confidence threshold is too high for the model's detections. IIRC, when I had similar issues I first tried lowering the confidence threshold, but the ultimate problem was that my training dataset was too small. Small datasets are a problem in multiple ways: they don't capture much of the problem space and give the model less information to learn from, and each epoch gets fewer chances to update the weights, which means even running for many epochs can leave the model relatively unoptimized.

In my experience MedYOLO really needs a few hundred training examples for the default training parameters to have good results. Unfortunately the lack of native, efficient options for augmentation of 3D data makes it a bit hard to do heavy live augmentation. If you want to experiment with additional live augmentation, this is the part of the code where you would add it. If Torch 2.x brings us some more efficient options I'll probably try to add more.
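If you do experiment there, one cheap option is a random flip of the volume with the matching box coordinate mirrored. The sketch below assumes labels are stored as a class ID followed by three normalized center coordinates and three normalized extents, in the same order as the tensor's spatial dimensions; adjust the indices to the repo's actual layout:

```python
import random
import torch

def random_flip_3d(volume: torch.Tensor, labels: torch.Tensor, p: float = 0.5):
    """volume: (C, D, H, W); labels: (N, 7) rows of [class, c1, c2, c3, s1, s2, s3],
    normalized to [0, 1]. Flips one spatial axis and mirrors that center coordinate."""
    if random.random() < p:
        axis = random.randint(0, 2)                      # 0, 1, or 2 -> D, H, or W
        volume = torch.flip(volume, dims=[axis + 1])     # +1 skips the channel dim
        labels = labels.clone()
        labels[:, 1 + axis] = 1.0 - labels[:, 1 + axis]  # mirror the matching center
    return volume, labels
```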

I believe the default parameters for train.py are what I used in every test, although I usually run it with --adam to use the Adam optimizer. This has the code I use to convert the text output of MedYOLO into NIfTI masks. I usually use that for double-checking my training labels, so if you use --save-conf on your detect.py runs there will be a bug when you generate masks with those functions (or vice versa; it looks like one function expects the confidence value and the other doesn't), but it should be a simple fix.

Generally the more data you can feed the model the better (I think we needed at least 60-80 unaugmented training examples, preferably with additional augmentation). There are a few ways to manage how many detections you get from the model once it's actually detecting your ROIs. Setting --conf-thresh is one way to make it more/less selective. There's also a --max-det setting that will limit how many detections can be made for each class. I think my strategy was to let the model make as many predictions as it wanted, but since the organs I was testing could only have 1 per scan I would only use the highest confidence prediction for each class as the bounding box in downstream tasks. Functionally that's equivalent to setting --max-det 1, but has more transparency when trying to figure out what the model might be getting stuck on.
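For that downstream selection step, something along these lines works on the text files detect.py writes with --save-txt --save-conf (this assumes the class ID is the first value and the confidence is the last value on each line; adjust if your output differs):

```python
from collections import defaultdict

def best_boxes_per_class(pred_txt, keep_per_class=None):
    """Keep only the highest-confidence prediction(s) for each class.
    keep_per_class maps class ID -> how many boxes to keep (default 1),
    e.g. {"0": 1, "1": 2} for at most 1 nose and 2 eyes."""
    by_class = defaultdict(list)
    for line in open(pred_txt):
        vals = line.split()
        by_class[vals[0]].append((float(vals[-1]), line.strip()))
    kept = []
    for cls, preds in by_class.items():
        n = (keep_per_class or {}).get(cls, 1)
        kept.extend(p for _, p in sorted(preds, reverse=True)[:n])
    return kept
```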

smje commented 2 weeks ago

After splitting the data into train and validation sets and training the model, how can I apply the trained model to real data using detect.py and evaluate its performance?

JDSobek commented 2 weeks ago

The most basic command line call for detect.py looks like this:

python detect.py --source /path_to_images/ --weights /path_to_model_weights/model_weights.pt --device 0 --save-txt

where /path_to_images/ is a folder containing the niftis you want to run inference on, and /path_to_model_weights/ is wherever you have your model weights stored. By default, when you do a training run, the results and weights get saved into a folder that looks like /runs/train/exp??. I usually copy best.pt out of the run I want to use and rename it to something more informative like facial_features_Medyolo.pt.

--device isn't strictly needed but if you have multiple GPUs that's how you pick which one it will run on... I still haven't set up DDP so there's no point in giving it more than one.

--save-txt tells the model to save the predictions it generates into .txt files. If you want to do a test run to see if the model is detecting anything you can omit this, but if you want to save the labels for later use it. Any other argument that has action='store_true' (e.g. --half) is used like this.

detect.py has several other parameters that you might need to change or want to turn on.

Make sure --imgsz is similar to what you used when you trained your model. detect.py will use the nearest multiple of 32 to what you set (e.g. I use 350 but it reshapes to 352), but that hasn't caused me any issues so far.

Also make sure --norm corresponds to what you used to train your model.

--conf-thresh and --iou-thresh determine which predictions pass the cut-off to be considered positive. I haven't needed to touch these since getting the framework working, but if your model isn't performing well, changing these values can force it to make low-confidence predictions to at least give you some idea of what's going on. --conf-thresh is probably the more useful one for this; --iou-thresh matters more when the boxes for objects may overlap heavily.

--max-det determines how many predictions the model will make for each image. This has been most useful for me when I want to predict only 1 bounding box for narrowing down ROIs for downstream tasks.

--save-conf is the other side of the coin from --max-det I suppose. Setting this will save the confidence level of the model with each bounding box prediction. Since you can't say, for example, the images have at most 1 nose and at most 2 eyes using --max-det, you would instead have the model save out all its predictions in a txt file and then in a downstream script use only the highest confidence nose prediction and 2 highest confidence eye predictions (--iou-thresh should make sure these don't overlap).

--classes is what you use if your model is trained on many classes but you only want predictions for some of those classes.

I don't think I've ever used --agnostic-nms. Assuming YOLOv5's implementation works this probably works, but I couldn't tell you when to use it.

--project is the folder you want the script to save the txt prediction files in. By default this will be in /runs/detect/exp??

--name sets exp above to something else.

--half is whether you want to use half-precision inference. I don't remember using this with the model, but it may help resolve memory issues... though inference requires far less VRAM than training. I suppose you should set it if you used it during training.
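Putting several of these together, a fuller call might look like the line below (flag spellings as described above, with placeholder paths; check detect.py --help for the exact names):

python detect.py --source /path_to_test_niftis/ --weights /path_to_model_weights/facial_features_Medyolo.pt --device 0 --imgsz 350 --norm MR --save-txt --save-conf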

smje commented 4 days ago

[screenshot attached]

It's certain that there is an issue with the labels, but after thoroughly checking all of them, they don't seem to differ significantly from the labels used when the code runs correctly. All six values are present, and they are normalized. However, the labels for the data that runs successfully are for MRI data registered to the dimensions 181x217x181, so I normalized them using those dimensions. This time, the labels are for data in its native space, meaning the sizes are not consistent, so I had to retrieve each MRI's size manually and normalize the labels accordingly. Despite multiple attempts, this issue keeps occurring. I would appreciate any suggestions or insights, even small ones.

JDSobek commented 3 days ago

I think there's still a problem with the anchors, not necessarily with the labels. I have an idea: this problem with kmeans not returning enough anchors showed up in one of your earlier posts, so I think it might be a property of the type of data you're using, but we can reduce the number of anchors kmeans is trying to find by changing the number of anchors in the model.yaml file. Maybe if we try to find 15 or 12, or even 9, anchors instead of 18, the anchor generation code will work and not cause errors in the later code.

Can you tell me which model.yaml file you're using or copy/paste the contents of the file into your response? It's fairly easy to edit the number of anchors we're trying to find, but it would probably be good for me to show you an example.
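For orientation while waiting on that: in a YOLOv5-style model yaml the anchors section has one row per detection layer, and in MedYOLO each anchor is a triplet of sizes rather than a pair, so 3 rows of 6 triplets gives the 18 anchors seen in the autoanchor output above. Deleting one triplet from every row would drop the total to 15, two would give 12, and so on. The values below are placeholders purely to show the shape of the edit; keep the numbers from your own yolo3Ds.yaml:

```yaml
anchors:
  # one row per detection layer; each group of three numbers is one anchor's size
  - [10,10,10,  16,16,16,  23,23,23,  30,30,30,  40,40,40,  50,50,50]
  - [60,60,60,  75,75,75,  90,90,90,  105,105,105,  120,120,120,  135,135,135]
  - [150,150,150,  170,170,170,  190,190,190,  210,210,210,  230,230,230,  250,250,250]
```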