Open TinhinaneChekai opened 1 month ago
Hello @TinhinaneChekai, this may be related to the limited size of the training set. At the end of each epoch, this implementation runs a quick evaluation of the RPN's performance on data from the training and test sets. The number of elements used for this evaluation in each set is given by the EVALUATION_STEPS parameter in the config (configs/rpn/scp_rpn_config.json). In the default config file this value is 200, so I suggest reducing it to a value that suits your training and test sets (for instance 1, or even 0). Let me know if the bug persists.
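As a rough illustration of the adjustment described above, the following sketch caps EVALUATION_STEPS at the size of the smallest subset. The helper name `clamp_evaluation_steps` is hypothetical and not part of the repository, and the sketch assumes that one evaluation step consumes one example.

```python
import json


def clamp_evaluation_steps(config, n_train, n_test):
    """Return a copy of `config` whose EVALUATION_STEPS fits both subsets.

    Hypothetical helper: assumes one evaluation step reads one example.
    """
    fixed = dict(config)
    limit = min(n_train, n_test)
    current = config.get("EVALUATION_STEPS", 0)
    fixed["EVALUATION_STEPS"] = max(0, min(current, limit))
    return fixed


# Default value quoted in the thread (200) with a very small toy dataset.
config = {"EVALUATION_STEPS": 200}
print(json.dumps(clamp_evaluation_steps(config, n_train=3, n_test=1)))
```

In practice the same value would simply be edited by hand in configs/rpn/scp_rpn_config.json.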
Hello @gdavid57, Thank you for the quick reply! Indeed, changing the config parameter EVALUATION_STEPS to 1 resolved the initial issue. However, upon proceeding to the next step, I encountered another problem:

C:\Users\tinhinane.chekai\3d-mask-r-cnn>docker run -it --gpus "0" --volume C:\Users\tinhinane.chekai\3d-mask-r-cnn:/workspace gdavid57/3d-mask-r-cnn python -m main --task "TARGET_GENERATION" --config_path "configs/targeting/scp_target_config.json"
Using TensorFlow backend.
Training dataset is loaded.
Validation dataset is loaded.
TARGET GENERATION FOR train DATASET...
0it [00:00, ?it/s]
TARGET GENERATION FOR test DATASET...
0it [00:00, ?it/s]

It appears that the datasets were not loaded, and the target elements were not created. Thank you in advance for your assistance! Cheers
@TinhinaneChekai Given the size of the training pool, it is normal that the target generation for the test subset produces nothing (see the TARGET_RATIO parameter in the scp_target_config.json), but it should at least produce the ground truth targets for one example of the train subset.
Some questions:
1- are the data/scp_target directory, and its subdirectories, created? Are they empty?
2- do the files data/scp_target/datasets/train.csv and test.csv exist? The test.csv should only contain the CSV header, while the train.csv should additionally contain one example.
3- can you show me what the console prints when: you add a print(n) between lines 1801 and 1802 of core/models.py, and when you add a print(save_path) at line 1822 (same file).
For your test, I would also recommend setting TARGET_RATIO to 1.0 in configs/targeting/scp_target_config.json.
@gdavid57 Thank you so much for the assistance. To answer your first and second questions: yes, the directories were created, as well as the CSV files, but they were empty. I tried setting TARGET_RATIO and ROI_POSITIVE_RATIO to 1, and that solved the issue.
Now, I am training my own volume images, and everything went well, but at the end of the MRCNN_EVALUATION, the model did not detect anything:
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
1/1 Example name: 000004.tiff , Nb of inst.: 221 , mAP: 0.0 , Precision: 0.0 , Recall: 0.0 , Mean IoU: nan
instance_nb 221.0
map-50 0.0
precision-50 0.0
recall-50 0.0
iou-50 NaN
dtype: float64
-->Could this be because I am only testing 4 volumes? Also, is there a way to do the inference step on larger datasets (larger than 128x128x128 or 256x256x256)? I would appreciate your assistance. Cheers!
@gdavid57 I also adjusted the MAX_GT_INSTANCES to 300 since I have 221 instances.
@TinhinaneChekai Did the RPN evaluation stage during training give good results? The RPN detection score itself can be a limiting factor.
If your images contain more than 200 objects, I recommend that you take inspiration from the morphogenesis branch config files, particularly for the following parameters:
POST_NMS_ROIS_TRAINING: 1500
POST_NMS_ROIS_INFERENCE: 700 (or 1500 during the RPN training)
MAX_GT_INSTANCES: max number of ground truth instances in your images
TRAIN_ROIS_PER_IMAGE: 200
ROI_POSITIVE_RATIO: 0.33
DETECTION_MAX_INSTANCES: same as MAX_GT_INSTANCES
The ROI_POSITIVE_RATIO should not be 1.0 or the Mask R-CNN's Head won't learn to distinguish background from instances. The value 0.33 appears to be well-balanced.
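To make the role of ROI_POSITIVE_RATIO concrete, here is a minimal sketch of how a per-image ROI budget might split into positive (instance) and negative (background) samples. It assumes the positive count is TRAIN_ROIS_PER_IMAGE multiplied by ROI_POSITIVE_RATIO, as in common Mask R-CNN implementations; the exact rule in this repository may differ, and `roi_budget` is a hypothetical name.

```python
def roi_budget(train_rois_per_image, roi_positive_ratio):
    """Split the ROI budget into (positives, negatives).

    Simplified assumption: positives = floor(budget * ratio), and the
    remaining ROIs are background samples.
    """
    positives = int(train_rois_per_image * roi_positive_ratio)
    negatives = train_rois_per_image - positives
    return positives, negatives


for ratio in (0.33, 1.0):
    pos, neg = roi_budget(200, ratio)
    print(f"ROI_POSITIVE_RATIO={ratio}: {pos} positive, {neg} negative ROIs")
```

With a ratio of 1.0 the budget leaves no room for background ROIs, which is consistent with the warning above that the Head would then see no negative examples.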
What is the size of the training set? And the size of the images you're using?
In this implementation, the inference image shape must be the same as the training shape. The code could be changed to allow resizing the input during inference. You won't be able to predict on bigger images without resizing, because the learned weights depend on the training input shape.
@gdavid57 Below is a print of the RPN evaluation step:
Epoch 1/20
3/3 [==============================] - 680s 227s/step - loss: 1.0773
TRAIN SUBSET CLASS: 0.17591211 +/- 0.0 BBOX: 0.39138418 +/- 0.0 Mean Coordinate Error: 6.666666666666667 Detection score: 2.7149321266968327
TEST SUBSET CLASS: 0.12097141 +/- 0.0 BBOX: 0.7996556 +/- 0.0 Mean Coordinate Error: 9.333333333333334 Detection score: 0.45248868778280543
Epoch 2/20
3/3 [==============================] - 1s 326ms/step - loss: 0.8761
TRAIN SUBSET CLASS: 0.1688648 +/- 0.0 BBOX: 0.46877432 +/- 0.0 Mean Coordinate Error: 6.198412698412699 Detection score: 9.502262443438914
TEST SUBSET CLASS: 0.25516158 +/- 0.0 BBOX: 0.7494315 +/- 0.0 Mean Coordinate Error: 7.4907407407407405 Detection score: 8.144796380090497
Epoch 3/20
3/3 [==============================] - 1s 329ms/step - loss: 0.7405
TRAIN SUBSET CLASS: 0.14578626 +/- 0.0 BBOX: 0.331089 +/- 0.0 Mean Coordinate Error: 6.046296296296297 Detection score: 8.144796380090497
TEST SUBSET CLASS: 0.08000159 +/- 0.0 BBOX: 0.7742873 +/- 0.0 Mean Coordinate Error: 8.06060606060606 Detection score: 4.97737556561086
Epoch 4/20
3/3 [==============================] - 1s 325ms/step - loss: 0.6837
TRAIN SUBSET CLASS: 0.15971111 +/- 0.0 BBOX: 0.36959246 +/- 0.0 Mean Coordinate Error: 6.87962962962963 Detection score: 8.144796380090497
TEST SUBSET CLASS: 0.18048552 +/- 0.0 BBOX: 0.73191047 +/- 0.0 Mean Coordinate Error: 8.408333333333333 Detection score: 9.049773755656108
Epoch 5/20
3/3 [==============================] - 1s 320ms/step - loss: 0.7796
TRAIN SUBSET CLASS: 0.10003009 +/- 0.0 BBOX: 0.46600047 +/- 0.0 Mean Coordinate Error: 6.807017543859649 Detection score: 8.597285067873303
TEST SUBSET CLASS: 0.14494346 +/- 0.0 BBOX: 0.65259796 +/- 0.0 Mean Coordinate Error: 7.8 Detection score: 6.787330316742081
Epoch 6/20
3/3 [==============================] - 1s 327ms/step - loss: 0.5977
TRAIN SUBSET CLASS: 0.20398928 +/- 0.0 BBOX: 0.29318213 +/- 0.0 Mean Coordinate Error: 5.813725490196078 Detection score: 7.6923076923076925
TEST SUBSET CLASS: 0.12911122 +/- 0.0 BBOX: 0.7951696 +/- 0.0 Mean Coordinate Error: 8.523809523809524 Detection score: 6.334841628959276
Epoch 7/20
3/3 [==============================] - 1s 319ms/step - loss: 0.6513
TRAIN SUBSET CLASS: 0.1704017 +/- 0.0 BBOX: 0.32870704 +/- 0.0 Mean Coordinate Error: 6.046296296296297 Detection score: 8.144796380090497
TEST SUBSET CLASS: 0.21582654 +/- 0.0 BBOX: 0.74627215 +/- 0.0 Mean Coordinate Error: 7.907407407407407 Detection score: 8.144796380090497
Epoch 8/20
3/3 [==============================] - 1s 350ms/step - loss: 0.6447
TRAIN SUBSET CLASS: 0.12302536 +/- 0.0 BBOX: 0.2742539 +/- 0.0 Mean Coordinate Error: 6.086956521739131 Detection score: 10.407239819004525
TEST SUBSET CLASS: 0.15827654 +/- 0.0 BBOX: 0.6142196 +/- 0.0 Mean Coordinate Error: 8.222222222222221 Detection score: 8.144796380090497
Epoch 9/20
3/3 [==============================] - 1s 340ms/step - loss: 0.5417
TRAIN SUBSET CLASS: 0.09913535 +/- 0.0 BBOX: 0.2536567 +/- 0.0 Mean Coordinate Error: 6.2894736842105265 Detection score: 8.597285067873303
TEST SUBSET CLASS: 0.12607735 +/- 0.0 BBOX: 0.5604667 +/- 0.0 Mean Coordinate Error: 8.134920634920634 Detection score: 9.502262443438914
Epoch 10/20
3/3 [==============================] - 1s 343ms/step - loss: 0.4339
TRAIN SUBSET CLASS: 0.12459881 +/- 0.0 BBOX: 0.55658585 +/- 0.0 Mean Coordinate Error: 6.907407407407407 Detection score: 8.144796380090497
TEST SUBSET CLASS: 0.11567438 +/- 0.0 BBOX: 0.5071953 +/- 0.0 Mean Coordinate Error: 7.5 Detection score: 10.85972850678733
Epoch 11/20
3/3 [==============================] - 1s 331ms/step - loss: 0.6007
TRAIN SUBSET CLASS: 0.16256522 +/- 0.0 BBOX: 0.43256217 +/- 0.0 Mean Coordinate Error: 6.181818181818182 Detection score: 9.95475113122172
TEST SUBSET CLASS: 0.1472944 +/- 0.0 BBOX: 0.5031514 +/- 0.0 Mean Coordinate Error: 6.956521739130435 Detection score: 10.407239819004525
Epoch 12/20
3/3 [==============================] - 1s 325ms/step - loss: 0.6027
TRAIN SUBSET CLASS: 0.17128782 +/- 0.0 BBOX: 0.28978264 +/- 0.0 Mean Coordinate Error: 6.5964912280701755 Detection score: 8.597285067873303
TEST SUBSET CLASS: 0.08558462 +/- 0.0 BBOX: 0.4808194 +/- 0.0 Mean Coordinate Error: 7.1419753086419755 Detection score: 12.217194570135746
Epoch 13/20
3/3 [==============================] - 1s 328ms/step - loss: 0.5497
TRAIN SUBSET CLASS: 0.15374395 +/- 0.0 BBOX: 0.27516133 +/- 0.0 Mean Coordinate Error: 6.454545454545454 Detection score: 9.95475113122172
TEST SUBSET CLASS: 0.17221825 +/- 0.0 BBOX: 0.5755338 +/- 0.0 Mean Coordinate Error: 7.136363636363637 Detection score: 9.95475113122172
Epoch 14/20
3/3 [==============================] - 1s 337ms/step - loss: 0.5009
TRAIN SUBSET CLASS: 0.15034553 +/- 0.0 BBOX: 0.19508827 +/- 0.0 Mean Coordinate Error: 5.674603174603175 Detection score: 9.502262443438914
TEST SUBSET CLASS: 0.08754537 +/- 0.0 BBOX: 0.4194887 +/- 0.0 Mean Coordinate Error: 6.992753623188406 Detection score: 10.407239819004525
Epoch 15/20
3/3 [==============================] - 1s 330ms/step - loss: 0.3680
TRAIN SUBSET CLASS: 0.07706507 +/- 0.0 BBOX: 0.15356563 +/- 0.0 Mean Coordinate Error: 5.882352941176471 Detection score: 7.6923076923076925
TEST SUBSET CLASS: 0.0531558 +/- 0.0 BBOX: 0.39129162 +/- 0.0 Mean Coordinate Error: 7.583333333333333 Detection score: 11.764705882352942
Epoch 16/20
3/3 [==============================] - 1s 354ms/step - loss: 0.2821
TRAIN SUBSET CLASS: 0.03422639 +/- 0.0 BBOX: 0.25111577 +/- 0.0 Mean Coordinate Error: 6.333333333333333 Detection score: 9.95475113122172
TEST SUBSET CLASS: 0.101514086 +/- 0.0 BBOX: 0.328408 +/- 0.0 Mean Coordinate Error: 7.396825396825397 Detection score: 9.502262443438914
Epoch 17/20
3/3 [==============================] - 1s 328ms/step - loss: 0.3094
TRAIN SUBSET CLASS: 0.035112094 +/- 0.0 BBOX: 0.10316532 +/- 0.0 Mean Coordinate Error: 6.253623188405797 Detection score: 10.407239819004525
TEST SUBSET CLASS: 0.03255923 +/- 0.0 BBOX: 0.29526517 +/- 0.0 Mean Coordinate Error: 6.65 Detection score: 9.049773755656108
Epoch 18/20
3/3 [==============================] - 1s 336ms/step - loss: 0.1984
TRAIN SUBSET CLASS: 0.028368667 +/- 0.0 BBOX: 0.06045332 +/- 0.0 Mean Coordinate Error: 6.027777777777778 Detection score: 10.85972850678733
TEST SUBSET CLASS: 0.021655822 +/- 0.0 BBOX: 0.2823986 +/- 0.0 Mean Coordinate Error: 6.72463768115942 Detection score: 10.407239819004525
Epoch 19/20
3/3 [==============================] - 1s 339ms/step - loss: 0.2027
TRAIN SUBSET CLASS: 0.026995007 +/- 0.0 BBOX: 0.10250036 +/- 0.0 Mean Coordinate Error: 5.841269841269841 Detection score: 9.502262443438914
TEST SUBSET CLASS: 0.014510307 +/- 0.0 BBOX: 0.26509133 +/- 0.0 Mean Coordinate Error: 7.160256410256411 Detection score: 11.764705882352942
Epoch 20/20
3/3 [==============================] - 1s 319ms/step - loss: 0.1874
TRAIN SUBSET CLASS: 0.016757848 +/- 0.0 BBOX: 0.06115727 +/- 0.0 Mean Coordinate Error: 5.412698412698413 Detection score: 9.502262443438914
TEST SUBSET CLASS: 0.0534655 +/- 0.0 BBOX: 0.22588122 +/- 0.0 Mean Coordinate Error: 6.680555555555555 Detection score: 10.85972850678733
Noted for the config file of the morphogenesis branch and the ROI_POSITIVE_RATIO.
For now, I have downsampled my volume images to 128x128x128 to test the model, but the original shape is 2153x2153x2153. Is it possible to change the shape of the input data during training to obtain the adequate weights?
Cheers
@TinhinaneChekai Thanks for your reply.
In this case, the evaluation shows that the detection score at the end of the 20th epoch is 9.5% on the training set. This is not surprising, given that the amount of training data is too small to train a Mask R-CNN.
Given that the original size of your images is large, my advice for obtaining usable results would be to resize them to 1024x1024x1024 (or keep them as they are) and then extract 128x128x128 (or 256x256x256) patches to train the Mask R-CNN. These patches can also be augmented (see the data augmentation script in the morphogenesis branch) by a factor of 48. This gives access to 8x8x8 (or 4x4x4) x3 x48 training images, which should be more than enough to train this network.
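The patch count quoted above can be checked with a short sketch, which also shows one way to cut a volume into non-overlapping cubic patches with NumPy. The function names are hypothetical, and the factor 3 is read here as the number of training volumes, which is an assumption.

```python
import numpy as np


def count_patches(volume_size, patch_size):
    """Number of non-overlapping cubic patches in a cubic volume."""
    per_axis = volume_size // patch_size
    return per_axis ** 3


def extract_patches(volume, patch_size):
    """Yield (origin, patch) pairs for non-overlapping cubic patches."""
    d, h, w = volume.shape
    p = patch_size
    for z in range(0, d - p + 1, p):
        for y in range(0, h - p + 1, p):
            for x in range(0, w - p + 1, p):
                yield (z, y, x), volume[z:z + p, y:y + p, x:x + p]


n_volumes, augmentation = 3, 48  # factors quoted in the thread
for patch in (128, 256):
    total = count_patches(1024, patch) * n_volumes * augmentation
    print(f"{patch}^3 patches: {total} training images")

# Tiny stand-in volume to show the extraction itself.
toy = np.arange(8 * 8 * 8).reshape(8, 8, 8)
patches = list(extract_patches(toy, 4))
print(len(patches), patches[0][1].shape)
```

The 128x128x128 option yields eight times as many training images as the 256x256x256 option, at the cost of less context per patch.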
This resize+patch approach also makes it possible to work with fewer instances per image, which means that the Mask R-CNN can be trained much more quickly. The larger the IMAGE_SIZE, POST_NMS_ROIS_TRAINING or TRAIN_ROIS_PER_IMAGE parameters, the more time-consuming the training.
For prediction on a real image, all you have to do is (1) resize it to 1024x1024x1024, (2) predict over all the patches, (3) merge the results and (4) resize the merged result back to the original size.
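Steps (2) and (3) above could be sketched roughly as follows, using non-overlapping tiles and giving every instance found in a tile a fresh global label. This is only an illustration: `predict_patch` stands for whatever function returns an integer label map for one patch (it is not an API of this repository), `toy_predictor` is a stand-in for the trained network, and the resizing steps (1) and (4) are left out.

```python
import numpy as np


def predict_by_patches(volume, patch_size, predict_patch):
    """Tile `volume`, predict each tile, and merge the label maps.

    `predict_patch` is a hypothetical callable: tile -> integer label map
    of the same shape, with 0 as background.
    """
    merged = np.zeros(volume.shape, dtype=np.int64)
    next_label = 1
    d, h, w = volume.shape
    p = patch_size
    for z in range(0, d, p):
        for y in range(0, h, p):
            for x in range(0, w, p):
                tile = volume[z:z + p, y:y + p, x:x + p]
                labels = predict_patch(tile)
                region = merged[z:z + p, y:y + p, x:x + p]  # view into merged
                for inst in np.unique(labels):
                    if inst == 0:
                        continue  # skip background
                    region[labels == inst] = next_label
                    next_label += 1
    return merged


def toy_predictor(tile):
    """Stand-in for the network: one instance wherever the tile is non-zero."""
    return (tile > 0).astype(np.int64)


volume = np.zeros((4, 4, 4))
volume[0, 0, 0] = 1.0
volume[3, 3, 3] = 1.0
merged = predict_by_patches(volume, 2, toy_predictor)
print(np.unique(merged))
```

With this naive merge, an object that crosses a tile border is split into several labels, so a real pipeline would need an extra stitching step.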
This patch prediction is fairly common, and algorithms must already exist on other repositories.
@gdavid57
Thank you so much! I will follow your guidance. One more detail: my instance sizes range approximately from 25x25x25 to 700x700x700. My question is: when creating patches from a 700x700x700 instance, won't this lead to an overestimation of the number of instances when merging the patches back together?
Thank you again.
Cheers!
To be honest, I'm not sure that the Mask R-CNN will be able to segment objects with such different scales. I'm quite confident for the "little" instances, but it can be difficult for the largest ones, in particular with a patched approach. You can try of course, but you may also consider using two different detectors (at different resizing shapes), or looking for another network/method (like the Segment Anything Model?).
To answer your question: some merging algorithms work with overlapping patches, so they are able to produce consistent labels.
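As a small illustration of the overlapping-patch idea, the sketch below computes patch origins along one axis so that neighbouring patches share a margin and the last patch ends exactly at the volume border. It covers only the tiling, not the label-merging logic itself, and `patch_origins` is a hypothetical helper name.

```python
def patch_origins(size, patch, overlap):
    """Start indices of overlapping patches along one axis.

    Consecutive patches share at least `overlap` voxels; the last patch is
    shifted back if needed so that it ends at the border.
    """
    stride = patch - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the patch size")
    if size <= patch:
        return [0]
    origins = list(range(0, size - patch + 1, stride))
    if origins[-1] != size - patch:
        origins.append(size - patch)
    return origins


origins = patch_origins(1024, 128, 32)
print(len(origins), origins[0], origins[-1])
```

A merging algorithm can then compare the labels predicted in the shared margins to decide which pieces from neighbouring patches belong to the same object.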
@gdavid57
Thank you again for your assistance. I will try it with downsampled data to avoid using a patched approach. Yes, SAM seems like a good option if I can find the right 3D adaptation that works for my data. Thank you again for being so quick to respond! Cheers
@gdavid57
I have tested the model with additional datasets of size 128x128x128 and MAX_GT_INSTANCES set to 411. Everything went smoothly until I reached the Mask R-CNN evaluation, where I encountered the following error:
2024-10-14 10:29:11.543600: F tensorflow/stream_executor/cuda/cudadnn.cc:534] Check failed: cudnnSetTensorNdDescriptor(handle.get(), elem_type, nd, dims.data(), strides.data()) == CUDNN_STATUS_SUCCESS (9 vs. 0)batch_descriptor: {count: 411 feature_map_count: 256 spatial: 28 28 28 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX}.
Sorry for the multiple questions!
Cheers!
@TinhinaneChekai
Without more information, I would say it is either a memory allocation failure or an input size problem. What is the input size? Maybe 411 as MAX_GT_INSTANCES is too much. Difficult to say.
@gdavid57 Thank you for your response. The input size is 128x128x128. I am using an NVIDIA RTX A5500 GPU with 22.5 GB of memory. Below is the output before the crash. Cheers!
@TinhinaneChekai Did you use another value of MAX_GT_INSTANCES during training?
No not at all, I have always kept MAX_GT_INSTANCES=411
Maybe its value is too large when dealing with both RPN and Head. I used the Mask R-CNN on 256x256x256 input and I couldn't use more than 371 for MAX_GT_INSTANCES and DETECTION_MAX_INSTANCES on a 32GB V100. I would say it is a memory issue.
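A back-of-the-envelope calculation may help here. The batch_descriptor in the error message reports count 411, 256 feature maps and a 28x28x28 spatial size. The sketch below computes the number of elements and the float32 footprint of one such tensor, and compares the element count with 2^31. The cuDNN documentation mentions a limit of 2 giga-elements per tensor, so the failure might be a tensor size limit rather than plain memory exhaustion, or both. Treat this as an assumption: it has not been verified against this code base, and float32 storage is assumed.

```python
TWO_GIGA_ELEMENTS = 2 ** 31  # element limit discussed above (assumption)


def mask_head_elements(n_rois, channels=256, spatial=(28, 28, 28)):
    """Element count of one (n_rois, channels, *spatial) tensor."""
    total = n_rois * channels
    for s in spatial:
        total *= s
    return total


for n_rois in (411, 371):
    elems = mask_head_elements(n_rois)
    gib = elems * 4 / 2 ** 30  # float32 = 4 bytes per element
    fits = elems < TWO_GIGA_ELEMENTS
    print(f"{n_rois} ROIs: {elems} elements, {gib:.1f} GiB, below 2^31: {fits}")
```

Under these assumptions, 371 instances stay below 2^31 elements while 411 do not, which is at least consistent with the limits reported in this thread.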
@gdavid57 Okay, I will test on fewer instances and see if the problem is resolved. I’ll keep you updated. Cheers!
Hello, I find this work very nice, and it could be of great help for my own work. I have tried to run the model using the generated toy data. Everything goes well until I launch the training, where I am faced with this error:
Epoch 1/20
9/9 [==============================] - 399s 44s/step - loss: 1.4886
TRAIN SUBSET
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/workspace/main.py", line 37, in <module>
    rpn.train()
  File "/workspace/core/models.py", line 1737, in train
    use_multiprocessing=True,
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 260, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks/callbacks.py", line 152, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/workspace/core/models.py", line 1432, in on_epoch_end
    rpn_evaluation(self.model, self.config, ["TRAIN SUBSET", "TEST SUBSET"], [self.train_dataset, self.test_dataset], self.check_boxes)
  File "/workspace/core/utils.py", line 725, in rpn_evaluation
    inputs, _ = generator.__getitem__(k)
  File "/workspace/core/data_generators.py", line 76, in __getitem__
    return self.data_generator(self.image_ids[idx * self.batch_size:(idx + 1) * self.batch_size])
  File "/workspace/core/data_generators.py", line 82, in data_generator
    image_id = image_ids[b]
IndexError: index 0 is out of bounds for axis 0 with size 0
Could it be because I am only testing 10 images? Or is there another adjustment that I should make?
Cheers
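The IndexError in the traceback can be reproduced in isolation. The sketch below assumes the generator slices its image_ids array as shown in the traceback, and that the evaluation loop requests a batch index beyond the end of a small dataset, for example because EVALUATION_STEPS is larger than the number of images. That cause is a guess based on the first answer in this thread. `get_batch_ids` is a hypothetical stand-in for the slicing done in `__getitem__`.

```python
import numpy as np


def get_batch_ids(image_ids, idx, batch_size):
    """Mimic the slice used by the generator's __getitem__ in the traceback."""
    return image_ids[idx * batch_size:(idx + 1) * batch_size]


image_ids = np.arange(10)  # a dataset of 10 images, as in the question

# A batch index within the dataset returns one id.
print(get_batch_ids(image_ids, 0, 1))

# A batch index beyond the end silently returns an empty array...
batch = get_batch_ids(image_ids, 200, 1)
print(batch.size)

# ...and reading its first element raises the IndexError seen above.
try:
    batch[0]
    raised = False
except IndexError as err:
    raised = True
    print("IndexError:", err)
```

If this is indeed the cause, lowering EVALUATION_STEPS in configs/rpn/scp_rpn_config.json to at most the number of images in the smallest subset should avoid the empty slice.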