AlexanderFrotscher / ANDi

Codebase for Unsupervised Anomaly Detection using Aggregated Normative Diffusion (ANDi)

Training problem #1

Closed OPMZZZ closed 2 months ago

OPMZZZ commented 4 months ago

Hello, when I use your program, training is extremely slow for both train.py and the DAE training: one epoch often takes more than two hours, while the ANDi config requires at least 120 epochs before the model starts being saved. What could be causing this? This is with preload: False in the configuration, after running data_preprocess.py. When I set preload: True instead, my 128 GB of memory was almost full at the start of the training phase, and after a few minutes the training process terminated on its own, so I couldn't train at all. What's going on? Sorry to bother you, and looking forward to your reply.

AlexanderFrotscher commented 4 months ago

Hey, thanks for reaching out. Yes, training was also terribly slow for me without preloading, so I always used the preload option. The downside of this approach is the memory required: I think I used more than 200 GB to load the complete BraTS21 dataset into memory. That is probably why you ran into memory issues and the training stopped. In my case, training took place on a high-performance computing cluster, where memory was not a problem. In the future I plan to work with h5 files that store multiple images per file in order to reduce the I/O operations that make the training terribly slow. Sorry for the inconvenience; you either have to do the h5 update yourself first, or wait until I do it. Alternatively, if you do not want to wait and just want to play around with ANDi, I am happy to share the model weights with you.
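As a rough sketch (not the repo's actual code) of what such an h5-backed loader could look like, here is a lazily-reading PyTorch `Dataset`; the file path and the dataset name `"slices"` are placeholders:

```python
# Hypothetical sketch: an HDF5-backed dataset that reads one slice per
# index instead of preloading the whole BraTS21 set into RAM.
# The dataset name "slices" is an assumption, not the repo's layout.
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset

class H5SliceDataset(Dataset):
    def __init__(self, path):
        self.path = path
        self._file = None  # open lazily so each worker gets its own handle
        with h5py.File(path, "r") as f:
            self._len = f["slices"].shape[0]

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        slc = self._file["slices"][idx]  # only this slice hits the disk
        return torch.from_numpy(np.asarray(slc, dtype=np.float32))
```

Opening the file inside `__getitem__` (rather than `__init__`) avoids sharing a single HDF5 handle across DataLoader worker processes.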

OPMZZZ commented 4 months ago

Thank you for your prompt reply. I noticed you state in your paper that your preprocessing is consistent with DAE's. For the BraTS21 dataset, can I directly use the dataloader provided by DAE to train ANDi and get the same results as with yours? I ask because I happened to have run the DAE code and had no training issues there.

AlexanderFrotscher commented 3 months ago

Yeah, this should be possible. The only difference should be the pre-processing of the ground-truth masks (segmentations) for evaluation. I do not know whether the DAE dataloader also leaves out all-zero slices, but that is what I have done: I leave out all-zero slices as well as slices that contain a tumor according to the mask. I will have a look at whether I can update the dataloader provided in this repo.
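The training-slice filter described here could be sketched roughly like this (array shapes and the slice axis are assumptions, not the repo's code):

```python
# Rough sketch: keep only axial slices that are not entirely zero and
# whose segmentation mask contains no tumor labels.
import numpy as np

def healthy_slice_indices(volume, mask):
    """volume, mask: arrays of shape (D, H, W); returns kept slice indices."""
    keep = []
    for i in range(volume.shape[0]):
        nonzero = bool(volume[i].any())       # drop all-zero slices
        tumor_free = not bool(mask[i].any())  # drop slices with tumor labels
        if nonzero and tumor_free:
            keep.append(i)
    return keep
```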

OPMZZZ commented 3 months ago

If you could update your data-loading code, that would be great, because it would make it easier for more people to reproduce your excellent work! In addition, the data used in the inference stage of the DAE code includes all-zero masks, i.e. it contains completely healthy slices. I think you chose to exclude such slices to better verify the segmentation performance of your network, while it is also reasonable for DAE to keep them: in practical applications the examined samples may not have lesions at all, and even when they do, the lesions may not appear in every slice.

AlexanderFrotscher commented 3 months ago

To clarify after this comment: the evaluation was done on complete 3D volumes. The eval and train dataloaders are different, so I also evaluated on healthy slices, or more precisely on the complete brain (even all-zero slices). I did not leave anything out.

OPMZZZ commented 3 months ago

I am sorry, I may have misunderstood you. You mean that you removed zero slices in the training and evaluation phases, while using the complete 3D volume in the test phase. Then I think your preprocessing should be similar to DAE's, the difference being that DAE removes zero slices in all stages. Thank you for answering my questions; I look forward to the next update of your work.

AlexanderFrotscher commented 3 months ago

All good! I could have written the comment above more clearly. Yes, I only remove zero slices and slices with a tumor during training. As soon as evaluation and testing begin, the complete brain is used. What I was referring to is the pre-processing of the labels (ground-truth masks) for evaluation and testing: the DAE repo uses bilinear downsampling on every slice of the mask, whereas I used nearest-exact on the complete volume (only for the label masks, not the actual volumes).
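A toy illustration of why the interpolation mode matters for label masks: bilinear downsampling introduces fractional values at the mask boundary, while `nearest-exact` keeps the labels binary. The sizes here are arbitrary, not taken from either repo:

```python
# Compare bilinear vs nearest-exact downsampling of a binary mask.
import torch
import torch.nn.functional as F

mask = torch.zeros(1, 1, 8, 8)
mask[0, 0, 3:6, 3:6] = 1.0  # a small binary "tumor" region

bilinear = F.interpolate(mask, size=(4, 4), mode="bilinear",
                         align_corners=False)
nearest = F.interpolate(mask, size=(4, 4), mode="nearest-exact")

print(bilinear.unique())  # fractional values appear at the boundary
print(nearest.unique())   # still only 0 and 1
```

With bilinear downsampling the mask is no longer a valid label map (one would have to re-threshold it), which is exactly the kind of discrepancy that can shift evaluation numbers between the two repos.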

AlexanderFrotscher commented 3 months ago

Hey, I updated the repository, and training can now be done without preloading. The files are now stored in an LMDB in order to reduce the I/O significantly. I tested it with the LMDB files stored on an HDD, so the same result should apply to you. You need to update your environment and add lmdb. Afterward, you will need to run the split_healthy.py script again, and then everything should work. For me, training ANDi took 16 h on an A100 while loading from an HDD. Please let me know if everything works for you; then I will mark this issue as closed.

OPMZZZ commented 3 months ago

Thank you for your work. I will try running your program again soon.

OPMZZZ commented 2 months ago

The problem has been completely resolved.