Open Smarter1214 opened 1 month ago
@mingxin-zheng @dongyang0122 @wyli would anyone have insights here? This may be related to multiprocessing issues, the number of open files, garbage collection, the PyTorch sharing strategy, or some other technical issue. Thanks!
Thanks @Smarter1214 for finding the issue. It would be helpful if you could share some logs/outputs so that we can pinpoint it further.
In particular, I am wondering which step the error occurs in: DataAnalyzing or Training?
I followed the steps in the auto3dseg_hello_world.ipynb notebook, set the corresponding paths and parameters, and ran it in an environment with 48 GB of GPU memory. Training on a dataset of 300 .nii.gz images fails with the error RuntimeError: Pin memory thread exited unexpectedly. In contrast, training on a dataset of 20 images proceeds smoothly under exactly the same conditions. While training on the 300-image dataset, I monitored GPU memory usage and it stayed below 70%, yet the error occurs every time. Could there be an issue with the get_data step?
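Two of the candidate causes raised earlier in the thread, the number of open files and the PyTorch sharing strategy, can be inspected before training starts. The following is only a diagnostic sketch, not a confirmed fix for this error. The helper names (open_file_limits, raise_soft_limit, sharing_strategy_info) and the target of 4096 descriptors are illustrative assumptions. It relies on the Unix-only resource module, and the PyTorch part is skipped if torch is not installed.

```python
import resource

try:
    import torch.multiprocessing as mp
except ImportError:  # PyTorch not installed; skip the sharing-strategy check
    mp = None


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target=4096):
    """Try to raise the soft open-file limit to `target` (hypothetical value).

    The soft limit can never exceed the hard limit, so the target is
    capped accordingly. Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        return soft  # the platform refused the change; leave it as is
    return target


def sharing_strategy_info():
    """Return (current, available) tensor sharing strategies, or None without PyTorch."""
    if mp is None:
        return None
    return mp.get_sharing_strategy(), sorted(mp.get_all_sharing_strategies())


if __name__ == "__main__":
    print("open-file limits (soft, hard):", open_file_limits())
    print("soft limit after adjustment:", raise_soft_limit())
    print("sharing strategy (current, available):", sharing_strategy_info())
```

If the soft limit turns out to be low, or the current strategy is file_descriptor, one experiment is to call torch.multiprocessing.set_sharing_strategy("file_system") before the data loaders are created and see whether the 300-image run behaves differently. Whether this applies here is an assumption until the logs confirm it.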