carbonscott / maxie

Masked Autoencoder for X-ray Image Encoding (MAXIE)

Training gets stuck due to None batch #10

Closed carbonscott closed 4 months ago

carbonscott commented 4 months ago

Create a dummy all-zero tensor to substitute for a None batch, so that the batch causes no weight update (bias terms can still be updated).
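A minimal sketch of why an all-zero input leaves weights alone but not biases, using a single linear layer in plain Python. The names `substitute_none` and `linear_grads` are hypothetical and are not taken from the maxie code base; the squared-error loss is also an assumption made only for illustration. The argument applies to a layer that is fed the zeros directly. Deeper layers see non-zero activations produced by earlier biases, so their weights may still receive a gradient.

```python
# Hypothetical illustration; function names and the loss are assumptions,
# not the actual maxie implementation.

def substitute_none(batch, shape):
    """Return `batch` unchanged, or an all-zero (rows x cols) nested list if it is None."""
    if batch is not None:
        return batch
    rows, cols = shape
    return [[0.0] * cols for _ in range(rows)]


def linear_grads(x, w, b, target):
    """Gradients of 0.5 * sum((w @ x + b - target)**2) for one input vector x."""
    # Forward pass: y_i = sum_j w_ij * x_j + b_i
    y = [sum(wij * xj for wij, xj in zip(wi, x)) + bi for wi, bi in zip(w, b)]
    # dL/dy_i = y_i - target_i
    dy = [yi - ti for yi, ti in zip(y, target)]
    # dL/dw_ij = dL/dy_i * x_j, which vanishes when every x_j is zero
    grad_w = [[dyi * xj for xj in x] for dyi in dy]
    # dL/db_i = dL/dy_i, which does not depend on x
    grad_b = dy
    return grad_w, grad_b


w = [[0.5, -1.0], [2.0, 0.25]]
b = [0.1, -0.2]
target = [1.0, 1.0]

# A None batch is replaced by a single all-zero sample.
batch = substitute_none(None, shape=(1, 2))
grad_w, grad_b = linear_grads(batch[0], w, b, target)

print("weight gradient:", grad_w)  # every entry is zero
print("bias gradient:", grad_b)    # generally non-zero
```

Substituting a dummy batch rather than skipping the step keeps every rank executing the same number of forward and backward passes, at the cost of a small bias update from the dummy sample.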

carbonscott commented 4 months ago

Commit 125471406ea519d60f967ee1b1e8c460e518be5b solved this issue.

carbonscott commented 4 months ago

This issue was discovered when round-robin scheduling (see #9) was applied with entry_per_cycle = 1.

carbonscott commented 4 months ago

Example log output from a rank that saw a None batch:

05/25/2024 20:06:48 DEBUG __main__
[RANK 9] Start processing 4 batches at epoch 0, seg 0.
05/25/2024 20:07:50 DEBUG maxie.datasets.ipc_segmented_dataset_dist
[RANK 9] exp=cxil1022721, run=82, detector_name=jungfrau4M, event=0.
05/25/2024 20:08:53 DEBUG maxie.datasets.ipc_segmented_dataset_dist
[RANK 9] exp=cxily5921, run=21, detector_name=jungfrau4M, event=0.
05/25/2024 20:09:49 DEBUG maxie.datasets.ipc_segmented_dataset_dist
[RANK 9] exp=mfxx49820, run=86, detector_name=epix10k2M, event=0.
05/25/2024 20:09:50 DEBUG maxie.datasets.ipc_segmented_dataset_dist
Server error: Received None from exp=cxily5921, run=1, event=0!!!
05/25/2024 20:09:50 DEBUG maxie.datasets.ipc_segmented_dataset_dist
[RANK 9] exp=cxily5921, run=1, detector_name=jungfrau4M, event=0.
05/25/2024 20:09:50 DEBUG __main__
[RANK 9] Found None batch at batch idx 3.  Creating a dummy input!!!