Description
When training with batch_size greater than 1, the process runs out of GPU memory, even when tested on high-VRAM cards (80 GB). This appears to be a memory-management issue rather than a VRAM capacity limitation.
Steps to Reproduce
Set up training with default parameters
Set batch_size=2 or higher
Start training (a minimal repro sketch is shown after this list)
Process crashes with an OOM error
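For reference, here is a minimal repro sketch of the pattern described above, assuming a standard PyTorch training loop. ToyDataset, the tiny linear model, and run() are placeholders invented for illustration; they are not the project's actual CustomDataset or training code.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Stand-in for CustomDataset; the real dataset is not shown in this issue."""

    def __len__(self):
        return 64

    def __getitem__(self, idx):
        # Random image-sized tensor plus an integer class label.
        return torch.randn(3, 224, 224), torch.randint(0, 10, (1,)).item()


def run(batch_size: int = 2) -> None:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Sequential(
        torch.nn.Flatten(),
        torch.nn.Linear(3 * 224 * 224, 10),
    ).to(device)
    optimizer = torch.optim.Adam(model.parameters())
    loader = DataLoader(ToyDataset(), batch_size=batch_size, shuffle=True)

    for step, (images, labels) in enumerate(loader):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
        # With batch_size > 1 the reported crash happens somewhere in this loop.
        print(f"step {step}: loss={loss.item():.4f}")


if __name__ == "__main__":
    run(batch_size=2)
```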
Current Behavior
Process crashes with an OOM error whenever batch_size > 1
GPU memory usage grows unexpectedly during training
Expected Behavior
Training should handle batch sizes greater than 1 on 80 GB VRAM cards
Potential Investigation Areas
Memory profiling during batch processing (see the instrumentation sketch after this list)
Gradient accumulation implementation
Cache clearing between batches
Model state handling in CustomDataset.__getitem__
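If it helps triage, here is a rough instrumentation sketch covering the first and third points (per-batch memory logging and cache clearing between batches). The function names, loss choice, and loader are assumptions for illustration, not the project's actual loop.

```python
import gc

import torch


def log_cuda_memory(tag: str) -> None:
    """Print allocated/reserved CUDA memory so per-batch growth is visible."""
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")


def profile_epoch(model, loader, optimizer, device) -> None:
    """Run one epoch with memory logging and aggressive cache clearing."""
    for step, (inputs, targets) in enumerate(loader):
        log_cuda_memory(f"step {step} start")
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad(set_to_none=True)
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()

        # Keep only the Python float; holding the loss tensor keeps the graph alive.
        loss_value = loss.item()
        log_cuda_memory(f"step {step} end (loss={loss_value:.4f})")

        # Diagnostic-only: clear caches between batches to separate true leaks
        # from allocator caching effects.
        del loss, inputs, targets
        gc.collect()
        torch.cuda.empty_cache()
```

If allocated memory keeps climbing even with the explicit cache clearing, that points to live references (e.g. stored tensors with attached graphs) rather than allocator caching.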
Notes
The issue might be related to how tensors are handled in the custom_collate function or how model state is managed during forward passes. Memory profiling tools such as torch.cuda.memory_summary() could help pinpoint where the leak occurs. A generic sketch of both ideas follows below.
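Since the project's custom_collate is not shown in this issue, the following is only a generic sketch: a collate function that keeps everything on the CPU (useful for ruling the collate path in or out as the leak source) plus a helper that dumps torch.cuda.memory_summary() every few steps. Both function names are hypothetical.

```python
from typing import List, Tuple

import torch


def cpu_only_collate(batch: List[Tuple[torch.Tensor, int]]):
    """Stack samples on the CPU; no .to("cuda") calls on the collate path."""
    images = torch.stack([sample[0] for sample in batch])   # stays on CPU
    labels = torch.tensor([sample[1] for sample in batch])  # stays on CPU
    return images, labels


def maybe_dump_memory_summary(step: int, every: int = 50) -> None:
    """Print the allocator report every few steps to spot steady growth."""
    if torch.cuda.is_available() and step % every == 0:
        print(torch.cuda.memory_summary(abbreviated=True))
```

Swapping a CPU-only collate in temporarily and comparing the memory_summary() output across steps should show whether allocations accumulate in data loading or in the forward/backward path.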