DataCTE / SDXL-Training-Improvements


Training fails with OOM on batch_size > 1 despite 80GB VRAM cards #1

Open · DataCTE opened 6 days ago

DataCTE commented 6 days ago

Description

When running training with batch_size greater than 1, the process runs out of memory despite testing on high-VRAM cards (80GB). This appears to be a memory management issue rather than a VRAM capacity limitation.

Steps to Reproduce

  1. Set up training with default parameters
  2. Set batch_size=2 or higher
  3. Start training
  4. Process crashes with OOM error

Current Behavior

Training crashes with a CUDA out-of-memory error as soon as batch_size is set above 1, even on 80GB cards.

Expected Behavior

Training runs with batch_size > 1, with memory usage scaling with batch size rather than exhausting the card.

Potential Investigation Areas

  1. Memory profiling during batch processing (see the sketch after this list)
  2. Gradient accumulation implementation
  3. Cache clearing between batches
  4. Model state handling in CustomDataset.__getitem__
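
As a starting point for items 1 and 3, here is a minimal sketch of per-step memory instrumentation. It is not tied to this repo's actual training loop; `model`, `optimizer`, and `dataloader` are placeholders, and the forward call returning a scalar loss is an assumption. The idea is to distinguish a leak (allocated memory climbing step over step) from a genuine per-step peak that exceeds capacity.

```python
import torch

def log_cuda_memory(tag: str) -> None:
    """Print allocated / reserved / peak CUDA memory in MiB at one point in the loop."""
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"[{tag}] allocated={alloc:.0f}MiB reserved={reserved:.0f}MiB peak={peak:.0f}MiB")

def profiled_training_loop(model, optimizer, dataloader, device="cuda"):
    """Wrap an ordinary training loop with per-step memory logging.

    `model`, `optimizer`, and `dataloader` stand in for whatever this repo
    actually constructs; only the instrumentation pattern matters here.
    """
    model.train()
    for step, batch in enumerate(dataloader):
        torch.cuda.reset_peak_memory_stats()  # measure the peak per step
        log_cuda_memory(f"step {step} start")

        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
        loss = model(**batch)  # placeholder forward pass returning a scalar loss
        log_cuda_memory(f"step {step} after forward")

        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)  # frees gradient tensors instead of zeroing them
        log_cuda_memory(f"step {step} after backward/step")

        # If "allocated" keeps climbing here from step to step, something is holding
        # references (e.g. cached model states or stored loss tensors). If only the
        # per-step peak is too high, it is a genuine capacity problem.
        torch.cuda.empty_cache()  # optional: returns cached blocks to the driver
```

If the per-step peak is the problem rather than a leak, gradient accumulation (item 2) with batch_size=1 and several accumulation steps would reproduce the effective batch size without the memory spike.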

Notes

The issue might be related to how tensors are handled in the custom_collate function or how the model states are managed during forward passes. Memory profiling tools like torch.cuda.memory_summary() could help identify the leak point.
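
Building on that note, below is a hedged sketch of how a collate wrapper could catch the two usual leak sources, tensors created on the GPU inside the dataset and tensors still attached to an autograd graph. It assumes each sample is a dict of CPU tensors, which may not match custom_collate's actual signature; the memory_summary() call is the diagnostic mentioned above.

```python
import torch

def debug_collate(batch):
    """Hypothetical stand-in for this repo's custom_collate: stacks samples while
    checking for CUDA tensors and autograd history leaking out of __getitem__."""
    out = {}
    for key in batch[0]:
        tensors = [sample[key] for sample in batch]
        for t in tensors:
            assert not t.is_cuda, f"'{key}' is already on GPU in __getitem__; keep dataset output on CPU"
            assert not t.requires_grad, f"'{key}' carries autograd history; detach() before returning"
        out[key] = torch.stack(tensors)
    return out

# Dumping the allocator state right after an OOM (or every N steps) shows which
# allocation sizes dominate and whether they grow over time:
# print(torch.cuda.memory_summary(abbreviated=True))
```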