Description
When training with batch_size greater than 1, the process runs out of GPU memory, even when tested on high-VRAM cards (80 GB). This appears to be a memory-management issue rather than a VRAM capacity limitation.
Steps to Reproduce
Set up training with default parameters
Set batch_size=2 or higher
Start training (a minimal repro sketch is shown after this list)
Process crashes with an OOM error
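For reference, here is a minimal repro sketch of the pattern described above, assuming a standard PyTorch training loop. ToyDataset, the tiny linear model, and run() are placeholders invented for illustration; they are not the project's actual CustomDataset or training code.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Stand-in for CustomDataset; the real dataset is not shown in this issue."""

    def __len__(self):
        return 64

    def __getitem__(self, idx):
        # Random image-sized tensor plus an integer class label.
        return torch.randn(3, 224, 224), torch.randint(0, 10, (1,)).item()


def run(batch_size: int = 2) -> None:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Sequential(
        torch.nn.Flatten(),
        torch.nn.Linear(3 * 224 * 224, 10),
    ).to(device)
    optimizer = torch.optim.Adam(model.parameters())
    loader = DataLoader(ToyDataset(), batch_size=batch_size, shuffle=True)

    for step, (images, labels) in enumerate(loader):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
        # With batch_size > 1 the reported crash happens somewhere in this loop.
        print(f"step {step}: loss={loss.item():.4f}")


if __name__ == "__main__":
    run(batch_size=2)
```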
Current Behavior
Process crashes with an OOM error whenever batch_size > 1
GPU memory usage grows unexpectedly during training
Expected Behavior
Training should handle batch sizes greater than 1 on 80 GB VRAM cards
Potential Investigation Areas
Memory profiling during batch processing (see the instrumentation sketch after this list)
Gradient accumulation implementation
Cache clearing between batches
Model state handling in CustomDataset.__getitem__
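If it helps triage, here is a rough instrumentation sketch covering the first and third points (per-batch memory logging and cache clearing between batches). The function names, loss choice, and loader are assumptions for illustration, not the project's actual loop.

```python
import gc

import torch


def log_cuda_memory(tag: str) -> None:
    """Print allocated/reserved CUDA memory so per-batch growth is visible."""
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")


def profile_epoch(model, loader, optimizer, device) -> None:
    """Run one epoch with memory logging and aggressive cache clearing."""
    for step, (inputs, targets) in enumerate(loader):
        log_cuda_memory(f"step {step} start")
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad(set_to_none=True)
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()

        # Keep only the Python float; holding the loss tensor keeps the graph alive.
        loss_value = loss.item()
        log_cuda_memory(f"step {step} end (loss={loss_value:.4f})")

        # Diagnostic-only: clear caches between batches to separate true leaks
        # from allocator caching effects.
        del loss, inputs, targets
        gc.collect()
        torch.cuda.empty_cache()
```

If allocated memory keeps climbing even with the explicit cache clearing, that points to live references (e.g. stored tensors with attached graphs) rather than allocator caching.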
Notes
The issue might be related to how tensors are handled in the custom_collate function or how model state is managed during forward passes. Memory profiling tools such as torch.cuda.memory_summary() could help pinpoint where the leak occurs. A generic sketch of both ideas follows below.
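Since the project's custom_collate is not shown in this issue, the following is only a generic sketch: a collate function that keeps everything on the CPU (useful for ruling the collate path in or out as the leak source) plus a helper that dumps torch.cuda.memory_summary() every few steps. Both function names are hypothetical.

```python
from typing import List, Tuple

import torch


def cpu_only_collate(batch: List[Tuple[torch.Tensor, int]]):
    """Stack samples on the CPU; no .to("cuda") calls on the collate path."""
    images = torch.stack([sample[0] for sample in batch])   # stays on CPU
    labels = torch.tensor([sample[1] for sample in batch])  # stays on CPU
    return images, labels


def maybe_dump_memory_summary(step: int, every: int = 50) -> None:
    """Print the allocator report every few steps to spot steady growth."""
    if torch.cuda.is_available() and step % every == 0:
        print(torch.cuda.memory_summary(abbreviated=True))
```

Swapping a CPU-only collate in temporarily and comparing the memory_summary() output across steps should show whether allocations accumulate in data loading or in the forward/backward path.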