Open MarcoForte opened 1 week ago
Hi! Thanks for your contribution, great first issue!
Hey @MarcoForte. Fascinating, I have never seen this ;) Can you share a reproducible script with fake data? Does this issue still happen if you use a single StreamingDataset?
Cheers @tchaton, yeah it was a bit surprising 👀. I only noticed it because I was running in torch.compile mode, and the varying batch size was triggering recompilation, causing a big slowdown. Otherwise it could have gone unnoticed.
It did happen with a single StreamingDataset too, i.e. bypassing the CombinedStreamingDataset.
If I find a moment I'll try for a reproducible script, thanks.
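As a starting point, a fake-data harness along these lines could flag any deviation (a sketch using plain PyTorch tensors in place of the streaming pieces, not the actual setup from my training code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Fake data standing in for the streamed samples.
dataset = TensorDataset(torch.randn(1000, 3, 32, 32))

batch_size = 64
loader = DataLoader(dataset, batch_size=batch_size, drop_last=True)

# Record the index of every batch whose size deviates from batch_size.
deviations = [i for i, (batch,) in enumerate(loader) if batch.shape[0] != batch_size]
print(deviations)  # expected: [] when drop_last=True
```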
Thanks a lot @MarcoForte. Looking forward to the code so I can debug it.
Hey @MarcoForte, any chance you could provide a reproducible script?
Hey @MarcoForte
Unfortunately, I can't reproduce this issue on my end.
```python
import os

from lightning_cloud.utils.data_connection import add_s3_connection
from lightning.data import StreamingDataset, StreamingDataLoader
from lightning.data.streaming.serializers import JPEGSerializer
import torchvision.transforms.v2 as T
import open_clip
from tqdm import tqdm

# 1. Add the prepared dataset to your teamspace
add_s3_connection("laoin-400m")


# 2. Create the streaming dataset
class LAIONStreamingDataset(StreamingDataset):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tokenizer = open_clip.get_tokenizer('ViT-B-32', context_length=512)  # You can use any tokenizer
        self.serializer = JPEGSerializer()
        self.preprocess = T.Normalize(
            mean=(0.48145466, 0.4578275, 0.40821073),
            std=(0.26862954, 0.26130258, 0.27577711),
        )

    def __getitem__(self, index):
        _, image, text, _, _, _ = super().__getitem__(index)
        image = self.serializer.deserialize(image).float()
        return self.preprocess(image)


dataset = LAIONStreamingDataset(input_dir="/teamspace/s3_connections/laoin-400m")

batch_size = 64
dataloader = StreamingDataLoader(dataset, batch_size=batch_size, num_workers=os.cpu_count())

for batch in tqdm(dataloader):
    assert batch.shape[0] == batch_size
```
🐛 Bug
Hello, I'm running into an issue where my batch size begins to vary halfway through an epoch.
To Reproduce
I logged when it deviated from 64. It happens in every epoch, and also when training on a single GPU.
Code sample
Unfortunately I can't share the code, but I will share as much as I can, and I can run many experiments. I'm launching the training with
`torchrun --standalone --nnodes=1 --nproc-per-node=8 main.py`
I use `sets = [StreamingDataset(a), StreamingDataset(b)]` and `Dataloader(CombinedStreamingDataset(datasets=sets))`, with `drop_last=True`. I launch the training through `trainer.fit`.
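For reference, this is the behavior I'd expect: combining datasets and setting drop_last=True should guarantee a fixed batch size. An illustrative sketch with plain PyTorch (ConcatDataset loosely standing in for CombinedStreamingDataset, not the litdata internals):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Two fake datasets combined, loosely mirroring CombinedStreamingDataset(datasets=[a, b]).
sets = ConcatDataset([TensorDataset(torch.arange(300)), TensorDataset(torch.arange(500))])

# With drop_last=True every yielded batch should have exactly 64 samples.
sizes = [b.shape[0] for (b,) in DataLoader(sets, batch_size=64, drop_last=True)]
assert all(s == 64 for s in sizes)
print(len(sizes))  # 800 // 64 = 12 full batches
```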
Expected behavior
A fixed batch size throughout the epoch.
Environment
Using the NGC 23.05 container:
- Ubuntu 22.04, Python 3.10
- NVIDIA CUDA 12.4.1, cuBLAS 12.4.5.8, cuDNN 9.1.0.70, NCCL 2.21.5
- lightning==2.3.0, litdata==0.2.12
- 8 x H100