Open melisande-c opened 4 weeks ago
Very naive question (because I'm new here :wave:): I'm curious why you would ever get data shuffling (whether or not you manually assign `.dataloader_params`) without explicitly passing `shuffle=True`. The default for `torch.utils.data.DataLoader` is `shuffle=None` (i.e. `False`), and between assigning `dataloader_params` here: and creating the `DataLoader` here: I don't see any internal CAREamics logic that sets shuffle to `True`?
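For reference, a minimal sanity check of the stock `torch.utils.data.DataLoader` defaults (nothing CAREamics-specific here) confirms that without `shuffle=True` a `SequentialSampler` is used:

```python
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

data = list(range(8))  # any map-style dataset works for this check

# shuffle defaults to None, which is treated as False -> sequential order
print(isinstance(DataLoader(data).sampler, SequentialSampler))            # True
print(isinstance(DataLoader(data, shuffle=True).sampler, RandomSampler))  # True
```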
Hi @tlambert03, after some investigation it seems that, indeed, by default the train dataloader is not shuffled. I originally concluded that it was shuffled because not passing any dataloader parameters gives good results, passing `{"num_workers": 4}` gives bad results, but `{"num_workers": 4, "shuffle": True}` gives good results again. However, I have now saved the input batches during training, and it seems the data is not shuffled unless `"shuffle": True` is passed, so something strange is going on.
In the Lightning `Trainer` docs it says `use_distributed_sampler` is `True` by default, and mentions: "By default, it will add shuffle=True for the train sampler and shuffle=False for validation/test/predict samplers." This may be having some effect, but it is difficult to work out what is going on in the Lightning source code.
For each of the following experiments, this is how I initialise the different Lightning components; the only difference is the `dataloader_params`. I also turned off augmentations by passing an empty list, to ensure they weren't causing any differences.
```python
num_epochs = 10

config = create_n2v_configuration(
    experiment_name="<RUN NAME>",
    data_type="tiff",
    axes="SYX",
    patch_size=(64, 64),
    batch_size=64,
    num_epochs=num_epochs,
    augmentations=[],  # augmentations disabled for these experiments
    dataloader_params={"num_workers": 4, "shuffle": True},  # the only parameter varied between runs
)
print(config.data_config.transforms)
print(config.data_config.dataloader_params)

config.algorithm_config.optimizer.parameters["lr"] = 1e-4

lightning_module = create_careamics_module(
    algorithm=config.algorithm_config.algorithm,
    loss=config.algorithm_config.loss,
    architecture=config.algorithm_config.model.architecture,
    optimizer_parameters=config.algorithm_config.optimizer.parameters,
)

train_data_module = create_train_datamodule(
    train_data=train_path,
    val_data=val_path,
    data_type=config.data_config.data_type,
    patch_size=config.data_config.patch_size,
    transforms=config.data_config.transforms,
    axes=config.data_config.axes,
    batch_size=config.data_config.batch_size,
    dataloader_params=config.data_config.dataloader_params,
)

checkpoint_callback = ModelCheckpoint(
    dirpath=Path(__file__).parent / "checkpoints",
    filename=config.experiment_name,
    **config.training_config.checkpoint_callback.model_dump(),
)

n_batches = 5
save_dloader_callback = SaveDataloaderOutputs(n_batches=n_batches)

trainer = Trainer(
    max_epochs=config.training_config.num_epochs,
    precision=config.training_config.precision,
    max_steps=config.training_config.max_steps,
    check_val_every_n_epoch=config.training_config.check_val_every_n_epoch,
    enable_progress_bar=config.training_config.enable_progress_bar,
    accumulate_grad_batches=config.training_config.accumulate_grad_batches,
    gradient_clip_val=config.training_config.gradient_clip_val,
    gradient_clip_algorithm=config.training_config.gradient_clip_algorithm,
    callbacks=[
        checkpoint_callback,
        HyperParametersCallback(config),
        save_dloader_callback,
    ],
    default_root_dir=Path(__file__).parent,
    logger=WandbLogger(
        name=config.experiment_name,
        save_dir=Path(__file__).parent / Path(config.experiment_name) / "logs",
    ),
)

trainer.fit(model=lightning_module, datamodule=train_data_module)
```
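For context, `SaveDataloaderOutputs` is a small custom callback whose implementation isn't included above. A minimal sketch of such a batch-saving callback (file naming and saving details are illustrative only) could look like this:

```python
from pathlib import Path

import numpy as np
import pytorch_lightning as pl

class SaveDataloaderOutputs(pl.Callback):
    """Sketch of a batch-saving callback: dump the first `n_batches` training batches
    of every epoch so their order can be compared across epochs."""

    def __init__(self, n_batches: int, out_dir: str = "saved_batches"):
        self.n_batches = n_batches
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(exist_ok=True)

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx < self.n_batches:
            # assume the batch is a tensor or an (input, ...) tuple of tensors
            x = batch[0] if isinstance(batch, (tuple, list)) else batch
            np.save(
                self.out_dir / f"epoch{trainer.current_epoch}_batch{batch_idx}.npy",
                x.detach().cpu().numpy(),
            )
```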
- **No `dataloader_params`**: seems to work fine; the prediction output looks good.
- **`{"num_workers": 4}`**: very bad results.
- **`{"num_workers": 4, "shuffle": True}`**: results look good again.
Looking at the loss and validation curves, when `dataloader_params={"num_workers": 4}` the model appears to be overfitting somehow, since the training loss gets much lower than in the other runs. However, when I save the batches during training for the experiment with no dataloader params, they are in the same order for each epoch 🤷‍♀️. Only in the `shuffle=True` case are the batches shuffled.
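The comparison boils down to something like this (assuming the batches were saved as `.npy` files, as in the callback sketch above; file names are illustrative):

```python
import numpy as np

# If the first training batch is bit-identical across epochs, the data was not shuffled.
epoch0 = np.load("saved_batches/epoch0_batch0.npy")
epoch1 = np.load("saved_batches/epoch1_batch0.npy")
print("same first batch in epochs 0 and 1:", np.array_equal(epoch0, epoch1))
```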
The safest thing to do, of course, is to enforce `shuffle=True`, but it would be good to get to the bottom of this.
When passing parameters to the dataloader in the `TrainDataModule`, it may prevent the dataloader from shuffling the data. A fix is to explicitly pass `shuffle=True`. After some further investigation, an issue should likely be raised on PyTorch Lightning.

Initialising the `TrainDataModule` with `dataloader_params` that do not include `shuffle` (e.g. `{"num_workers": 4}`) gives bad results because the shuffling is somehow prevented. To specify the number of workers and keep shuffling, use `dataloader_params={"num_workers": 4, "shuffle": True}`.