facebookresearch / sscd-copy-detection

Open source implementation of "A Self-Supervised Descriptor for Image Copy Detection" (SSCD).
MIT License

ValueError: `Dataloader` returned 0 length. Please make sure that it returns at least 1 batch #9

Closed ghost closed 1 year ago

ghost commented 1 year ago

I encountered a pytorch_lightning error while using SSCD to train on my own data:

Training command:

python ./sscd/train.py  --gpus=4 --nodes=1 \
  --train_dataset_path=/workspace/sscd-copy-detection/mouse_dst/imgs \
  --entropy_weight=30 --augmentations=ADVANCED --mixup=true \
  --output_path=/workspace/sscd-copy-detection/output

The error message is as follows:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/workspace/sscd-copy-detection/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/sscd-copy-detection/venv/lib/python3.10/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 208, in _wrapped_function
    result = function(*args, **kwargs)
  File "/workspace/sscd-copy-detection/venv/lib/python3.10/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 236, in new_process
    results = trainer.run_stage()
  File "/workspace/sscd-copy-detection/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/workspace/sscd-copy-detection/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/workspace/sscd-copy-detection/venv/lib/python3.10/site-packages/pytorch_lightning/loops/base.py", line 140, in run
    self.on_run_start(*args, **kwargs)
  File "/workspace/sscd-copy-detection/venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 197, in on_run_start
    self.trainer.reset_train_val_dataloaders(self.trainer.lightning_module)
  File "/workspace/sscd-copy-detection/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/data_loading.py", line 594, in reset_train_val_dataloaders
    self.reset_train_dataloader(model=model)
  File "/workspace/sscd-copy-detection/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/data_loading.py", line 385, in reset_train_dataloader
    self.train_dataloader = CombinedLoader(self.train_dataloader, self._data_connector.multiple_trainloader_mode)
  File "/workspace/sscd-copy-detection/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/supporters.py", line 351, in __init__
    self._wrap_loaders_max_size_cycle()
  File "/workspace/sscd-copy-detection/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/supporters.py", line 447, in _wrap_loaders_max_size_cycle
    all_lengths = apply_to_collection(self.loaders, Iterable, get_len, wrong_dtype=(Sequence, Mapping))
  File "/workspace/sscd-copy-detection/venv/lib/python3.10/site-packages/pytorch_lightning/utilities/apply_func.py", line 96, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/workspace/sscd-copy-detection/venv/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py", line 156, in get_len
    if has_len(dataloader):
  File "/workspace/sscd-copy-detection/venv/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py", line 84, in has_len
    raise ValueError("`Dataloader` returned 0 length. Please make sure that it returns at least 1 batch")
ValueError: `Dataloader` returned 0 length. Please make sure that it returns at least 1 batch

Python 3.10 environment packages:

Package                 Version
----------------------- ------------------
absl-py                 1.4.0
aiohttp                 3.8.4
aiosignal               1.3.1
async-timeout           4.0.2
attrs                   23.1.0
augly                   1.0.0
cachetools              5.3.0
certifi                 2023.5.7
charset-normalizer      3.1.0
classy-vision           0.7.0
cmake                   3.26.3
faiss-gpu               1.7.2
filelock                3.12.0
frozenlist              1.3.3
fsspec                  2023.5.0
future                  0.18.3
fvcore                  0.1.5.post20221221
google-auth             2.18.0
google-auth-oauthlib    1.0.0
grpcio                  1.54.2
idna                    3.4
iopath                  0.1.10
Jinja2                  3.1.2
lightning-bolts         0.5.0
lit                     16.0.3
Markdown                3.4.3
MarkupSafe              2.1.2
mpmath                  1.3.0
multidict               6.0.4
networkx                3.1
numpy                   1.24.3
oauthlib                3.2.2
packaging               23.1
pandas                  2.0.1
Pillow                  9.5.0
pip                     23.1.2
portalocker             2.7.0
protobuf                4.23.0
pyasn1                  0.5.0
pyasn1-modules          0.3.0
pyDeprecate             0.3.1
python-dateutil         2.8.2
python-magic            0.4.27
pytorch-lightning       1.5.10
pytz                    2023.3
PyYAML                  6.0
regex                   2023.5.5
requests                2.30.0
requests-oauthlib       1.3.1
rsa                     4.9
setuptools              59.5.0
six                     1.16.0
sympy                   1.12
tabulate                0.9.0
tensorboard             2.13.0
tensorboard-data-server 0.7.0
termcolor               2.3.0
torch                   1.13.1+cu117
torchaudio              0.13.1+cu117
torchmetrics            0.11.4
torchvision             0.14.1+cu117
tqdm                    4.65.0
triton                  2.0.0
typing_extensions       4.5.0
tzdata                  2023.3
urllib3                 1.26.15
Werkzeug                2.3.4
wheel                   0.40.0
yacs                    0.1.8
yarl                    1.9.2

After debugging, I found that self.train_dataloader, obtained at line 365 of pytorch_lightning/trainer/data_loading.py, has length 0, which causes this error. What is the reason for this?

ghost commented 1 year ago

I found that my train_loader length is 0, but my train_dataset length is 966. Debugging in progress.
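
A quick way to reproduce this symptom outside of the training script is sketched below. It is only an illustration: the batch size of 1024 and the dummy TensorDataset are assumptions chosen for the example, since the thread does not state the batch size actually used. It shows that a PyTorch DataLoader reports length 0 when drop_last=True and the dataset is smaller than one batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset with the same number of items as in this issue (966).
dataset = TensorDataset(torch.zeros(966, 1))

# Hypothetical batch size, larger than the dataset (an assumption for illustration).
batch_size = 1024

# With drop_last=True the only (incomplete) batch is discarded.
loader_drop = DataLoader(dataset, batch_size=batch_size, drop_last=True)
# With drop_last=False the incomplete batch is kept.
loader_keep = DataLoader(dataset, batch_size=batch_size, drop_last=False)

print(len(dataset), len(loader_drop), len(loader_keep))
```

If the loader built by the training script behaves like loader_drop here, a nonzero dataset length and a zero loader length are consistent with each other.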

ghost commented 1 year ago

I have finally solved this problem.

In pytorch_lightning/trainer/data_loading.py, at line 341, in the function TrainerDataLoadingMixin._update_dataloader:

Change:

    @staticmethod
    def _update_dataloader(dataloader: DataLoader, sampler: Sampler, mode: Optional[RunningStage] = None) -> DataLoader:
        dl_kwargs = TrainerDataLoadingMixin._get_dataloader_init_kwargs(dataloader, sampler, mode=mode)
        dl_cls = type(dataloader)
        dataloader = dl_cls(**dl_kwargs)
        return dataloader

To:

    @staticmethod
    def _update_dataloader(dataloader: DataLoader, sampler: Sampler, mode: Optional[RunningStage] = None) -> DataLoader:
        dl_kwargs = TrainerDataLoadingMixin._get_dataloader_init_kwargs(dataloader, sampler, mode=mode)
        dl_kwargs["drop_last"] = False
        dl_cls = type(dataloader)
        dataloader = dl_cls(**dl_kwargs)
        return dataloader

edpizzi commented 1 year ago

Sorry for the delay, but I'm glad you found a solution.

That said, I think 900 is likely too few images to train on. drop_last = False will mean that one training iteration completes per epoch, but such a small training set may not result in a useful representation. It would likely be better to use one of the provided models, rather than training on a small dataset.
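
The arithmetic behind "one training iteration per epoch" can be sketched as follows. This is a rough model under stated assumptions, not the SSCD or Lightning implementation: batches_per_rank is a hypothetical helper, the per-process batch size of 1024 is an assumed value (the thread does not state it), and each process is assumed to receive ceil(N / world_size) samples, as a distributed sampler that pads rather than drops would give.

```python
import math


def batches_per_rank(num_samples, world_size, per_rank_batch_size, drop_last):
    """Hypothetical helper: number of batches each process sees per epoch."""
    # Assumption: the dataset is split evenly across processes, padding the
    # last shard, so every process gets ceil(num_samples / world_size) items.
    samples_per_rank = math.ceil(num_samples / world_size)
    if drop_last:
        # Incomplete batches are discarded.
        return samples_per_rank // per_rank_batch_size
    # Incomplete batches are kept.
    return math.ceil(samples_per_rank / per_rank_batch_size)


# 966 images on 4 GPUs, with an assumed per-process batch size of 1024.
print(batches_per_rank(966, 4, 1024, drop_last=True))   # 0 batches: triggers the error
print(batches_per_rank(966, 4, 1024, drop_last=False))  # 1 batch per epoch
```

Under these assumptions each process holds 242 samples, so dropping the incomplete batch leaves nothing to iterate over, while keeping it yields a single small batch per epoch.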

Do you mind sharing what you're trying to do by training in this way?

ghost commented 1 year ago

Thank you for your feedback, @edpizzi. We are exploring robust models with the aim of achieving Open World Object Detection (OWOD) using a minimal number of training samples. We understand the limitations of training on a small dataset, but our goal is to see how well these models can generalize with limited data. We would appreciate any insights or recommendations you might have on this matter.