Justin-Tan / high-fidelity-generative-compression

Pytorch implementation of High-Fidelity Generative Image Compression + Routines for neural image compression
Apache License 2.0

How to use a custom dataset? #13

QLaHPD opened this issue 3 years ago

QLaHPD commented 3 years ago

I've changed default_config.py to point to a custom folder of images:

folder/path
|----/image001.jpg
|----/image002.jpg
...

But it raised: ValueError: num_samples should be a positive integer value, but got num_samples=0

Justin-Tan commented 3 years ago

Posting the full stack trace would help. If you rename the dataset in default_config.py under DatasetPaths, you must also create a new dataset class with a corresponding name that inherits from the BaseDataset class in src/helpers/datasets.py. There are a few examples in that file.
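For illustration, a rough sketch of such a subclass; the BaseDataset constructor arguments assumed below are a guess, so check src/helpers/datasets.py for the real interface:

import glob, os
from src.helpers.datasets import BaseDataset

class CustomImages(BaseDataset):
    # Hypothetical subclass mirroring the examples in datasets.py
    def __init__(self, root, mode='train', **kwargs):
        super().__init__(**kwargs)  # assumption: BaseDataset accepts kwargs
        # Collect images from <root>/<mode>/, e.g. /my/data/train/
        self.imgs = sorted(glob.glob(os.path.join(root, mode, '*.png')))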

QLaHPD commented 3 years ago

Actually, I think the problem is that torch.utils.data is not finding the images in the folder, so it returns num_samples=0. What is the directory structure of the OpenImages dataset?

Justin-Tan commented 3 years ago

If you post the stacktrace it would be easier to diagnose. If you look at the parent BaseDataset class you'll notice the dataset directory should contain train/ and test/ subfolders.

QLaHPD commented 3 years ago

In default_config.py:

class DatasetPaths(object):
    OPENIMAGES = '/mnt/ramdisk/root_folder'
    CITYSCAPES = ''
    JETS = ''

class args(object):
    dataset = Datasets.OPENIMAGES
    dataset_path = DatasetPaths.OPENIMAGES

The structure is:

/mnt/ramdisk/root_folder
|----/train
|--------/image001.png
|----/test
|--------/image001.png
|----/val
|--------/image001.png
The traceback:

Traceback (most recent call last):
  File "train.py", line 322, in <module>
    normalize=args.normalize_input_image)
  File "/home/user/anaconda3/envs/HIFIC/high-fidelity-generative-compression-master/src/helpers/datasets.py", line 75, in get_dataloaders
    pin_memory=pin_memory)
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 224, in __init__
    sampler = RandomSampler(dataset, generator=generator)
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 96, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

Justin-Tan commented 3 years ago

I think the problem was the following line:

self.imgs = glob.glob(os.path.join(data_dir, '*.jpg'))

which would only get JPGs. I pushed a fix to master to account for PNGs as well. Let me know if you still have issues.
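For reference, the fix amounts to globbing multiple extensions. A minimal sketch (the helper name collect_images is illustrative, not the repo's actual function):

import glob, os

def collect_images(data_dir):
    # Match JPGs and PNGs instead of only '*.jpg'
    paths = []
    for ext in ('*.jpg', '*.jpeg', '*.png'):
        paths.extend(glob.glob(os.path.join(data_dir, ext)))
    return sorted(paths)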

QLaHPD commented 3 years ago

Unfortunately that didn't work; I get the same error. What absolute path is expected? I'm not using the original OpenImages dataset; it's a custom dataset at a custom path, but I did not create a new class in datasets.py.

I'm using OPENIMAGES = '/mnt/ramdisk/openimages', but the files are custom, all inside the subfolders train, test, and val; all files are PNGs.

The code files are in another path.

june1819 commented 3 years ago

I got this error too, but after renaming the "val" folder to "validation" the error disappeared. Now I get a new error, "out of memory", even after setting batch_size = 2 and crop_size = 64. Could you post your default_config.py if you can run train.py successfully?

QingLicsaggie commented 3 years ago

@QLaHPD The code is not finding the dataset. You can print the data path in BaseDataset and in the derived dataset class to verify.
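A quick standalone check of what the dataset glob will see (using the path from the earlier comment):

import glob, os

# Count the files the dataset would pick up; 0 here reproduces the
# num_samples=0 error downstream in RandomSampler.
data_dir = '/mnt/ramdisk/root_folder/train'
for pattern in ('*.jpg', '*.png'):
    print(pattern, len(glob.glob(os.path.join(data_dir, pattern))))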

ahmedfgad commented 3 years ago

I encountered this error and solved it.

Note that this error may occur even when the model is able to find the dataset. Most people say there is a problem locating the dataset, but that is not always the case.

Like me, I think you are using a small dataset that does not have enough samples for each iteration. Here are the details.

The default batch size is 8. Assume you set the --n_steps parameter to 1e6. That means 1,000,000 iterations, each requiring 8 samples, so you would need 8,000,000 samples in total (8 * 1,000,000). If you have fewer than 8 million samples, the following error occurs:

ValueError: num_samples should be a positive integer value, but got num_samples=0

To solve it, set a smaller value for the --n_steps parameter, for example --n_steps 1:

python train.py --model_type compression --regime low --n_steps 1

I hope this helps.

yifeipet commented 2 years ago

I suggest you write your own dataloader and prepare a pre-cropped image dataset so you don't need to crop images every time.
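A minimal sketch of such a dataloader (assuming a flat folder of pre-cropped PNGs/JPGs; the class name and paths are illustrative):

import glob, os
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class CroppedImageDataset(Dataset):
    # Hypothetical dataset over a flat folder of pre-cropped images
    def __init__(self, data_dir):
        # Accept both JPGs and PNGs so a custom folder is never "empty"
        self.paths = sorted(
            p for ext in ('*.jpg', '*.png')
            for p in glob.glob(os.path.join(data_dir, ext)))
        if not self.paths:
            raise RuntimeError('No images found in {}'.format(data_dir))
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return self.to_tensor(Image.open(self.paths[idx]).convert('RGB'))

# Failing fast in __init__ is clearer than RandomSampler's num_samples=0 error
loader = DataLoader(CroppedImageDataset('/mnt/ramdisk/root_folder/train'),
                    batch_size=8, shuffle=True)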

yifeipet commented 2 years ago

Yes, writing your own dataloader solves this issue.