2022-12-21 EDIT:
You can now use Vimeo90K for image compression training without any further processing. Modify the training script as follows:
from compressai.datasets import Vimeo90kDataset

train_dataset = Vimeo90kDataset(
    args.dataset, split="train", transform=train_transforms
)
test_dataset = Vimeo90kDataset(
    args.dataset, split="valid", transform=test_transforms
)
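For context, train_transforms and test_transforms above come from the example training script (examples/train.py); roughly, assuming the usual args.patch_size argument, they look like this:

from torchvision import transforms

# Random crops for training, center crops for evaluation.
train_transforms = transforms.Compose(
    [transforms.RandomCrop(args.patch_size), transforms.ToTensor()]
)
test_transforms = transforms.Compose(
    [transforms.CenterCrop(args.patch_size), transforms.ToTensor()]
)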
Previous preprocessing technique:
I'm not sure how the maintainers preprocessed their dataset, but here's my methodology:
wget http://data.csail.mit.edu/tofu/dataset/vimeo_triplet.zip
unzip vimeo_triplet.zip
python process_vimeo_triplet.py
where:
# process_vimeo_triplet.py
import os
import shutil
from pathlib import Path


def extract_dataset_split(in_dir: str, out_dir: str, list_filename: str):
    """Flatten the sequences listed in list_filename into a single directory."""
    with open(list_filename) as f:
        lines = f.read().splitlines()

    os.makedirs(out_dir, exist_ok=True)
    in_dir_path = Path(in_dir)
    out_dir_path = Path(out_dir)

    for subdir in lines:
        if subdir == "":
            continue
        subdir_path = in_dir_path / subdir
        # e.g. "00001/0001" becomes the filename prefix "00001_0001_".
        out_prefix = str(out_dir_path / (subdir.replace("/", "_") + "_"))
        in_images = os.listdir(subdir_path)
        for image in in_images:
            src = subdir_path / image
            dst = out_prefix + image
            print(f"{src} -> {dst}")
            shutil.copy2(src, dst)


extract_dataset_split(
    "vimeo_triplet/sequences",
    "vimeo90k_compressai/train",
    "vimeo_triplet/tri_trainlist.txt",
)
extract_dataset_split(
    "vimeo_triplet/sequences",
    "vimeo90k_compressai/test",
    "vimeo_triplet/tri_testlist.txt",
)
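After running the script, a quick sanity check (a minimal sketch; the paths match the script above) is to count the copied frames, since each listed sequence contributes three images (im1/im2/im3):

from pathlib import Path

# The count should be 3x the number of non-empty lines in the
# corresponding tri_trainlist.txt / tri_testlist.txt.
for split in ("train", "test"):
    n_images = len(list(Path("vimeo90k_compressai", split).glob("*.png")))
    print(split, n_images)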
which copies files as follows:
vimeo_triplet/sequences/00001/0001/im1.png -> vimeo90k_compressai/train/00001_0001_im1.png
vimeo_triplet/sequences/00001/0001/im2.png -> vimeo90k_compressai/train/00001_0001_im2.png
vimeo_triplet/sequences/00001/0001/im3.png -> vimeo90k_compressai/train/00001_0001_im3.png
vimeo_triplet/sequences/00001/0002/im1.png -> vimeo90k_compressai/train/00001_0002_im1.png
...
with the resulting directory tree:
vimeo90k_compressai/
    train/
        00001_0001_im1.png
        00001_0001_im2.png
        00001_0001_im3.png
        00001_0002_im1.png
        ...
    test/
        ...
Note that the "test" directory actually contains the validation set, since final testing is done via RD curves on the Kodak test set (768×512 images).
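For reference, CompressAI ships an evaluation utility that produces the RD points on an image folder such as Kodak. A typical invocation looks something like the following (the architecture name and checkpoint path are placeholders; check python -m compressai.utils.eval_model --help for the exact flags in your version):

python -m compressai.utils.eval_model checkpoint /path/to/kodak \
    -a bmshj2018-factorized -p checkpoint_best_loss.pth.tar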
Thank you for kindly sharing! I'll try it out.
Hi, sorry for the late reply. The above solution works perfectly. Another way to speed up training is to generate .npy files containing the train/validation sets. Please note that train.py is a "simple" example training loop, so this would require tweaking the example dataloader; see the sketch below.
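To illustrate the idea, here is a minimal, hypothetical sketch (pack_split and NpyImageDataset are not part of CompressAI): pack each preprocessed split into a single .npy file once, then serve images from a memory-mapped array at training time, so the per-image PNG decode is paid only once.

import numpy as np
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset


def pack_split(in_dir: str, out_file: str):
    # Stack all images of one split into a single uint8 array (N, H, W, 3).
    # Assumes all images share the same resolution, as in Vimeo90K (448x256).
    paths = sorted(Path(in_dir).glob("*.png"))
    arr = np.stack([np.asarray(Image.open(p).convert("RGB")) for p in paths])
    np.save(out_file, arr)


class NpyImageDataset(Dataset):
    # Hypothetical drop-in replacement for the example image-folder dataset.
    def __init__(self, npy_file: str, transform=None):
        # mmap_mode="r" avoids loading the whole array into RAM.
        self.images = np.load(npy_file, mmap_mode="r")
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = Image.fromarray(np.array(self.images[idx]))
        return self.transform(img) if self.transform else img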
Thank you for your outstanding and constructive contributions!
Recently I have been trying to use CompressAI to implement a custom network and compare its performance with other methods already implemented in CompressAI. However, I found that the Vimeo90K dataset has three different subsets, and I was confused about which one to download. I also did not find any further description of how to preprocess the Vimeo90K dataset in either the documentation or the paper. Could you please tell me which subset to download and how to preprocess it? Or does the choice of training dataset have little effect on the final performance?
Thanks again for your excellent work. Looking forward to your reply.