Nikolai10 / PerCo

PyTorch implementation of PerCo (Towards Image Compression with Perfect Realism at Ultra-Low Bitrates, ICLR 2024)
Apache License 2.0

Perceptual Compression (PerCo)

This repository provides a PyTorch implementation of PerCo, based on the paper "Towards Image Compression with Perfect Realism at Ultra-Low Bitrates" (Careil et al., ICLR 2024).

Different from the original work, we use Stable Diffusion v2.1 (Rombach et al., CVPR 2022) as the latent diffusion model and hence refer to our work as PerCo (SD). This differentiates it from the official work, which builds on a proprietary, not publicly available, pre-trained variant of GLIDE (Nichol et al., ICML 2022).
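For orientation, the sketch below shows how the Stable Diffusion v2.1 backbone can be loaded with the standard Hugging Face diffusers API (the repository bundles a local copy of diffusers v0.27.0). This is a minimal illustration, not the repository's actual initialization code:

```python
# Minimal sketch: load the Stable Diffusion v2.1 components that PerCo (SD) builds on.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler

model_id = "stabilityai/stable-diffusion-2-1"

# Latent auto-encoder: bounds the achievable reconstruction quality.
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float16)

# Denoising U-Net: PerCo (SD) extends this class (see src/unet_2d_perco.py)
# so that it also accepts local features from the hyper-encoder.
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet", torch_dtype=torch.float16)

# SD v2.1 was trained with v-prediction, matching --prediction_type="v_prediction" below.
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
```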

Under active development.

Updates

06/16/2024

  1. Fine-tuned the whole U-Net (not just the linear layers)
  2. Slightly improved results (limited to 50k optimization steps)
  3. Released pre-trained models
  4. Ablation studies: experimented with LoRA and FSQ (no improvements achieved)

05/29/2024

  1. Switched back to the official hyper-encoder design, which resolved training instabilities
  2. Significantly improved results (limited to 50k optimization steps)

05/24/2024

  1. Initial release of this project

Visual Impressions

Visual comparison on the Kodak dataset for our lowest bit-rate, 0.0019 bpp (for a 768×512 Kodak image, roughly 0.0019 × 768 × 512 ≈ 750 bits, i.e. under 100 bytes per image). Column 1: ground truth. Columns 2-5: a set of reconstructions that reflect the uncertainty about the original image source.

kodim13 — global conditioning: "a river runs through a rocky forest with mountains in the background".
kodim22 — global conditioning: "a red barn with a pond in the background".
kodim23 — global conditioning: "two parrots standing next to each other with leaves in the background".

More visual results can be found here.

Quantitative Performance

In this section we quantitatively compare the performance of PerCo (SD v2.1) to the officially reported numbers. All our models were trained with a reduced number of optimization steps (50k). Note that reconstruction quality is upper-bounded by the LDM auto-encoder, denoted as SD v2.1 auto-encoder.
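This auto-encoder bound can be estimated by simply round-tripping an image through the SD v2.1 VAE. A minimal sketch under that assumption (the image path is a placeholder; MSE stands in for whichever metric you care about):

```python
# Sketch: estimate the SD v2.1 auto-encoder bound by encoding/decoding one image.
# "kodim13.png" is a placeholder path; image sides should be multiples of 8.
import numpy as np
import torch
from diffusers import AutoencoderKL
from PIL import Image

vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="vae").eval()

img = Image.open("kodim13.png").convert("RGB")
x = torch.from_numpy(np.array(img)).float().permute(2, 0, 1) / 127.5 - 1.0  # [-1, 1]
x = x.unsqueeze(0)

with torch.no_grad():
    z = vae.encode(x).latent_dist.mode()       # deterministic latent
    x_hat = vae.decode(z).sample.clamp(-1, 1)  # reconstruction

mse = torch.mean((x - x_hat) ** 2).item()  # swap in MS-SSIM/LPIPS/FID as needed
print(f"auto-encoder round-trip MSE: {mse:.6f}")
```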

We generally obtain highly competitive results in terms of perception (FID, KID), especially at ultra-low bit-rates, but at the cost of lower image fidelity (MS-SSIM, LPIPS). Note that PerCo (official) was trained for 5 epochs (9M training samples / batch size 160 × 5 epochs = 281,250 optimization steps), whereas we use 50k steps, roughly 18% of that budget. Also note that we have not yet considered LPIPS as an auxiliary loss, which is known to improve performance at higher bit-rates.
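For reference, wiring LPIPS in as an auxiliary loss could look roughly like the sketch below. The weighting factor and the availability of a decoded image during training are assumptions, not the repository's current training path (the repo ships a stable LPIPS variant in src/lpips_stable.py):

```python
# Sketch: LPIPS as an auxiliary loss next to the diffusion objective.
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="alex").eval()
lpips_weight = 0.1  # hypothetical hyper-parameter, would need tuning

def total_loss(diffusion_loss, x, x_hat):
    # x, x_hat: image tensors in [-1, 1], shape (B, 3, H, W)
    aux = lpips_fn(x, x_hat).mean()
    return diffusion_loss + lpips_weight * aux
```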

We will continue our experiments and hope to release more powerful variants at a later stage.

PerCo (official) vs. PerCo (SD v2.1)

Install

$ git clone https://github.com/Nikolai10/PerCo.git 

Please follow our Installation Guide with Docker.

Training / Inference / Evaluation

Please have a look at the example notebook for more information.

We use the OpenImagesV6 training dataset by default, similar to MS-ILLM. Please familiarize yourself with the data-loading mechanisms (see openimages_v6.py) and adjust the file paths and training settings in config.py accordingly. Corrupted images must be excluded; see _INVALID_IMAGE_NAMES for more details.
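The kind of filtering involved boils down to something like the sketch below; the root path and the exclusion-list entries are illustrative placeholders, and the real list lives in openimages_v6.py:

```python
# Sketch: a minimal OpenImagesV6-style folder dataset that skips known-corrupted
# files. _INVALID_IMAGE_NAMES here contains placeholder entries only.
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

_INVALID_IMAGE_NAMES = {"0a1b2c3d4e5f6789.jpg"}  # placeholder

class OpenImagesFolder(Dataset):
    def __init__(self, root, transform=None):
        self.paths = [p for p in sorted(Path(root).glob("*.jpg"))
                      if p.name not in _INVALID_IMAGE_NAMES]
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img) if self.transform else img
```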

We also provide a simplified Google Colab demo that integrates any tfds dataset (e.g. CLIC 2020) with no data engineering involved: open tutorial.
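Conceptually, the wrapper in tfds_interface.py amounts to something like the following sketch; the dataset name "clic" and the "image" feature key are assumptions for illustration:

```python
# Sketch: expose a tfds image dataset to PyTorch as an iterable dataset.
import tensorflow_datasets as tfds
import torch
from torch.utils.data import IterableDataset

class TFDSImageDataset(IterableDataset):
    def __init__(self, name="clic", split="train"):
        self.ds = tfds.load(name, split=split)

    def __iter__(self):
        for example in tfds.as_numpy(self.ds):
            img = example["image"]  # HWC uint8
            yield torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
```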

TODOs


Pre-trained Models

Pre-trained models corresponding to 0.1250 bpp, 0.0313 bpp and 0.0019 bpp can be downloaded here.

All models were trained on a DGX H100 using the following command (8 processes × per-GPU batch size 20 = global batch size 160, matching the official setting):

# note: prediction_type must match the prediction_type set in config.py
!accelerate launch --multi_gpu --num_processes=8 /tf/notebooks/PerCo/src/train_sd_perco.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1" \
--validation_image "/tf/notebooks/PerCo/res/eval/kodim13.png" "/tf/notebooks/PerCo/res/eval/kodim23.png" \
--allow_tf32 \
--dataloader_num_workers=12 \
--resolution=512 --center_crop --random_flip \
--train_batch_size=20 \
--gradient_accumulation_steps=1 \
--num_train_epochs=5 \
--max_train_steps 50000 \
--validation_steps 500 \
--prediction_type="v_prediction" \
--checkpointing_steps 500 \
--learning_rate=1e-05 \
--adam_weight_decay=1e-2 \
--max_grad_norm=1 \
--lr_scheduler="constant" \
--lr_warmup_steps=10000 \
--checkpoints_total_limit=2 \
--output_dir="/tf/notebooks/PerCo/res/cmvl_2024"

If you find better hyper-parameters, please share them with the community.

Directions for Improvement

File Structure

 docker                                             # Docker functionality + dependencies
     ├── install.txt
 notebooks                                          # Jupyter notebooks
     ├── FilterMSCOCO.ipynb                         # how to obtain MS-COCO 30k
     ├── PerceptualCompression.ipynb                # how to train and eval PerCo
 res
     ├── cmvl_2024/                                 # saved model, checkpoints, log files
     ├── data/                                      # evaluation data (must be downloaded separately)
     │   ├── Kodak/                                 # Kodak dataset (https://r0k.us/graphics/kodak/, 24 images)
     │   ├── Kodak_gen/                             # Kodak reconstructions
     │   ├── MSCOCO30k/                             # MS-COCO 30k dataset (see ./notebooks/FilterMSCOCO.ipynb)
     │   ├── MSCOCO30k_gen/                         # MS-COCO 30k reconstructions
     ├── doc/                                       # additional resources
     ├── eval/                                      # sample images + reconstructions
 src
     ├── diffusers/                                 # local copy of https://github.com/huggingface/diffusers (v0.27.0)
     ├── compression_utils.py                       # CLI tools for PerCo compression/decompression
     ├── config.py                                  # PerCo global configuration (training + inference)
     ├── helpers.py                                 # helper functionality
     ├── hyper_encoder_v2.py                        # hyper-encoder + quantization (based on HiFiC)
     ├── hyper_encoder.py                           # hyper-encoder + quantization (based on ELIC)
     ├── lpips_stable.py                            # stable LPIPS implementation based on MS-ILLM/NeuralCompression
     ├── openimages_v6.py                           # minimalistic dataloader for OpenImagesV6
     ├── pipeline_sd_perco.py                       # custom Hugging Face pipeline bundling image generation (= decompression)
     ├── tfds_interface.py                          # simple PyTorch wrapper for tfds datasets
     ├── train_sd_perco.py                          # PerCo training functionality
     ├── unet_2d_perco.py                           # extended UNet2DConditionModel that accepts local features from the hyper-encoder

Acknowledgment

This project is based on / takes inspiration from:

We thank the authors for providing us with the official evaluation points as well as helpful insights.

Interested in Working with Me?

Feel free to reach out: nikolai.koerber@tum.de. I am particularly interested in PhD intern positions.

License

Apache License 2.0