
Towards Scalable Neural Representation for Diverse Videos (CVPR 2023)

Project Page | Paper

The official repository of our paper "Towards Scalable Neural Representation for Diverse Videos".

(Teaser figure)

Model Overview

(Model architecture figure)

Requirements

You can set up the conda environment by running:

conda create -n dnerv python=3.9.7
conda activate dnerv
conda install pytorch torchvision pytorch-cuda=11.7 -c pytorch -c nvidia
pip install tensorboard
pip install tqdm dahuffman pytorch_msssim
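
After installation, a quick sanity check (a minimal sketch, not part of the repo) confirms the CUDA build and that the pip dependencies resolved:

import torch
import pytorch_msssim, dahuffman   # verify the pip dependencies import cleanly

print(torch.__version__)           # should report a CUDA-enabled build
print(torch.cuda.is_available())   # True on a machine with a working GPU
print(torch.cuda.device_count())   # the paper trains on 4 GPUs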

Video Compression

We adopt the existing deep image compression models provided by CompressAI. We provide the pre-extracted ground-truth video frames and pre-compressed keyframes for the UVG and UCF101 datasets in this Google Drive link.
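
For reference, compressing a keyframe with one of CompressAI's pretrained codecs looks roughly like the sketch below. CompressAI is a separate pip install, and the codec choice, quality level, and file path are illustrative assumptions; the Google Drive archive above already contains pre-compressed keyframes, so this is only needed for new data.

import torch
from torchvision.io import read_image
from compressai.zoo import cheng2020_anchor

# Pretrained deep image codec (model and quality level are assumptions).
codec = cheng2020_anchor(quality=5, pretrained=True).eval()

# CompressAI expects NCHW float tensors in [0, 1]; the path is hypothetical.
x = read_image("data/UVG/gt/Beauty/frame_000.png").float().unsqueeze(0) / 255.0

with torch.no_grad():
    out = codec.compress(x)  # entropy-coded bitstreams + latent shape
    rec = codec.decompress(out["strings"], out["shape"])["x_hat"]

# Bits-per-pixel of the compressed keyframe.
num_bits = 8 * sum(len(s[0]) for s in out["strings"])
print(f"keyframe bpp: {num_bits / (x.shape[-2] * x.shape[-1]):.4f}")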

Unzip it under the data/ folder and make sure the data structure is as below.

    data/
    ├── UVG/
    │   ├── gt/
    │   ├── keyframe/
    │   └── annotation/
    └── UCF101/
        ├── gt/
        ├── keyframe/
        └── annotation/

Please note that we split the 1024x1920 UVG videos into non-overlapping 256x320 frame patches during training due to GPU memory limitations.
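
Concretely, each 1024x1920 frame yields a 4x6 grid of 24 non-overlapping 256x320 patches; a minimal sketch of the split (variable names are our own, not the repo's):

import torch

frame = torch.rand(3, 1024, 1920)      # one C x H x W video frame
ph, pw = 256, 320                      # patch height / width

# Unfold H and W into non-overlapping windows: (C, 4, 6, 256, 320),
# then flatten the 4x6 grid into a batch of patches.
patches = frame.unfold(1, ph, ph).unfold(2, pw, pw)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3, ph, pw)
print(patches.shape)                   # torch.Size([24, 3, 256, 320])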

Running

Training

We train our model on 4 RTX-A6000 GPUs. To compare with other state-of-the-art video compression methods, we train for 1600 epochs on the UVG dataset and 800 epochs on the UCF101 dataset. You can reduce the number of epochs to shorten the training time.

# UVG dataset
python train.py --dataset UVG --model_type ${model_type} --model_size ${model_size} \
    -e 1600 -b 32 --lr 5e-4 --loss_type Fusion6 -d

# UCF101 dataset
python train.py --dataset UCF101 --model_type ${model_type} --model_size ${model_size} \
    -e 800 -b 32 --lr 5e-4 --loss_type Fusion19 -d
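
Here ${model_type} is NeRV or D-NeRV and ${model_size} is one of XS/S/M/L/XL as listed in the tables below. For example, to train the medium D-NeRV model on UVG:

python train.py --dataset UVG --model_type D-NeRV --model_size M \
    -e 1600 -b 32 --lr 5e-4 --loss_type Fusion6 -d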

Testing

# Evaluate model without model quantization
python train.py --dataset UVG --model_type D-NeRV --model_size M \
        --eval_only --model saved_model/UVG/D-NeRV_M.pth

# Evaluate model with model quantization
python train.py --dataset UVG --model_type D-NeRV --model_size M \
        --eval_only --model saved_model/UVG/D-NeRV_M.pth --quant_model
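
The --quant_model path corresponds to the int8 quantization and entropy coding assumed by the BPP formula below. A rough sketch of the idea using dahuffman; the helper and the per-tensor scheme are illustrative assumptions, not the repo's exact code:

import torch
from dahuffman import HuffmanCodec

def quantize_int8(w):
    # Symmetric per-tensor int8 quantization: map [-max|w|, max|w|] to [-127, 127].
    scale = w.abs().max() / 127.0
    return torch.round(w / scale).to(torch.int8), scale

# Quantize one weight tensor and measure its entropy-coding ratio.
w = torch.randn(64, 64)
q, scale = quantize_int8(w)

symbols = q.flatten().tolist()
codec = HuffmanCodec.from_data(symbols)   # Huffman table fit on the weights
encoded = codec.encode(symbols)

# Coded bytes per int8 symbol = coded bits / raw int8 bits;
# the tables below report roughly 0.88 for the released models.
print(f"entropy-encoding ratio: {len(encoded) / len(symbols):.3f}")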

Dump Predicted Frames

python train.py --dataset UVG --model_type D-NeRV --model_size M \
        --eval_only --model saved_model/UVG/D-NeRV_M.pth --quant_model \
        --dump_images

Please note that, for the UVG dataset, since we split the 1024x1920 videos into 256x320 frame patches, the PSNR/MS-SSIM computed on patches differs from the actual PSNR/MS-SSIM of the full 1024x1920 frames. Therefore, we need to dump the predicted frame patches first, and then re-evaluate PSNR/MS-SSIM against the ground-truth 1024x1920 video frames.
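
A minimal sketch of that re-evaluation, stitching the 4x6 patch grid back into full frames before scoring (file layout and names are assumptions; random tensors stand in for real data):

import torch
from pytorch_msssim import ms_ssim

def stitch(patches):
    # (24, 3, 256, 320) patches in row-major grid order -> (3, 1024, 1920) frame
    grid = patches.reshape(4, 6, 3, 256, 320).permute(2, 0, 3, 1, 4)
    return grid.reshape(3, 1024, 1920)

pred = stitch(torch.rand(24, 3, 256, 320))  # stand-in for the dumped patches
gt = torch.rand(3, 1024, 1920)              # full-resolution ground truth

mse = torch.mean((pred - gt) ** 2)
psnr = 10 * torch.log10(1.0 / mse)          # frames assumed in [0, 1]
msssim = ms_ssim(pred.unsqueeze(0), gt.unsqueeze(0), data_range=1.0)
print(f"PSNR {psnr:.2f} dB, MS-SSIM {msssim:.4f}")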

PSNR/MS-SSIM vs. BPP Ratio Calculation

UVG Dataset

Results for different model configurations are shown in the following table. The PSNR/MS-SSIM results are reported from the model with quantization.

| Model Arch | Model Param (M) | Entropy Encoding | Keyframe Size (Mb) | Total (Mb) | BPP | PSNR | MS-SSIM | Link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| D-NeRV XS | 8.02 | 0.883 | 88.39 | 145.0 | 0.0189 | 34.11 | 0.9479 | link |
| D-NeRV S | 15.96 | 0.881 | 88.39 | 200.9 | 0.0262 | 34.76 | 0.9540 | link |
| D-NeRV M | 24.20 | 0.880 | 123.2 | 293.6 | 0.0383 | 35.74 | 0.9604 | link |
| D-NeRV L | 41.66 | 0.877 | 175.1 | 467.3 | 0.0609 | 36.78 | 0.9668 | link |
| D-NeRV XL | 69.75 | 0.875 | 254.7 | 730.3 | 0.0952 | 37.43 | 0.9719 | link |

UCF101 Dataset (training split)

| Model Arch | Model Param (M) | Entropy Encoding | Keyframe Size (Mb) | Total (Mb) | BPP | PSNR | MS-SSIM | Link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| D-NeRV S | 21.40 | 0.882 | 481.6 | 632.7 | 0.0559 | 28.11 | 0.9153 | link |
| D-NeRV M | 38.90 | 0.891 | 481.6 | 758.7 | 0.0671 | 29.15 | 0.9364 | link |
| D-NeRV L | 61.30 | 0.891 | 481.6 | 918.3 | 0.0812 | 29.97 | 0.9501 | link |
| NeRV S | 88.00 | 0.903 | - | 635.9 | 0.0562 | 26.78 | 0.9094 | link |
| NeRV M | 105.3 | 0.900 | - | 758.4 | 0.0671 | 27.06 | 0.9177 | link |
| NeRV L | 127.2 | 0.903 | - | 919.1 | 0.0813 | 27.61 | 0.9284 | link |

NeRV does not use keyframes, so its total size is the entropy-coded model size alone.

BPP Calculation

$BPP=\dfrac{\overbrace{\text{Model Param} \times 8}^{\text{int8 quantization}} \times \text{Entropy Encoding} + \text{Keyframe Size}}{\text{H} \times \text{W} \times \text{Num Frames}}$

For the UVG dataset: H = 1024, W = 1920, Num Frames = 3900.

For the UCF101 dataset (training split): H = 256, W = 320, Num Frames = 138041.
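
As a sanity check, plugging the D-NeRV-M UVG row into the formula reproduces the table:

# BPP for D-NeRV M on UVG, using the numbers from the table above
param = 24.20e6      # model parameters
entropy = 0.880      # entropy-encoding ratio
keyframe = 123.2e6   # keyframe size in bits (123.2 Mb)
H, W, num_frames = 1024, 1920, 3900

total_bits = param * 8 * entropy + keyframe              # ~293.6 Mb, matches "Total"
print(f"bpp: {total_bits / (H * W * num_frames):.4f}")   # 0.0383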

Citation

If you find our code or our paper useful for your research, please star this repo and cite the following paper:

@inproceedings{he2023dnerv,
  title = {Towards Scalable Neural Representation for Diverse Videos},
  author = {He, Bo and Yang, Xitong and Wang, Hanyu and Wu, Zuxuan and Chen, Hao and Huang, Shuaiyi and Ren, Yixuan and Lim, Ser-Nam and Shrivastava, Abhinav},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2023},
}