eisneim closed this issue 2 weeks ago
Hi @eisneim ,
Thank you for the suggestion. Yes, as you said, while using more tokens can lead to better reconstruction, it also introduces additional challenges for Transformers to learn. There should be a balance between the number of tokens and the final results.
We will consider expanding the Open-MAGVIT2 tokenizer family with scaled-up training data, larger backbones, higher compression ratios, and so on. However, due to numerous tasks in progress (e.g., finalizing autoregressive training), training a tokenizer with higher compression ratios is not currently a high priority. In fact, the training can be initiated by simply modifying the config files. You are also welcome to contribute to our repository if you decide to try it in the future.
Thank you @yxgeee, I'll try training a higher compression ratio on ImageNet, but with just one RTX 4090 it might take a long time.
@yxgeee here is what I changed:

ch_mult: [1,1,1,2,2,4]

This changes the latent resolution from 16x16 to 8x8. On a 4090 with batch size 10, memory usage is 22GB at 2.92 it/s.
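For reference, the downsampling factor follows directly from the length of `ch_mult`: in the taming-style encoder that Open-MAGVIT2 uses, each entry after the first adds one stride-2 downsampling stage. A quick sanity check (assuming the default f16 config uses `ch_mult: [1,1,2,2,4]`):

```python
# Each entry in ch_mult after the first adds one stride-2 downsample,
# so the latent side length is input_size / 2**(len(ch_mult) - 1).
def latent_size(input_size: int, ch_mult: list) -> int:
    return input_size // 2 ** (len(ch_mult) - 1)

print(latent_size(256, [1, 1, 2, 2, 4]))     # 5 entries -> 4 downsamples -> 16x16
print(latent_size(256, [1, 1, 1, 2, 2, 4]))  # 6 entries -> 5 downsamples -> 8x8
```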
```python
import os
from typing import Tuple, Union

from torch.utils.data import Dataset

from taming.data.base import ImagePaths


def parse_dir(root_dir):
    """Recursively collect .jpg/.jpeg/.png paths, skipping hidden files."""
    items = []
    for root, dirs, files in os.walk(root_dir):
        for file in files:
            if file.lower().endswith((".jpg", ".jpeg", ".png")) and not file.startswith("."):
                items.append(os.path.join(root, file))
    return items


class CustomdirBase(Dataset):
    def __init__(self):
        super().__init__()
        self.data = []

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]


class CustomdirTrain(CustomdirBase):
    def __init__(self, root: str,
                 size: Union[Tuple[int, int], int] = 256) -> None:
        super().__init__()
        abspaths = parse_dir(root)
        print("----> train images", root, len(abspaths))
        self.data = ImagePaths(abspaths,
                               labels=None,
                               size=size,
                               random_crop=True)


class CustomdirValidation(CustomdirBase):
    def __init__(self, root: str,
                 size: Union[Tuple[int, int], int] = 256) -> None:
        super().__init__()
        abspaths = parse_dir(root)
        print("----> validate images", root, len(abspaths))
        self.data = ImagePaths(abspaths,
                               labels=None,
                               size=size,
                               random_crop=False)  # center crop for deterministic validation


class CustomdirTest(CustomdirBase):
    def __init__(self, root: str,
                 size: Union[Tuple[int, int], int] = 256) -> None:
        super().__init__()
        abspaths = parse_dir(root)
        print("----> test images", root, len(abspaths))
        self.data = ImagePaths(abspaths,
                               labels=None,
                               size=size,
                               random_crop=False)  # center crop for deterministic testing
```
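The directory-walking logic is easy to sanity-check in isolation; a minimal stdlib-only sketch (no taming dependency, using a throwaway temp directory):

```python
import os
import tempfile


def parse_dir(root_dir):
    # Collect .jpg/.jpeg/.png files recursively, skipping hidden files.
    items = []
    for root, dirs, files in os.walk(root_dir):
        for file in files:
            if file.lower().endswith((".jpg", ".jpeg", ".png")) and not file.startswith("."):
                items.append(os.path.join(root, file))
    return items


with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "sub"))
    # Mixed-case extension, nested file, hidden file, and a non-image.
    for name in ("a.JPG", os.path.join("sub", "b.png"), ".hidden.jpg", "notes.txt"):
        open(os.path.join(d, name), "w").close()
    found = sorted(os.path.basename(p) for p in parse_dir(d))
    print(found)  # → ['a.JPG', 'b.png']
```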
Many thanks to the authors of this project!
Bytedance's TiTok uses a 1D codebook to achieve an impressive 256x256 -> 32-token compression ratio, which is very useful for long-video multimodal understanding tasks.
Do you have any plans to train a higher-compression-ratio MAGVIT2, e.g. 256x256 -> 8x8? This might cause rFID to go up and make it unusable for image generation, but such a tokenizer would be very useful for multimodal LLMs.
Thanks!