About dataset format - Githubissues

gudwns1215 commented 8 months ago

Hi! Thanks for sharing awesome project!

Is the danbooru dataset used in this project is formatted in tfrecord? or just mixture of png and metadata json?

SmilingWolf commented 8 months ago

Hi, yes the project expects the dataset to be serialized to TFRecords. The exact schema can be seen in the data generators: https://github.com/SmilingWolf/JAX-CV/blob/8ccfd1f3ce40f5d9e7470c1f900493250e83732e/Generators/WDTaggerGen.py#L52

You're supposed to point the training script to a folder (--dataset-root in the example cmdline below), inside which you should have two other folders:

record_shards_train/*.tfrecord
record_shards_val/*.tfrecord

One more thing you need is a JSON file with a few infos about the training data. For example, this is the one I'm using right now (wd_v3.json in the example cmdline below):

{
  "num_classes": 10861,
  "train_samples": 5750784,
  "val_samples": 319488
}

A typical cmdline looks like this: python training_loop.py --dataset-file datasets/wd_v3.json --dataset-root gs://<bucket>/wd_v3 --checkpoints-root gs://<bucket>/checkpoints --batch-size 32 --image-size 448 --patch-size 16 --epochs 10 --warmup-epochs 0 --learning-rate 0.0005 --weight-decay 0.005 --mixup-alpha 0.6 --drop-path-rate 0.2 --rotation-ratio 0.0 --cutout-max-pct 0.0 --cutout-patches 0 --model-name vit_base

youhua1 commented 7 months ago

This is a great project. It would be fantastic if you could write an introduction to basic training methods. Perhaps you could script the management of these parameters, although it seems that not many people are paying attention to these parameters. I have some questions I hope the author can answer. What is the number of classes in wd_v3.json? Where should we download the checkpoints we are using? Perhaps there should be a more detailed introduction available.

SmilingWolf commented 6 months ago

The num_classes field is the number of labels selected for training. In my specific case, it is the number of different danbooru tags the output model is supposed to support.

Checkpoints are available on HuggingFace, here: https://huggingface.co/SmilingWolf The v3 models are the ones compatible with this codebase. The msgpack checkpoints are supported only as local files (ie. downloaded on the same machine where the script is running). Once you get started, the checkpoints are saved by orbax-checkpoint which supports both local paths as well as GCS buckets.

narugo1992 commented 5 months ago

Hi, yes the project expects the dataset to be serialized to TFRecords.

can i have a toy example of this kind of tfrecords (e.g. aibooru's json file and tfrecord files, mentioned in #11)? i can try to make a data preparation script for this project.

we have many tagged datasets from many different sites, maybe i can run this training code on different sites.

SmilingWolf commented 5 months ago

Here should be almost everything necessary to create my exact same dataset, minus the images: https://we.tl/t-qlu2gXe3AO

If you make an honest to god half decent repo or notebook illustrating the steps going from downloading the images to tfrecord creation I'll add a README.md to this repo and gladly link to it.

lzardy commented 3 months ago

Here should be almost everything necessary to create my exact same dataset, minus the images: https://we.tl/t-qlu2gXe3AO

If you make an honest to god half decent repo or notebook illustrating the steps going from downloading the images to tfrecord creation I'll add a README.md to this repo and gladly link to it.

It seems that this provided URL is invalid now. Is there an alternative or developments to this?

youhua1 commented 1 month ago

I want to use other training sets, how should I edit json

gudwns1215 commented 3 weeks ago

I found TFrecord.py in Translator, thanks for kind support to the request! I will close this issue since there is a translator.

SmilingWolf / JAX-CV

About dataset format #10