Hi, yes the project expects the dataset to be serialized to TFRecords. The exact schema can be seen in the data generators: https://github.com/SmilingWolf/JAX-CV/blob/8ccfd1f3ce40f5d9e7470c1f900493250e83732e/Generators/WDTaggerGen.py#L52
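For orientation, here is a rough sketch of what writing one such record could look like. The feature names below (`image_bytes`, `label_indexes`), the label indices, and the single-shard layout are assumptions made purely for illustration; the authoritative schema is the parsing code linked above.

```python
import tensorflow as tf

def make_example(image_path, label_ids):
    # Read already-encoded image bytes (e.g. a JPEG) from disk.
    image_bytes = tf.io.read_file(image_path).numpy()
    return tf.train.Example(
        features=tf.train.Features(
            feature={
                # Feature names here are assumptions for illustration only;
                # the real names/dtypes are whatever WDTaggerGen.py parses.
                "image_bytes": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[image_bytes])
                ),
                "label_indexes": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=label_ids)
                ),
            }
        )
    )

# Write a single-sample shard as a smoke test.
with tf.io.TFRecordWriter("shard_00000.tfrecord") as writer:
    writer.write(make_example("example.jpg", [0, 42, 1337]).SerializeToString())
```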
You're supposed to point the training script to a folder (--dataset-root in the example cmdline below), inside which you should have two other folders:
One more thing you need is a JSON file with a bit of info about the training data. For example, this is the one I'm using right now (wd_v3.json in the example cmdline below):
{
"num_classes": 10861,
"train_samples": 5750784,
"val_samples": 319488
}
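If you're building your own dataset, the sample counts in this file can be derived directly from the shards. A minimal sketch, assuming the train/val TFRecords live under two folders whose names below are placeholders (not necessarily what the training script expects):

```python
import json
import tensorflow as tf

def count_records(pattern):
    # tf.io.gfile.glob works for local paths as well as gs:// paths.
    files = tf.io.gfile.glob(pattern)
    return sum(1 for _ in tf.data.TFRecordDataset(files))

dataset_info = {
    "num_classes": 10861,  # however many tags your label vocabulary contains
    "train_samples": count_records("wd_v3/train/*.tfrecord"),  # placeholder path
    "val_samples": count_records("wd_v3/val/*.tfrecord"),      # placeholder path
}

with open("datasets/wd_v3.json", "w") as f:
    json.dump(dataset_info, f, indent=2)
```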
A typical cmdline looks like this: python training_loop.py --dataset-file datasets/wd_v3.json --dataset-root gs://<bucket>/wd_v3 --checkpoints-root gs://<bucket>/checkpoints --batch-size 32 --image-size 448 --patch-size 16 --epochs 10 --warmup-epochs 0 --learning-rate 0.0005 --weight-decay 0.005 --mixup-alpha 0.6 --drop-path-rate 0.2 --rotation-ratio 0.0 --cutout-max-pct 0.0 --cutout-patches 0 --model-name vit_base
This is a great project. It would be fantastic if you could write an introduction to the basic training workflow. Perhaps the management of these parameters could be scripted, although it seems that not many people pay attention to them. I have some questions I hope the author can answer. What is the number of classes in wd_v3.json? Where should we download the checkpoints from? Perhaps a more detailed introduction could be made available.
The num_classes field is the number of labels selected for training. In my specific case, it is the number of different Danbooru tags the output model is supposed to support.
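In other words, each sample's tag indices are typically turned into a multi-hot target of length num_classes. A tiny illustration of that mapping (the index values are made up, and whether the generators build the target exactly like this is defined by the linked code, not here):

```python
import numpy as np

num_classes = 10861
label_indexes = [0, 42, 1337]  # made-up tag indices for one image

# Multi-hot training target: 1.0 at each selected tag, 0.0 elsewhere.
target = np.zeros(num_classes, dtype=np.float32)
target[label_indexes] = 1.0
```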
Checkpoints are available on HuggingFace, here: https://huggingface.co/SmilingWolf. The v3 models are the ones compatible with this codebase. The msgpack checkpoints are supported only as local files (i.e. downloaded on the same machine where the script is running). Once you get started, checkpoints are saved by orbax-checkpoint, which supports both local paths and GCS buckets.
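As a minimal sketch of the two options (the file and directory names below are placeholders, and the way training_loop.py actually restores weights may differ):

```python
from flax import serialization
import orbax.checkpoint as ocp

# Option 1: a .msgpack file downloaded from HuggingFace, read from a local path.
with open("model.msgpack", "rb") as f:          # placeholder file name
    params = serialization.msgpack_restore(f.read())

# Option 2: orbax-checkpoint, which accepts local directories and gs:// buckets.
checkpointer = ocp.PyTreeCheckpointer()
checkpointer.save("/tmp/ckpt_example", params)  # or "gs://<bucket>/checkpoints/..."
restored = checkpointer.restore("/tmp/ckpt_example")
```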
Hi, yes the project expects the dataset to be serialized to TFRecords.
Can I have a toy example of this kind of TFRecord (e.g. aibooru's JSON file and TFRecord files, mentioned in #11)? I can try to make a data preparation script for this project.
We have tagged datasets from many different sites, so maybe I could run this training code on data from those sites.
Here should be almost everything necessary to create my exact same dataset, minus the images: https://we.tl/t-qlu2gXe3AO
If you make an honest-to-god, half-decent repo or notebook illustrating the steps from downloading the images to TFRecord creation, I'll add a README.md to this repo and gladly link to it.
It seems that the provided URL is invalid now. Is there an alternative, or have there been any developments on this?
I want to use other training sets; how should I edit the JSON?
I found TFrecord.py in Translator; thanks for the kind support on this request! I will close this issue since a translator is available.
Hi! Thanks for sharing this awesome project!
Is the Danbooru dataset used in this project formatted as TFRecords, or is it just a mixture of PNGs and metadata JSON?