LambdaLabsML / examples

Deep Learning Examples
MIT License

finetuning could include more info on making datasets #23

Open skeddles opened 1 year ago

skeddles commented 1 year ago

I want to try making a better version of the Pokemon dataset, but I'm not clear on how to combine the spreadsheet and images into a dataset. Perhaps you could provide some steps on how exactly the Pokemon one was created, or at least link to a related tutorial or piece of software.

offchan42 commented 1 year ago

Here's how I did it:

  1. Put all the JPEG images I need in the data/train/ folder.
  2. Create data/train/metadata.jsonl, which is basically a file containing an image path and its caption on each line. Here is an example line: {"file_name": "abc.jpg", "text": "an image of a cat"}. Read more here: https://huggingface.co/docs/datasets/image_dataset#imagefolder
  3. Use the datasets package from Hugging Face to read that data folder and then upload it to your Hugging Face account:

    from datasets import load_dataset

    # load_dataset() needs the "file_name" key to exist in your metadata file,
    # and it automatically looks for a folder named "train"
    dataset = load_dataset("data/")
    dataset.push_to_hub("your_username/your_dataset_repo_name", private=True)
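To get from a spreadsheet to the metadata.jsonl in step 2, something like the following works — a minimal sketch using only the standard library, assuming a hypothetical captions.csv exported from the spreadsheet with `file_name` and `text` columns (adjust the column names to whatever your sheet actually uses):

```python
import csv
import json

def csv_to_metadata_jsonl(csv_path, jsonl_path):
    """Convert a captions spreadsheet into the metadata.jsonl format the
    ImageFolder loader expects: one JSON object per line, with a
    "file_name" key plus any caption columns (here, "text")."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            record = {"file_name": row["file_name"], "text": row["text"]}
            dst.write(json.dumps(record) + "\n")
```

Point it at your exported sheet and write the output into data/train/metadata.jsonl, next to the images.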


Note that this `push_to_hub()` command will convert your image data into a `parquet` file which is an efficient data-storage format (I guess?).
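Before pushing, it can be worth sanity-checking the metadata file, since a bad line will surface as a confusing loader error. A minimal stdlib sketch (my own addition, assuming the images sit next to metadata.jsonl in data/train/):

```python
import json
from pathlib import Path

def check_metadata(train_dir):
    """Return a list of problems found in train_dir/metadata.jsonl:
    lines missing the required "file_name" key, or lines whose
    referenced image file does not exist."""
    train_dir = Path(train_dir)
    problems = []
    with open(train_dir / "metadata.jsonl", encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            record = json.loads(line)
            if "file_name" not in record:
                problems.append(f"line {i}: missing file_name")
            elif not (train_dir / record["file_name"]).is_file():
                problems.append(f"line {i}: {record['file_name']} not found")
    return problems
```

An empty list means the folder should load cleanly.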

That's it. Then you can download the dataset to any place using this command: `ds = load_dataset("username/dataset_repo", split="train", use_auth_token=True)`

Apply `use_auth_token=True` only if your dataset is private.

justinpinkney commented 1 year ago

There are also some more details in this issue: https://github.com/LambdaLabsML/examples/issues/16

This has confused many people; I'll try to improve the documentation in the near future!