skeddles opened this issue 1 year ago
Here's how I did it:

Create a folder `data/train/` containing your images, plus a metadata file at `data/train/metadata.jsonl`, which is basically a file with an image path and a caption on each line. Here is an example of a line:

```json
{"file_name": "abc.jpg", "text": "an image of a cat"}
```

Read more here: https://huggingface.co/docs/datasets/image_dataset#imagefolder
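If your captions live in a spreadsheet, a short script can turn it into `metadata.jsonl`. This is only a sketch: the CSV column names (`file_name`, `text`) and file paths are assumptions, so adjust them to match your own spreadsheet.

```python
import csv
import json
from pathlib import Path

def csv_to_metadata_jsonl(csv_path, out_path, file_col="file_name", text_col="text"):
    """Convert a CSV of captions into the metadata.jsonl format the
    imagefolder loader expects: one JSON object per line with
    "file_name" and "text" keys."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            dst.write(json.dumps({"file_name": row[file_col],
                                  "text": row[text_col]}) + "\n")

# Demo with a tiny throwaway CSV (made-up names, for illustration only).
Path("captions.csv").write_text("file_name,text\nabc.jpg,an image of a cat\n")
csv_to_metadata_jsonl("captions.csv", "metadata.jsonl")
print(Path("metadata.jsonl").read_text())
# -> {"file_name": "abc.jpg", "text": "an image of a cat"}
```

Drop the resulting `metadata.jsonl` next to the images in `data/train/`.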
Then use the `datasets` package from Hugging Face to read that data folder and upload it to your Hugging Face account:
```python
from datasets import load_dataset

# load_dataset needs the "file_name" key to exist in your metadata file,
# and it looks for a folder named "train" automatically.
dataset = load_dataset("data/")
dataset.push_to_hub("your_username/your_dataset_repo_name", private=True)
```
Note that this `push_to_hub()` command will convert your image data into `parquet` files, an efficient columnar storage format.
That's it. You can then download the dataset anywhere with:

```python
ds = load_dataset("username/dataset_repo", split="train", use_auth_token=True)
```

Pass `use_auth_token=True` only if your dataset is private.
There are also some more details in this issue: https://github.com/LambdaLabsML/examples/issues/16
This has confused many people; I'll try to improve the documentation in the near future!
I want to try making a better version of the pokemon dataset, but I'm not clear on how to combine the spreadsheet / images into a dataset. Perhaps you could provide some steps on how exactly the pokemon one was created, or at least link to a related tutorial or piece of software.