Closed luzhaoyan closed 2 months ago
I wonder if you ran git-lfs before downloading the whole UltraEdit repo?
git lfs install
After downloading the whole dataset, you can load it with load_dataset from the datasets library.
from datasets import load_dataset
dataset = load_dataset("dataset_name")
Also, due to network issues, we split the 4 million UltraEdit samples into several chunks (2000 samples each) before pushing them to the Hugging Face Hub, so you may need to merge all the splits of the dataset.
Thanks for your reply, it's very helpful. But now I have another question: as you mentioned, how do I "merge all the splits of the dataset"?
Simply using split="all" will be fine. We split all of the free-form image editing data of UltraEdit into 2003 splits (each containing 2000 samples). Each split of the dataset object is named in the form "FreeForm_xxx", where "xxx" ranges from 0 to 2002.
To load the 4M free-form image editing data of UltraEdit:
from datasets import load_dataset
dataset = load_dataset("BleachNick/UltraEdit", split="all")
Also, if you just want to load a single parquet file, such as FreeForm_1005-00000-of-00002.parquet:
from datasets import load_dataset
file_path = "FreeForm_1005-00000-of-00002.parquet"
dataset = load_dataset("parquet", data_files=file_path)
Okay, I got it. Thank you for your patient answer!
I downloaded the dataset using this method: git clone https://huggingface.co/datasets/BleachNick/UltraEdit but it doesn't seem to work, because the .parquet file content looks like this: (screenshot omitted) And I tried another method, using "snapshot_download" and then reading the file with pandas, and I get this: (screenshot omitted) But how can I use the images, text, and other contents? It seems I can't use the .parquet files directly. How do I parse and use this dataset? https://huggingface.co/datasets/BleachNick/UltraEdit