HaozheZhao / UltraEdit


How to use the .parquet file in UltraEdit dataset? #2

Closed luzhaoyan closed 2 months ago

luzhaoyan commented 2 months ago

I downloaded the dataset with git clone https://huggingface.co/datasets/BleachNick/UltraEdit [screenshot], but it does not seem to work, because the .parquet file content looks like this: [screenshot]. I also tried another method, using "snapshot_download" and then reading the file with pandas, and I get this: [screenshot]. But how can I use the images, text, and other contents? It seems I cannot use the .parquet files directly. How do I parse and use this dataset? https://huggingface.co/datasets/BleachNick/UltraEdit

HaozheZhao commented 2 months ago

I wonder if you ran git-lfs before downloading the whole UltraEdit repo:

git lfs install

After downloading the whole dataset, you can just load it with load_dataset from the datasets library.

from datasets import load_dataset
dataset = load_dataset("dataset_name")

Also, due to network issues, we split the 4 million UltraEdit samples into several chunks (2000 samples each) to push to the Hugging Face Hub, so it may require you to merge all the splits of the dataset.

luzhaoyan commented 2 months ago


Thanks for your reply, it's very helpful, but now I have another question: as you mentioned, how do I "merge all the splits of the dataset"?

HaozheZhao commented 2 months ago


Simply using split="all" will be fine. We split all of the free-form image editing data of UltraEdit into 2003 splits (2000 samples each). Each split of the dataset object is named in the form "FreeForm_xxx", where "xxx" ranges from 0 to 2002. To load the 4M free-form image editing data of UltraEdit:

from datasets import load_dataset
dataset = load_dataset("BleachNick/UltraEdit", split="all")

Also, if you just want to load a single parquet file, such as FreeForm_1005-00000-of-00002.parquet:

from datasets import load_dataset
file_path = "FreeForm_1005-00000-of-00002.parquet"
dataset = load_dataset('parquet', data_files=file_path)

luzhaoyan commented 1 month ago


Okay, I got it. Thank you for your patient answer!