Open liming-ai opened 1 year ago
Yes, in order to make the dataset more readable and scalable, we have tidied up the dataset so that it does not directly match the training code. We will subsequently update the training code to match the Huggingface dataset, which is organised into single images, pairs, and groups (e.g. 8k, 8k_pair, 8k_group), so you can choose the one that is most convenient for you.
@xujz18
Huge thanks for the quick reply. May I ask when it will be released? If it is convenient, can you give me some suggestions so that I can solve this problem more quickly?
We apologize that the training-code update may still be a week or two away, as the collaborator responsible for assembling that part of the code has recently been busy with important matters of his own. As for a quick workaround: the data format expected by make_dataset.py is the same as https://github.com/THUDM/ImageReward/blob/main/data/test.json. You can take the ImageRewardDB "8k_group" config (the preview on HuggingFace gives a visual impression of it) and convert it to the "test.json" format. Or, more directly, you can modify rank_pair_dataset.py to read the HuggingFace ImageRewardDB.
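To illustrate the first suggestion, here is a minimal sketch of turning one "8k_group" row into a test.json-style record. Only the "generations" key is confirmed by the code (rank_pair_dataset.py reads item["generations"]); the other target field names ("id", "ranking") and the exact HF column names are assumptions to check against test.json and the dataset card before use.

```python
# Sketch: convert one grouped ImageRewardDB row into a test.json-style record.
# Column names (prompt_id, prompt, rank) follow the dataset card; target field
# names other than "generations" are assumptions.

def group_row_to_entry(row):
    """Turn one grouped row (all images for a prompt) into a
    {"id", "prompt", "generations", "ranking"} record."""
    return {
        "id": row["prompt_id"],
        "prompt": row["prompt"],
        # one identifier per image in the group
        "generations": [f'{row["prompt_id"]}-{i}' for i in range(len(row["rank"]))],
        # the human preference rank of each image
        "ranking": row["rank"],
    }

# Hypothetical grouped row mimicking the HF schema:
sample = {
    "prompt_id": "000001-0001",
    "prompt": "a painting of a fox",
    "rank": [2, 1, 3],
}
entry = group_row_to_entry(sample)
print(entry["ranking"])  # [2, 1, 3]
```

Applied to every row of the "8k_group" split, this would produce a list you can dump with json.dump as a drop-in train/validation/test file.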
Thanks a lot, @xujz18! I have one last question: how can I use ImageRewardDB "8k_group" as you mentioned just now? Can I load the 8k_group or 8k_pair subset directly with datasets.load_dataset?
Yes. Like this:
load_dataset("THUDM/ImageRewardDB", "8k_group")
Thanks a lot!
Hi @xujz18, I tried to download the 8k_group subset following your instructions with the following code:
from datasets import load_dataset
dataset = load_dataset("THUDM/ImageRewardDB", "8k_group", num_proc=8)
dataset.save_to_disk("data/ImageRewardDB_8k_group")
However, I get the following error:
Found cached dataset image_reward_db (/Users/bytedance/.cache/huggingface/datasets/THUDM___image_reward_db/8k_group/1.0.0/33d18fdde6cd866eeeab2de1471592b802627df4ade050865b4e88c500ee63b7)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 165.72it/s]
Traceback (most recent call last):
File "/Users/bytedance/Code/download.py", line 4, in <module>
dataset.save_to_disk("data/ImageRewardDB_8k_group")
File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/dataset_dict.py", line 1225, in save_to_disk
dataset.save_to_disk(
File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/arrow_dataset.py", line 1421, in save_to_disk
for job_id, done, content in Dataset._save_to_disk_single(**kwargs):
File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/arrow_dataset.py", line 1458, in _save_to_disk_single
writer.write_table(pa_table)
File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/arrow_writer.py", line 570, in write_table
pa_table = embed_table_storage(pa_table)
File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/table.py", line 2290, in embed_table_storage
arrays = [
File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/table.py", line 2291, in <listcomp>
embed_array_storage(table[name], feature) if require_storage_embed(feature) else table[name]
File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/table.py", line 1837, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/table.py", line 1837, in <listcomp>
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/table.py", line 2190, in embed_array_storage
casted_values = _e(array.values, feature.feature)
File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/table.py", line 1839, in wrapper
return func(array, *args, **kwargs)
File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/table.py", line 2164, in embed_array_storage
return feature.embed_storage(array)
File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/features/image.py", line 263, in embed_storage
storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
File "pyarrow/array.pxi", line 2788, in pyarrow.lib.StructArray.from_arrays
File "pyarrow/array.pxi", line 3243, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean
Besides, this error happens for both 8k_group and 8k_pair.
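Until the save_to_disk path is fixed, one possible workaround is to skip save_to_disk entirely and export the rows yourself: write each image to a file and keep the remaining fields in a JSON sidecar. This is a sketch, not the library's supported export path; it assumes each row carries a PIL image under an "image" column, so adjust the column name to the real schema.

```python
# Workaround sketch: manual export instead of dataset.save_to_disk, which
# fails here with the pyarrow mask TypeError. Assumes row["image"] is a
# PIL.Image.Image (or anything with a .save(path) method).
import json
import os

def export_rows(rows, out_dir):
    """Save each row's image as a file and collect the non-image fields."""
    os.makedirs(out_dir, exist_ok=True)
    metadata = []
    for i, row in enumerate(rows):
        path = os.path.join(out_dir, f"{i:06d}.png")
        row["image"].save(path)  # PIL images support .save(path)
        rest = {k: v for k, v in row.items() if k != "image"}
        metadata.append({"file": path, **rest})
    with open(os.path.join(out_dir, "metadata.json"), "w") as f:
        json.dump(metadata, f)
    return metadata

# Usage (not run here):
# rows = load_dataset("THUDM/ImageRewardDB", "8k_group", split="train")
# export_rows(rows, "data/ImageRewardDB_8k_group/train")
```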
Any news? Is the training code finished?
I tried both ways, but got the same error:
cannot identify image file 'cache_dir_\\downloads\\extracted\\ba65aabab9974598536781d2df59b59457a2473f4a5341d7c0e8d0dc1988830f\\deb066d4-54aa-4562-8d30-2c67a6badb98.webp' cannot identify image file 'cache_dir_\\downloads\\extracted\\ba65aabab9974598536781d2df59b59457a2473f4a5341d7c0e8d0dc1988830f\\832ff14c-14cd-4d35-965d-bd2c1616d598.webp' cannot identify image file 'cache_dir_\\downloads\\extracted\\ba65aabab9974598536781d2df59b59457a2473f4a5341d7c0e8d0dc1988830f\\99dddbdd-a5d3-41af-98f7-a2f8927405fe.webp'
It seems that there are a few invalid images on HuggingFace.
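As a stopgap, you could detect and skip the corrupt downloads before decoding. A cheap, stdlib-only check (an assumption on my part, not an official fix) is to verify the RIFF/WEBP magic bytes, which catches truncated or zero-byte files; a stricter check would additionally open each file with PIL and call verify().

```python
# Sketch: cheap validity check for .webp files based on the container header.
# A WebP file starts with b"RIFF" + 4 size bytes + b"WEBP" (12 bytes total).

def looks_like_webp(path):
    """Return True if the file starts with a RIFF....WEBP header."""
    try:
        with open(path, "rb") as f:
            header = f.read(12)
    except OSError:
        return False
    return len(header) == 12 and header[:4] == b"RIFF" and header[8:12] == b"WEBP"
```

Filtering the extracted cache directory with this predicate (or a PIL verify() loop) before building the dataset should sidestep the "cannot identify image file" failures, at the cost of dropping those few samples.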
If I run load_dataset("THUDM/ImageRewardDB", "8k_pair"), I get an error:
Traceback (most recent call last):
File "/miniconda3/envs/torch1.13.0/lib/python3.8/site-packages/datasets/builder.py", line 1637, in _prepare_split_single
num_examples, num_bytes = writer.finalize()
File "/miniconda3/envs/torch1.13.0/lib/python3.8/site-packages/datasets/arrow_writer.py", line 579, in finalize
self.check_duplicate_keys()
File "/miniconda3/envs/torch1.13.0/lib/python3.8/site-packages/datasets/arrow_writer.py", line 501, in check_duplicate_keys
raise DuplicatedKeysError(key, duplicate_key_indices)
datasets.keyhash.DuplicatedKeysError: Found multiple examples generated with the same key
The examples at index 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17 have the key 000904-0035
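The DuplicatedKeysError means the dataset's loading script yields the same key (here the prompt_id 000904-0035) for all 18 pair rows of one group. The usual fix, which would need to land in the Hub loading script's _generate_examples, is to disambiguate keys with a running index; here is the pattern as a standalone sketch (the script's actual structure is an assumption):

```python
# Sketch of the standard DuplicatedKeysError fix: suffix each yielded key
# with the row index so pairs from the same prompt get distinct keys.

def generate_with_unique_keys(examples):
    """Yield (key, example) pairs with per-row unique keys."""
    for idx, example in enumerate(examples):
        # "000904-0035" alone repeats across the 18 pair rows of one group,
        # so append the row index to keep every key distinct.
        yield f'{example["prompt_id"]}-{idx}', example

pairs = list(generate_with_unique_keys(
    [{"prompt_id": "000904-0035"}, {"prompt_id": "000904-0035"}]
))
print([k for k, _ in pairs])  # ['000904-0035-0', '000904-0035-1']
```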
Hi, @xujz18 @Xiao9905
Thanks for this nice contribution. I noticed that we can load ImageReward data with:
datasets.load_dataset("THUDM/ImageRewardDB", "8k")
However, the loaded data do not seem to match the existing code, and I have no idea how to move on. (I downloaded the HuggingFace data and saved it to disk, so I use load_from_disk to load it):
train_dataset = load_from_disk("data/RLHF/ImageRewardDB_8k/train")
valid_dataset = load_from_disk("data/RLHF/ImageRewardDB_8k/validation")
test_dataset = load_from_disk("data/RLHF/ImageRewardDB_8k/test")
When I print train_dataset[0].keys(), it shows the same result as in the HuggingFace dataset introduction:
dict_keys(['image', 'prompt_id', 'prompt', 'classification', 'image_amount_in_total', 'rank', 'overall_rating', 'image_text_alignment_rating', 'fidelity_rating'])
When I run python src/make_dataset.py, following the instructions in the README, this error happens:
making dataset: 0%| | 0/10000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/tiger/code/ImageReward/train/src/make_dataset.py", line 12, in <module>
train_dataset = rank_pair_dataset("train")
File "/home/tiger/code/ImageReward/train/src/rank_pair_dataset.py", line 59, in __init__
self.data = self.make_data()
File "/home/tiger/code/ImageReward/train/src/rank_pair_dataset.py", line 80, in make_data
for generations in item["generations"]:
KeyError: 'generations'
Unfortunately, it is not compatible with the existing dataset code: https://github.com/THUDM/ImageReward/blob/1beb4e4de0932acbe7fc090c51208048b6269b58/train/src/rank_pair_dataset.py#L47
Does this mean we have to re-write the code if we want to use the downloaded dataset from HuggingFace?
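Not necessarily a full rewrite: since the "8k" rows expose a per-image "rank" (lower = better), the preference pairs that rank_pair_dataset.py trains on can be rebuilt from the ranks. A minimal sketch, with image identifiers and the lower-is-better convention as assumptions:

```python
# Sketch: rebuild (preferred, rejected) pairs from per-image ranks, as a
# bridge from the HF "8k" schema to a pairwise-ranking dataset. Assumes
# lower rank means the better image; equal ranks are ties and produce no pair.
from itertools import combinations

def ranks_to_pairs(items):
    """items: list of (image_id, rank). Returns (better_id, worse_id) pairs."""
    pairs = []
    for (id_a, r_a), (id_b, r_b) in combinations(items, 2):
        if r_a < r_b:
            pairs.append((id_a, id_b))
        elif r_b < r_a:
            pairs.append((id_b, id_a))
    return pairs

print(ranks_to_pairs([("img0", 2), ("img1", 1), ("img2", 3)]))
# [('img1', 'img0'), ('img0', 'img2'), ('img1', 'img2')]
```

Grouping the "8k" rows by prompt_id and feeding each group's (image, rank) list through this function would give per-prompt training pairs without touching the on-disk Arrow format.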
Hello, I am also working on reproducing the training results, but I found that the 'train.json' file on HuggingFace cannot be directly used with make_dataset.py. Could you share the processed train.json file? Many thanks!
Have you solved it yet?