use custom datasets and cache_dir

OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)

https://openrlhf.readthedocs.io/

Apache License 2.0


use custom datasets and cache_dir #259

Closed UbeCc closed 2 months ago

UbeCc commented 3 months ago

Hello! I've been highly impressed by OpenRLHF's accessibility and ease of use. However, I don't know how to use custom settings other than by manually modifying the source code.

I have two questions:

1. How to use my custom dataset that is not in the standard format.
2. How to use my specific cache_dir instead of the default one.

I found that the first case fails with a "Not Implemented" error. If needed, may I contribute support for these two unimplemented features?

wuxibin89 commented 3 months ago

Yes, your contributions are very welcome.

hijkzzz commented 3 months ago

Hi, OpenRLHF supports a custom key name for private JSON datasets (--input_key); see https://github.com/OpenLLMAI/OpenRLHF/blob/b8fda3644733fda1efbfb847fdd0673a11eaf91f/openrlhf/datasets/prompts_dataset.py#L9
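To illustrate what a custom key name means in practice, here is a minimal sketch of a JSON Lines file whose prompt text lives under a non-default key. The helper names (`read_prompts`, `write_demo_file`), the file name, and the key `question` are all hypothetical and chosen for illustration; this does not reproduce the code in prompts_dataset.py.

```python
import json
import tempfile
from pathlib import Path


def write_demo_file(directory):
    """Write a tiny JSON Lines file whose prompt text is stored under 'question'."""
    rows = [
        {"question": "What is RLHF?"},
        {"question": "Explain PPO briefly."},
    ]
    path = Path(directory) / "my_prompts.jsonl"
    path.write_text(
        "\n".join(json.dumps(row) for row in rows) + "\n", encoding="utf-8"
    )
    return path


def read_prompts(path, input_key="input"):
    """Hypothetical helper: collect the text stored under `input_key` in each record."""
    prompts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            prompts.append(record[input_key])
    return prompts


if __name__ == "__main__":
    demo_path = write_demo_file(tempfile.mkdtemp())
    print(read_prompts(demo_path, input_key="question"))
```

With a file shaped like this, the flag mentioned above would presumably be passed as `--input_key question`, assuming the training script in use exposes it.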

UbeCc commented 3 months ago

Got it. Thanks!

catqaq commented 2 months ago

"How to use my specific cache_dir instead of the default one." For cache_dir, we use the Hugging Face datasets API.
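As a rough sketch of what that can look like: the Hugging Face `datasets` library reads the `HF_DATASETS_CACHE` environment variable when it is imported, and `load_dataset` also accepts a `cache_dir` argument. The helper names below (`configure_hf_cache`, `load_with_cache`) are hypothetical, and whether OpenRLHF forwards a `cache_dir` to `load_dataset` is not stated in this thread, so the environment variable is the approach that needs no source changes.

```python
import os
import tempfile
from pathlib import Path


def configure_hf_cache(cache_root):
    """Point the Hugging Face datasets cache at `cache_root`.

    Call this before `datasets` is imported, because the library reads
    HF_DATASETS_CACHE at import time.
    """
    path = Path(cache_root).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["HF_DATASETS_CACHE"] = str(path)
    return path


def load_with_cache(name_or_path, cache_dir):
    """Hypothetical wrapper: pass an explicit cache_dir to load_dataset."""
    # Imported lazily so the environment variable above is already set.
    from datasets import load_dataset

    return load_dataset(name_or_path, cache_dir=str(cache_dir))


if __name__ == "__main__":
    cache = configure_hf_cache(os.path.join(tempfile.mkdtemp(), "hf_cache"))
    print("datasets cache directory:", cache)
```

The same effect can be had without any Python by exporting `HF_DATASETS_CACHE` in the shell before launching training.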

For custom data, it is best to do some necessary preprocessing to align the format. In the future we may add a little documentation to explain how to align to a unified format or consider adding more flexibility to our data processing module, but it is not a high priority.
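A minimal sketch of that kind of preprocessing, under stated assumptions: the source schema (`instruction`, `context`, `answer`) and the target keys (`prompt`, `response`) are hypothetical placeholders, not OpenRLHF's documented format. Substitute whatever keys your training script expects.

```python
import json


def align_record(raw):
    """Map one custom record onto a flat prompt/response pair (assumed target keys)."""
    prompt = raw["instruction"].strip()
    context = raw.get("context", "").strip()
    if context:
        # Append optional context to the prompt, separated by a blank line.
        prompt = f"{prompt}\n\n{context}"
    return {"prompt": prompt, "response": raw["answer"].strip()}


def convert_file(src_path, dst_path):
    """Convert a JSON Lines file record by record; return the number of rows written."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            aligned = align_record(json.loads(line))
            dst.write(json.dumps(aligned, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Running a one-off conversion like this keeps the training code untouched; the converted file can then be loaded like any other JSON dataset.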

Feel free to reopen it~
