Yes! You should use LLMPool:
https://distilabel.argilla.io/latest/technical-reference/llms/#processllm-and-llmpool
There are some examples there, but let us know if you have any doubts.
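A minimal sketch of the idea, assuming two hypothetical loader functions and OpenAILLM as the underlying model (swap in whichever distilabel LLM you prefer; the API key here is expected via the environment):

from distilabel.llm import LLM, LLMPool, ProcessLLM
from distilabel.tasks import Task, TextGenerationTask

def load_llm_a(task: Task) -> LLM:
    # Runs inside the subprocess spawned by ProcessLLM; assumes
    # OPENAI_API_KEY is set in the environment
    from distilabel.llm import OpenAILLM
    return OpenAILLM(model="gpt-3.5-turbo", task=task)

def load_llm_b(task: Task) -> LLM:
    from distilabel.llm import OpenAILLM
    return OpenAILLM(model="gpt-4", task=task)

# The pool distributes the requested generations across both LLMs,
# each running in its own process
pool = LLMPool(
    llms=[
        ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_llm_a),
        ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_llm_b),
    ]
)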
Thank you!! I'm also having this error when I changed the generations from 2 to 3.
Hi @drewskidang! Apparently that issue happens because during the FeedbackDataset
creation in Argilla those keys are not created, but they are then present on the records, so it fails while trying to add the suggestions for them. Could you please send me a script to reproduce it? Thanks in advance 🤗
Thank you... sorry, but I can't find the notebook. Do you have an example of uploading custom datasets?
Yes, indeed: once the dataset has been generated via Pipeline.generate, you only need to_argilla to convert the datasets.Dataset into an argilla.FeedbackDataset, and then push_to_argilla to upload it to Argilla.
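Roughly, the full flow looks like this (a sketch assuming the preference_pipeline and instructions_dataset from the notebook, a running Argilla instance, and placeholder credentials, dataset name, and workspace):

import argilla as rg

# Connect to a running Argilla instance (placeholder URL and API key)
rg.init(api_url="http://localhost:6900", api_key="owner.apikey")

# 1. Generate the dataset with the distilabel pipeline
preference_dataset = preference_pipeline.generate(
    instructions_dataset,
    num_generations=2,
    batch_size=8,
)

# 2. Convert the generated datasets.Dataset into an argilla.FeedbackDataset
rg_dataset = preference_dataset.to_argilla()

# 3. Push it to Argilla for annotation (placeholder name and workspace)
rg_dataset.push_to_argilla(name="preference-dataset", workspace="admin")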
@alvarobartt I mean I have my own custom dataset that's already made.
Oh fair, did you upload it to the Hugging Face Hub or somewhere else? Also, what did you mean by "I'm also having this error when I changed the generations from 2 to 3"?
I uploaded the datasets to the Hugging Face Hub; I also have private JSONL files that I would like to annotate. I was following this example but changed the code below:
preference_dataset = preference_pipeline.generate(
    instructions_dataset,  # type: ignore
    num_generations=2,  # changing this to 3 raises the error
    batch_size=8,
    display_progress_bar=True,
)
Would it be possible to get a Fireworks AI integration as well?
"I uploaded the datasets to the Hugging Face Hub; I also have private JSONL files that I would like to annotate."
@drewskidang reusing your dataset should be relatively straightforward: create a Hugging Face Dataset object and prepare the data in the format expected by the task in the distilabel Pipeline.
For example, if you want to use the PreferenceTask (for rating generations), you should create or rename a column as generations containing a list of your LLM responses (the length of that list should match the num_generations argument when running pipeline.generate()).
If you can share pseudo-code or fake dataset examples and what you'd like to achieve, we can guide you through it.
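For instance, a rough sketch of preparing such a dataset could look like this (the rows and the "input" column name are illustrative assumptions; only the generations column is taken from the description above, so double-check the expected inputs of the task you pick):

from datasets import Dataset

# Hypothetical records: "input" holds the instruction/prompt and "generations"
# holds the list of responses to rate (its length must match num_generations)
data = {
    "input": [
        "Explain the difference between a list and a tuple in Python.",
        "Summarize what a preference dataset is in one sentence.",
    ],
    "generations": [
        ["First model's answer...", "Second model's answer..."],
        ["First model's answer...", "Second model's answer..."],
    ],
}

custom_dataset = Dataset.from_dict(data)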
Sorry, I have a question about whether the setup is right. I'm trying to use two models for the preference dataset:
from distilabel.llm import LLM, LLMPool, ProcessLLM, TogetherInferenceLLM
from distilabel.pipeline import pipeline
from distilabel.tasks import Task, TextGenerationTask, UltraFeedbackTask


def load_yi(task: Task) -> LLM:
    # Loaded inside the subprocess spawned by ProcessLLM
    from distilabel.llm import TogetherInferenceLLM

    return TogetherInferenceLLM(
        model="zero-one-ai/Yi-34B-Chat",
        api_key="",
        task=task,
        num_threads=4,
    )


def load_together(task: Task) -> LLM:
    from distilabel.llm import TogetherInferenceLLM

    return TogetherInferenceLLM(
        model="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        api_key="",
        max_new_tokens=1048,
        task=task,
        num_threads=4,
    )


# Pool running both generators, each in its own process
pool = LLMPool(
    llms=[
        ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_yi),
        ProcessLLM(task=TextGenerationTask(), load_llm_fn=load_together),
    ]
)

# Labeller that rates the pooled generations with UltraFeedback
preference_labeller = TogetherInferenceLLM(
    model="snorkelai/Snorkel-Mistral-PairRM-DPO",
    api_key="",
    task=UltraFeedbackTask.for_instruction_following(),
    num_threads=8,
    max_new_tokens=512,
)

preference_pipeline = pipeline(
    "preference",
    "instruction-following",
    generator=pool,
    labeller=preference_labeller,
    temperature=0.0,
)
Hi @drewskidang, sorry for not replying earlier! We're about to release distilabel 1.0.0,
and the API will change a bit, so we're closing issues related to the old version. Feel free to reopen the issue if you consider it necessary.
Can we use more than one model at once for the preference dataset generation?