Closed riccardo-unipg closed 1 month ago
@riccardo-unipg Could you please upgrade to 1.3.1 and let us know the results? There was a bug in OpenAILLM where it did not use the timeout argument that was fixed recently.
Hi @riccardo-unipg I tried running the pipeline slightly modified (used InferenceEndpoints
instead of the OpenAILLM
, and loaded some dummy data):
from distilabel.llms import OpenAILLM, InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EvolInstruct
from distilabel.steps import LoadDataFromFileSystem, LoadDataFromDicts
from pathlib import Path
import os
os.environ["OPENAI_API_KEY"] = "EMPTY"
with Pipeline(name="DEITA") as pipeline:
loader = LoadDataFromDicts(
data=[
{"instruction": "Tell me a joke."},
] * 10,
batch_size=2
)
# loader = LoadDataFromFileSystem(
# data_files="./dataset/faq_dataset_cleaned.csv",
# output_mappings={"Domanda": "instruction"},
# batch_size=16
# )
llm = InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3.1-70B-Instruct"
)
# llm = OpenAILLM(
# model="meta-llama/Meta-Llama-3.1-70B-Instruct",
# base_url="http://172.18.21.136:8000/v1",
# timeout=15000,
# )
evol_instruction_complexity = EvolInstruct(
name="evol_instruction_complexity",
llm=llm,
num_evolutions=4,
store_evolutions=True,
generate_answers=True,
include_original_instruction=False,
input_batch_size=8
)
loader.connect(evol_instruction_complexity)
if __name__ == "__main__":
distiset = pipeline.dry_run(
# distiset = pipeline.run(
parameters={
evol_instruction_complexity.name: {
"llm": {
"generation_kwargs": {
"max_new_tokens": 1024,
"temperature": 0.8,
"top_p": 0.8
}
},
}
},
# use_cache=False,
)
distiset.save_to_disk(
Path.home() / "Downloads/my-dataset-test",
save_card=True,
save_pipeline_config=True,
save_pipeline_log=True
)
These are the logs for the dry_run
instead of the full run:
[08/08/24 09:00:04] INFO ['distilabel.pipeline'] 🌵 Dry run mode base.py:375
INFO ['distilabel.pipeline'] 📝 Pipeline data will be written to '/Users/agus/.cache/distilabel/pipelines/DEITA/dcff0ae8e0a134b9ba0673a824cbc5582aca2b71/data' base.py:696
INFO ['distilabel.pipeline'] ⌛ The steps of the pipeline will be loaded in stages: base.py:705
* Stage 0: ['load_data_from_dicts_0', 'evol_instruction_complexity']
[08/08/24 09:00:05] INFO ['distilabel.pipeline'] ⏳ Waiting for all the steps of stage 0 to load... base.py:918
[08/08/24 09:00:07] WARNING ['distilabel.llm.meta-llama/Meta-Llama-3.1-70B-Instruct'] Since the `base_url=https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-70B-Instruct` is available and either one of inference_endpoints.py:190
`model_id` or `endpoint_name` is also provided, the `base_url` will either be ignored or overwritten with the one generated from either of those args, for serverless or dedicated inference endpoints,
respectively.
[08/08/24 09:00:08] INFO ['distilabel.pipeline'] ⏳ Steps from stage 0 loaded: 2/2 base.py:950
* 'load_data_from_dicts_0' replicas: 1/1
* 'evol_instruction_complexity' replicas: 1/1
INFO ['distilabel.pipeline'] ✅ All the steps from stage 0 have been loaded! base.py:954
INFO ['distilabel.step.load_data_from_dicts_0'] 🧬 Starting yielding batches from generator step 'load_data_from_dicts_0'. Offset: 0 step_wrapper.py:167
INFO ['distilabel.step.load_data_from_dicts_0'] 📨 Step 'load_data_from_dicts_0' sending batch 0 to output queue step_wrapper.py:274
INFO ['distilabel.step.load_data_from_dicts_0'] 🏁 Finished running step 'load_data_from_dicts_0' (replica ID: 0) step_wrapper.py:127
INFO ['distilabel.step.evol_instruction_complexity'] 📦 Processing batch 0 in 'evol_instruction_complexity' (replica ID: 0) step_wrapper.py:217
[08/08/24 09:00:15] INFO ['distilabel.step.evol_instruction_complexity'] 🔄 Ran iteration 0 evolving 1 instructions! base.py:306
[08/08/24 09:00:22] INFO ['distilabel.step.evol_instruction_complexity'] 🔄 Ran iteration 1 evolving 1 instructions! base.py:306
[08/08/24 09:00:34] INFO ['distilabel.step.evol_instruction_complexity'] 🔄 Ran iteration 2 evolving 1 instructions! base.py:306
[08/08/24 09:00:50] INFO ['distilabel.step.evol_instruction_complexity'] 🔄 Ran iteration 3 evolving 1 instructions! base.py:306
INFO ['distilabel.step.evol_instruction_complexity'] 🎉 Finished evolving 1 instructions! base.py:372
INFO ['distilabel.step.evol_instruction_complexity'] 🧠 Generating answers for the 1 evolved instructions! base.py:377
[08/08/24 09:01:14] INFO ['distilabel.step.evol_instruction_complexity'] 🎉 Finished generating answers for the 1 evolved instructions! base.py:383
INFO ['distilabel.step.evol_instruction_complexity'] 📨 Step 'evol_instruction_complexity' sending batch 0 to output queue step_wrapper.py:274
INFO ['distilabel.step.evol_instruction_complexity'] 🏁 Finished running step 'evol_instruction_complexity' (replica ID: 0) step_wrapper.py:127
Generating train split: 1 examples [00:00, 827.28 examples/s]
Saving the dataset (1/1 shards): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 558.05 examples/s]
I cannot see where the ValueError
comes from, so it's a bit hard to say what's happening. Could you send the full error trace? Also, could you check the model that runs in the API you are calling works as expected? Maybe something failed on that end, but I cannot see it from these logs
@riccardo-unipg Could you please upgrade to 1.3.1 and let us know the results? There was a bug in OpenAILLM where it did not use the timeout argument that was fixed recently.
it's running now, I'll let you know as soon as it finishes, thank you very much.
Hi @riccardo-unipg I tried running the pipeline slightly modified (used
InferenceEndpoints
instead of theOpenAILLM
, and loaded some dummy data):from distilabel.llms import OpenAILLM, InferenceEndpointsLLM from distilabel.pipeline import Pipeline from distilabel.steps.tasks import EvolInstruct from distilabel.steps import LoadDataFromFileSystem, LoadDataFromDicts from pathlib import Path import os os.environ["OPENAI_API_KEY"] = "EMPTY" with Pipeline(name="DEITA") as pipeline: loader = LoadDataFromDicts( data=[ {"instruction": "Tell me a joke."}, ] * 10, batch_size=2 ) # loader = LoadDataFromFileSystem( # data_files="./dataset/faq_dataset_cleaned.csv", # output_mappings={"Domanda": "instruction"}, # batch_size=16 # ) llm = InferenceEndpointsLLM( model_id="meta-llama/Meta-Llama-3.1-70B-Instruct" ) # llm = OpenAILLM( # model="meta-llama/Meta-Llama-3.1-70B-Instruct", # base_url="http://172.18.21.136:8000/v1", # timeout=15000, # ) evol_instruction_complexity = EvolInstruct( name="evol_instruction_complexity", llm=llm, num_evolutions=4, store_evolutions=True, generate_answers=True, include_original_instruction=False, input_batch_size=8 ) loader.connect(evol_instruction_complexity) if __name__ == "__main__": distiset = pipeline.dry_run( # distiset = pipeline.run( parameters={ evol_instruction_complexity.name: { "llm": { "generation_kwargs": { "max_new_tokens": 1024, "temperature": 0.8, "top_p": 0.8 } }, } }, # use_cache=False, ) distiset.save_to_disk( Path.home() / "Downloads/my-dataset-test", save_card=True, save_pipeline_config=True, save_pipeline_log=True )
These are the logs for the
dry_run
instead of the full run:[08/08/24 09:00:04] INFO ['distilabel.pipeline'] 🌵 Dry run mode base.py:375 INFO ['distilabel.pipeline'] 📝 Pipeline data will be written to '/Users/agus/.cache/distilabel/pipelines/DEITA/dcff0ae8e0a134b9ba0673a824cbc5582aca2b71/data' base.py:696 INFO ['distilabel.pipeline'] ⌛ The steps of the pipeline will be loaded in stages: base.py:705 * Stage 0: ['load_data_from_dicts_0', 'evol_instruction_complexity'] [08/08/24 09:00:05] INFO ['distilabel.pipeline'] ⏳ Waiting for all the steps of stage 0 to load... base.py:918 [08/08/24 09:00:07] WARNING ['distilabel.llm.meta-llama/Meta-Llama-3.1-70B-Instruct'] Since the `base_url=https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-70B-Instruct` is available and either one of inference_endpoints.py:190 `model_id` or `endpoint_name` is also provided, the `base_url` will either be ignored or overwritten with the one generated from either of those args, for serverless or dedicated inference endpoints, respectively. [08/08/24 09:00:08] INFO ['distilabel.pipeline'] ⏳ Steps from stage 0 loaded: 2/2 base.py:950 * 'load_data_from_dicts_0' replicas: 1/1 * 'evol_instruction_complexity' replicas: 1/1 INFO ['distilabel.pipeline'] ✅ All the steps from stage 0 have been loaded! base.py:954 INFO ['distilabel.step.load_data_from_dicts_0'] 🧬 Starting yielding batches from generator step 'load_data_from_dicts_0'. Offset: 0 step_wrapper.py:167 INFO ['distilabel.step.load_data_from_dicts_0'] 📨 Step 'load_data_from_dicts_0' sending batch 0 to output queue step_wrapper.py:274 INFO ['distilabel.step.load_data_from_dicts_0'] 🏁 Finished running step 'load_data_from_dicts_0' (replica ID: 0) step_wrapper.py:127 INFO ['distilabel.step.evol_instruction_complexity'] 📦 Processing batch 0 in 'evol_instruction_complexity' (replica ID: 0) step_wrapper.py:217 [08/08/24 09:00:15] INFO ['distilabel.step.evol_instruction_complexity'] 🔄 Ran iteration 0 evolving 1 instructions! base.py:306 [08/08/24 09:00:22] INFO ['distilabel.step.evol_instruction_complexity'] 🔄 Ran iteration 1 evolving 1 instructions! base.py:306 [08/08/24 09:00:34] INFO ['distilabel.step.evol_instruction_complexity'] 🔄 Ran iteration 2 evolving 1 instructions! base.py:306 [08/08/24 09:00:50] INFO ['distilabel.step.evol_instruction_complexity'] 🔄 Ran iteration 3 evolving 1 instructions! base.py:306 INFO ['distilabel.step.evol_instruction_complexity'] 🎉 Finished evolving 1 instructions! base.py:372 INFO ['distilabel.step.evol_instruction_complexity'] 🧠 Generating answers for the 1 evolved instructions! base.py:377 [08/08/24 09:01:14] INFO ['distilabel.step.evol_instruction_complexity'] 🎉 Finished generating answers for the 1 evolved instructions! base.py:383 INFO ['distilabel.step.evol_instruction_complexity'] 📨 Step 'evol_instruction_complexity' sending batch 0 to output queue step_wrapper.py:274 INFO ['distilabel.step.evol_instruction_complexity'] 🏁 Finished running step 'evol_instruction_complexity' (replica ID: 0) step_wrapper.py:127 Generating train split: 1 examples [00:00, 827.28 examples/s] Saving the dataset (1/1 shards): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 558.05 examples/s]
I cannot see where the
ValueError
comes from, so it's a bit hard to say what's happening. Could you send the full error trace? Also, could you check the model that runs in the API you are calling works as expected? Maybe something failed on that end, but I cannot see it from these logs
Hi, thank you very much for your help. If I pass a few samples and the process is short (like in your example) there are no problems, everything works properly. The problem is when it has to produce a lot of data and it runs for hours. Yes the exposed model works well. Now I'm trying to see if with version 1.3.1 I can solve it, if not I'll copy the whole output here but it's really very long.
Thanks again.
Perfect, let's see it the timeout bug was the cause, thanks!
Thank you all, now it works (with 1.3.1 version), consequently the function save_to_disk() don't work with this huge dataset:
distiset.save_to_disk(
"my-dataset",
save_card=True,
save_pipeline_config=True,
save_pipeline_log=True
)
but i try to resolve pushing it to Hugging face repository. Thank you again.
nice to hear that it works! not sure about the problem of saving the dataset to disk, haven't tested it with big datasets, but the functionality should be tested from the datasets
side. Closing the issue for the moment, feel free to open another if necessary 😄
Describe the bug When I start the pipeline, in which there is the EvolInstruct task, it starts to iterate and generate evolved instructs and it works for many cycles, for about an hour it worked, then suddenly it stops and gives the following error:
[08/07/24 19:34:31] WARNING ['distilabel.step.evol_instruction_complexity'] ⚠️ Processing batch step_wrapper.py:240 13 with step 'evol_instruction_complexity' failed. Sending empty
batch filled with
None
s...Exception in thread Thread-10 (_output_queue_loop): Traceback (most recent call last): File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/threading.py", line 1045, in _bootstrap_inner WARNING ['distilabel.step.evol_instruction_complexity'] Subprocess step_wrapper.py:244 traceback:
File
"/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1
1/site-packages/httpcore/_backends/anyio.py", line 32, in read
with map_exceptions(exc_map):
File
"/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1
1/contextlib.py", line 158, in exit
self.gen.throw(typ, value, traceback)
File
"/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1
1/site-packages/httpcore/_exceptions.py", line 14, in
map_exceptions
raise to_exc(exc) from exc
httpcore.ReadTimeout
it continue for long, and finally say: ValueError: Target schema's field names are not matching the table's field names: ['instruction', 'Risposta', 'evolved_instructions', 'model_name', 'answers'], ['instruction', 'Risposta', 'evolved_instructions', 'answers', 'model_name']
To Reproduce Code to reproduce
Expected behaviour A clear and concise description of what you expected to happen.
Screenshots If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
Additional context Add any other context about the problem here.