EvolInstruct pipeline bug

riccardo-unipg commented 1 month ago

Describe the bug

When I start a pipeline that contains the EvolInstruct task, it iterates and generates evolved instructions correctly for many cycles (it worked for about an hour), then it suddenly stops with the following error:

[08/07/24 19:34:31] WARNING  ['distilabel.step.evol_instruction_complexity'] ⚠️ Processing batch 13 with step 'evol_instruction_complexity' failed. Sending empty batch filled with Nones...    step_wrapper.py:240
Exception in thread Thread-10 (_output_queue_loop):
Traceback (most recent call last):
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
                    WARNING  ['distilabel.step.evol_instruction_complexity'] Subprocess traceback:    step_wrapper.py:244

                         Traceback (most recent call last):                                                    
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpx/_transports/default.py", line 69, in                            
                         map_httpcore_exceptions                                                               
                             yield                                                                             
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpx/_transports/default.py", line 373, in                           
                         handle_async_request                                                                  
                             resp = await self._pool.handle_async_request(req)                                 
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                 
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpcore/_async/connection_pool.py", line 216, in                     
                         handle_async_request                                                                  
                             raise exc from None                                                               
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpcore/_async/connection_pool.py", line 196, in                     
                         handle_async_request                                                                  
                             response = await connection.handle_async_request(                                 
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                 
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpcore/_async/connection.py", line 101, in                          
                         handle_async_request                                                                  
                             return await self._connection.handle_async_request(request)                       
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                       
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpcore/_async/http11.py", line 143, in                              
                         handle_async_request                                                                  
                             raise exc                                                                         
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpcore/_async/http11.py", line 113, in                              
                         handle_async_request                                                                  
                             ) = await self._receive_response_headers(**kwargs)                                
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpcore/_async/http11.py", line 186, in                              
                         _receive_response_headers                                                             
                             event = await self._receive_event(timeout=timeout)                                
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpcore/_async/http11.py", line 224, in                              
                         _receive_event                                                                        
                             data = await self._network_stream.read(                                           
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

                           File
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1
                         1/site-packages/httpcore/_backends/anyio.py", line 32, in read
                             with map_exceptions(exc_map):
                           File
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1
                         1/contextlib.py", line 158, in __exit__
                             self.gen.throw(typ, value, traceback)
                           File
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1
                         1/site-packages/httpcore/_exceptions.py", line 14, in
                         map_exceptions
                             raise to_exc(exc) from exc
                         httpcore.ReadTimeout

                         The above exception was the direct cause of the following                             
                         exception:                                                                            

                         Traceback (most recent call last):                                                    
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/openai/_base_client.py", line 1558, in _request                       
                             response = await self._client.send(                                               
                                        ^^^^^^^^^^^^^^^^^^^^^^^^                                               
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpx/_client.py", line 1661, in send                                 
                             response = await self._send_handling_auth(                                        
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                        
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpx/_client.py", line 1689, in                                      
                         _send_handling_auth                                                                   
                             response = await self._send_handling_redirects(                                   
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                   
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpx/_client.py", line 1726, in                                      
                         _send_handling_redirects                                                              
                             response = await self._send_single_request(request)                               
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                               
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpx/_client.py", line 1763, in                                      
                         _send_single_request                                                                  
                             response = await transport.handle_async_request(request)                          
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                          
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpx/_transports/default.py", line 372, in                           
                         handle_async_request                                                                  
                             with map_httpcore_exceptions():                                                   
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/contextlib.py", line 158, in __exit__                                               
                             self.gen.throw(typ, value, traceback)                                             
                           File                                                                                
                         "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.1                    
                         1/site-packages/httpx/_transports/default.py", line 86, in                            
                         map_httpcore_exceptions                                                               
                             raise mapped_exc(message) from exc                                                
                         httpx.ReadTimeout                                                                     

                         The above exception was the direct cause of the following                             
                         exception:                                                                            

                         Traceback (most recent call last):                                                    
                           File

The output continues like this for a long time and finally ends with: ValueError: Target schema's field names are not matching the table's field names: ['instruction', 'Risposta', 'evolved_instructions', 'model_name', 'answers'], ['instruction', 'Risposta', 'evolved_instructions', 'answers', 'model_name']
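The two lists in the ValueError contain the same five names; only the positions of 'model_name' and 'answers' differ. As a rough illustration of that kind of mismatch, the sketch below normalizes rows to one key order before they would be turned into a table. It is not distilabel's or pyarrow's code: `reorder_rows` is a hypothetical helper and the sample row is made up.

```python
from typing import Any


def reorder_rows(rows: list[dict[str, Any]], columns: list[str]) -> list[dict[str, Any]]:
    """Return copies of `rows` whose keys follow the order given in `columns`.

    Hypothetical helper, for illustration only. A row with a different *set*
    of keys raises ValueError, because reordering cannot fix that case.
    """
    expected = set(columns)
    fixed = []
    for row in rows:
        if set(row) != expected:
            raise ValueError(f"Row keys {sorted(row)} do not match {sorted(expected)}")
        # Dicts preserve insertion order, so rebuilding the row fixes the order.
        fixed.append({name: row[name] for name in columns})
    return fixed


# Column order taken from the first list in the error message.
target = ["instruction", "Risposta", "evolved_instructions", "model_name", "answers"]

# Made-up row that uses the order of the second list in the error message.
batch = [
    {
        "instruction": "q",
        "Risposta": "a",
        "evolved_instructions": ["q2"],
        "answers": ["a2"],
        "model_name": "m",
    }
]

aligned = reorder_rows(batch, target)
```

Whether reordering alone would avoid the error in this pipeline is an assumption; the sketch only shows how two rows with identical fields can still disagree on field order.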

To Reproduce

Code to reproduce:

from distilabel.llms import TransformersLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import ConversationTemplate, DeitaFiltering, ExpandColumns, LoadDataFromHub
from distilabel.steps.tasks import ComplexityScorer, EvolInstruct, EvolQuality, GenerateEmbeddings, QualityScorer
import pandas as pd
import os
os.environ["OPENAI_API_KEY"] = "EMPTY"
from distilabel.steps import LoadDataFromFileSystem

pipeline = Pipeline(name="DEITA")
loader = LoadDataFromFileSystem(data_files="./dataset/faq_dataset_cleaned.csv", pipeline=pipeline, output_mappings={"Domanda": "instruction"}, batch_size=16)
loader.load()
from distilabel.llms import OpenAILLM

llm = OpenAILLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    base_url="http://172.18.21.136:8000/v1",
    timeout=15000,
    generation_kwargs={
        "max_new_tokens": 1024,
        "temperature": 0.8,
        "top_p": 0.8
    }
)
evol_instruction_complexity = EvolInstruct(
    name="evol_instruction_complexity",
    llm=llm,
    num_evolutions=4,
    store_evolutions=True,
    generate_answers=True,
    include_original_instruction=False,
    pipeline=pipeline,
    input_batch_size=8
)
loader.connect(evol_instruction_complexity)
# response_quality_scorer.connect(expand_evolved_responses)

distiset = pipeline.run(
    parameters={
        "load_data_from_file_system_0": {
            "repo_id": "./dataset/faq_dataset_cleaned.csv",
            "batch_size":16
        },

        "evol_instruction_complexity": {
            "llm": {"generation_kwargs": {"max_new_tokens": 1024, "temperature": 0.8, "top_p":0.8}},
            "input_batch_size":8
        }
    },
    use_cache=False,
)
distiset.save_to_disk(
    "my-dataset",
    save_card=True,
    save_pipeline_config=True,
    save_pipeline_log=True
)

Desktop (please complete the following information):

Package version: Distilabel 1.3.0 (with 1.2.4 same problem)
Python version: 3.11.9

ashim-mahara commented 1 month ago

@riccardo-unipg Could you please upgrade to 1.3.1 and let us know the results? A recently fixed bug caused OpenAILLM to ignore the timeout argument.
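A quick way to confirm which version is actually installed in the environment that runs the pipeline is sketched below. It relies only on the standard library; `parse_release` and `is_at_least` are hypothetical helpers, and the comparison is deliberately simple (pre-release suffixes such as "rc1" are ignored).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def parse_release(version_string: str) -> tuple:
    """Turn '1.3.1' into (1, 3, 1), keeping only the leading digits of each part."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first part with no leading number, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Compare two version strings numerically, part by part."""
    return parse_release(installed) >= parse_release(minimum)


def installed_distilabel_version() -> Optional[str]:
    """Return the installed distilabel version, or None if it is not installed."""
    try:
        return version("distilabel")
    except PackageNotFoundError:
        return None


current = installed_distilabel_version()
if current is None:
    print("distilabel is not installed in this environment")
elif is_at_least(current, "1.3.1"):
    print(f"distilabel {current}: at or above the version suggested in this thread")
else:
    print(f"distilabel {current}: older than 1.3.1, consider upgrading")
```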

plaguss commented 1 month ago

Hi @riccardo-unipg I tried running the pipeline slightly modified (used InferenceEndpoints instead of the OpenAILLM, and loaded some dummy data):

from distilabel.llms import OpenAILLM, InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import EvolInstruct
from distilabel.steps import LoadDataFromFileSystem, LoadDataFromDicts
from pathlib import Path

import os
os.environ["OPENAI_API_KEY"] = "EMPTY"

with Pipeline(name="DEITA") as pipeline:
    loader = LoadDataFromDicts(
        data=[
            {"instruction": "Tell me a joke."},
        ] * 10,
        batch_size=2
    )
    # loader = LoadDataFromFileSystem(
    #     data_files="./dataset/faq_dataset_cleaned.csv",
    #     output_mappings={"Domanda": "instruction"},
    #     batch_size=16
    # )

    llm = InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct"
    )
    # llm = OpenAILLM(
    #     model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    #     base_url="http://172.18.21.136:8000/v1",
    #     timeout=15000,
    # )
    evol_instruction_complexity = EvolInstruct(
        name="evol_instruction_complexity",
        llm=llm,
        num_evolutions=4,
        store_evolutions=True,
        generate_answers=True,
        include_original_instruction=False,
        input_batch_size=8
    )
    loader.connect(evol_instruction_complexity)

if __name__ == "__main__":
    distiset = pipeline.dry_run(
    # distiset = pipeline.run(
        parameters={
            evol_instruction_complexity.name: {
                "llm": {
                    "generation_kwargs": {
                        "max_new_tokens": 1024,
                        "temperature": 0.8,
                        "top_p": 0.8
                    }
                },
            }
        },
        # use_cache=False,
    )

    distiset.save_to_disk(
        Path.home() / "Downloads/my-dataset-test",
        save_card=True,
        save_pipeline_config=True,
        save_pipeline_log=True
    )

These are the logs for the dry_run instead of the full run:

[08/08/24 09:00:04] INFO     ['distilabel.pipeline'] 🌵 Dry run mode                                                                                                                                                                                base.py:375
                    INFO     ['distilabel.pipeline'] 📝 Pipeline data will be written to '/Users/agus/.cache/distilabel/pipelines/DEITA/dcff0ae8e0a134b9ba0673a824cbc5582aca2b71/data'                                                              base.py:696
                    INFO     ['distilabel.pipeline'] ⌛ The steps of the pipeline will be loaded in stages:                                                                                                                                         base.py:705
                              * Stage 0: ['load_data_from_dicts_0', 'evol_instruction_complexity']
[08/08/24 09:00:05] INFO     ['distilabel.pipeline'] ⏳ Waiting for all the steps of stage 0 to load...                                                                                                                                             base.py:918
[08/08/24 09:00:07] WARNING  ['distilabel.llm.meta-llama/Meta-Llama-3.1-70B-Instruct'] Since the `base_url=https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-70B-Instruct` is available and either one of        inference_endpoints.py:190
                             `model_id` or `endpoint_name` is also provided, the `base_url` will either be ignored or overwritten with the one generated from either of those args, for serverless or dedicated inference endpoints,
                             respectively.
[08/08/24 09:00:08] INFO     ['distilabel.pipeline'] ⏳ Steps from stage 0 loaded: 2/2                                                                                                                                                              base.py:950
                              * 'load_data_from_dicts_0' replicas: 1/1
                              * 'evol_instruction_complexity' replicas: 1/1
                    INFO     ['distilabel.pipeline'] ✅ All the steps from stage 0 have been loaded!                                                                                                                                                base.py:954
                    INFO     ['distilabel.step.load_data_from_dicts_0'] 🧬 Starting yielding batches from generator step 'load_data_from_dicts_0'. Offset: 0                                                                                step_wrapper.py:167
                    INFO     ['distilabel.step.load_data_from_dicts_0'] 📨 Step 'load_data_from_dicts_0' sending batch 0 to output queue                                                                                                    step_wrapper.py:274
                    INFO     ['distilabel.step.load_data_from_dicts_0'] 🏁 Finished running step 'load_data_from_dicts_0' (replica ID: 0)                                                                                                   step_wrapper.py:127
                    INFO     ['distilabel.step.evol_instruction_complexity'] 📦 Processing batch 0 in 'evol_instruction_complexity' (replica ID: 0)                                                                                         step_wrapper.py:217
[08/08/24 09:00:15] INFO     ['distilabel.step.evol_instruction_complexity'] 🔄 Ran iteration 0 evolving 1 instructions!                                                                                                                            base.py:306
[08/08/24 09:00:22] INFO     ['distilabel.step.evol_instruction_complexity'] 🔄 Ran iteration 1 evolving 1 instructions!                                                                                                                            base.py:306
[08/08/24 09:00:34] INFO     ['distilabel.step.evol_instruction_complexity'] 🔄 Ran iteration 2 evolving 1 instructions!                                                                                                                            base.py:306
[08/08/24 09:00:50] INFO     ['distilabel.step.evol_instruction_complexity'] 🔄 Ran iteration 3 evolving 1 instructions!                                                                                                                            base.py:306
                    INFO     ['distilabel.step.evol_instruction_complexity'] 🎉 Finished evolving 1 instructions!                                                                                                                                   base.py:372
                    INFO     ['distilabel.step.evol_instruction_complexity'] 🧠 Generating answers for the 1 evolved instructions!                                                                                                                  base.py:377
[08/08/24 09:01:14] INFO     ['distilabel.step.evol_instruction_complexity'] 🎉 Finished generating answers for the 1 evolved instructions!                                                                                                         base.py:383
                    INFO     ['distilabel.step.evol_instruction_complexity'] 📨 Step 'evol_instruction_complexity' sending batch 0 to output queue                                                                                          step_wrapper.py:274
                    INFO     ['distilabel.step.evol_instruction_complexity'] 🏁 Finished running step 'evol_instruction_complexity' (replica ID: 0)                                                                                         step_wrapper.py:127
Generating train split: 1 examples [00:00, 827.28 examples/s]
Saving the dataset (1/1 shards): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 558.05 examples/s]

I cannot see where the ValueError comes from, so it's hard to say what's happening. Could you send the full error trace? Also, could you check that the model behind the API you are calling works as expected? Maybe something failed on that end, but I cannot tell from these logs.

riccardo-unipg commented 1 month ago

@ashim-mahara It's running now with 1.3.1. I'll let you know as soon as it finishes, thank you very much.

riccardo-unipg commented 1 month ago

Hi @plaguss, thank you very much for your help. With only a few samples and a short run (as in your example) there are no problems and everything works properly. The problem appears when the pipeline has to produce a lot of data and runs for hours. Yes, the exposed model works well. I'm now checking whether version 1.3.1 solves it; if not, I'll copy the whole output here, although it is really very long.

Thanks again.

plaguss commented 1 month ago

Perfect, let's see if the timeout bug was the cause, thanks!

riccardo-unipg commented 1 month ago

Thank you all, it works now (with version 1.3.1). However, the function save_to_disk() doesn't work with this huge dataset:

distiset.save_to_disk(
    "my-dataset",
    save_card=True,
    save_pipeline_config=True,
    save_pipeline_log=True
)

but I will try to work around it by pushing the dataset to a Hugging Face repository. Thank you again.
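The workaround mentioned here could look roughly like the sketch below: try to save locally and fall back to the Hub if that fails. `save_or_push` is a hypothetical helper, the repository id is a placeholder, and the broad `except` is used only because the thread does not say which error `save_to_disk()` raised. It assumes the object exposes `save_to_disk(path)` and `push_to_hub(repo_id)`, as a distilabel `Distiset` does.

```python
def save_or_push(distiset, local_path: str, repo_id: str) -> str:
    """Save `distiset` locally; if that fails, push it to the Hugging Face Hub.

    Hypothetical helper. Returns "disk" or "hub" to say which path succeeded.
    """
    try:
        distiset.save_to_disk(local_path)
        return "disk"
    except Exception as exc:  # the concrete error type is unknown in this thread
        print(f"save_to_disk failed ({exc!r}); pushing to the Hub instead")
        distiset.push_to_hub(repo_id)
        return "hub"
```

With the `distiset` returned by `pipeline.run(...)`, a call such as `save_or_push(distiset, "my-dataset", "<user>/<dataset-name>")` would need a valid Hugging Face token in the environment for the fallback to succeed.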

plaguss commented 1 month ago

Nice to hear that it works! I'm not sure about the problem saving the dataset to disk; I haven't tested it with big datasets, but that functionality should be tested from the datasets side. Closing the issue for the moment, feel free to open another if necessary 😄

argilla-io / distilabel

EvolInstruct pipeline bug #864