argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

[BUG] UltraFeedback pipeline doesn't work #879

Closed riccardo-unipg closed 3 weeks ago

riccardo-unipg commented 1 month ago

Describe the bug
I want to use the UltraFeedback task in a pipeline, but I already have the dataset, so the pipeline only loads the dataset and then passes it to the UltraFeedback module. Doing this gives this error:

WARNING ['distilabel.step.ultrafeedback'] ⚠️ Processing batch 12 with step 'ultrafeedback' failed. Sending empty batch filled with Nones... (step_wrapper.py:240)

WARNING ['distilabel.step.ultrafeedback'] Subprocess traceback: (step_wrapper.py:244)

Traceback (most recent call last):
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/pipeline/step_wrapper.py", line 228, in _non_generator_process_loop
    result = next(step.process_applying_mappings(*batch.data))
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/steps/base.py", line 545, in process_applying_mappings
    for output_rows in generator:
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/steps/tasks/base.py", line 198, in process
    outputs = self.llm.generate(
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/llms/base.py", line 357, in generate
    return self.event_loop.run_until_complete(
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/nest_asyncio.py", line 98, in run_until_complete
    return f.result()
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/asyncio/tasks.py", line 279, in __step
    result = coro.throw(exc)
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/llms/base.py", line 327, in _agenerate
    return await asyncio.gather(*tasks)
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/asyncio/tasks.py", line 349, in __wakeup
    future.result()
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/asyncio/tasks.py", line 277, in __step
    result = coro.send(None)
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/llms/openai.py", line 268, in agenerate
    completion = await self._aclient.chat.completions.create(**kwargs)  # type: ignore
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/openai/resources/chat/completions.py", line 1295, in create
    return await self._post(
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/openai/_base_client.py", line 1826, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/openai/_base_client.py", line 1519, in request
    return await self._request(
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/openai/_base_client.py", line 1620, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 24288 tokens. However, you requested 74604 tokens (73580 in the messages, 1024 in the completion). Please reduce the length of the messages or completion.", 'type': 'BadRequestError', 'param': None, 'code': 400}

To Reproduce
Code to reproduce:

from distilabel.llms import TransformersLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import ConversationTemplate, DeitaFiltering, ExpandColumns, LoadDataFromHub
from distilabel.steps.tasks import ComplexityScorer, EvolInstruct, EvolQuality, GenerateEmbeddings, QualityScorer, UltraFeedback
import pandas as pd
from huggingface_hub import notebook_login

import os
os.environ["OPENAI_API_KEY"] = "EMPTY"

from distilabel.steps import LoadDataFromFileSystem

pipeline = Pipeline(name="DEITA")

loader = LoadDataFromFileSystem(data_files="./dataset/distilabel_dataset.jsonl", pipeline=pipeline, batch_size=16)
loader.load()

from distilabel.llms import OpenAILLM

llm = OpenAILLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    base_url="http://172.18.21.136:8000/v1",
    timeout=15000,
    generation_kwargs={
        "max_new_tokens": 1024,
        "temperature": 0.8,
        "top_p": 0.8
    }
)

# evol_instruction_complexity = EvolInstruct(
#     name="evol_instruction_complexity",
#     llm=llm,
#     num_evolutions=4,
#     store_evolutions=True,
#     generate_answers=True,
#     include_original_instruction=False,
#     pipeline=pipeline,
#     input_batch_size=8
# )

ultrafeedback = UltraFeedback(
    name="ultrafeedback",
    llm=llm,
    aspect="truthfulness",
    input_mappings={"instruction": "evolved_instructions", "generations": "answers"},
    output_mappings={"model_name": "ultrafeedback_model"},
    pipeline=pipeline,
    input_batch_size=8
)

expand_evolved_instructions = ExpandColumns(
    name="expand_evolved_instructions",
    columns=['evolved_instructions', 'answers', 'types', 'ratings', 'rationales-for-ratings'],
    pipeline=pipeline,
)

# loader.connect(evol_instruction_complexity)
# evol_instruction_complexity.connect(ultrafeedback)
loader.connect(ultrafeedback)
ultrafeedback.connect(expand_evolved_instructions)

distiset = pipeline.run(
    parameters={
        "load_data_from_file_system_0": {
            "repo_id": "./dataset/distilabel_dataset.jsonl",
            "batch_size":16
        },

        # "evol_instruction_complexity": {
        #     "llm": {"generation_kwargs": {"max_new_tokens": 1024, "temperature": 0.8, "top_p":0.8}},
        #     "input_batch_size":8
        # },

        "ultrafeedback": {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 1024,
                    "temperature": 0.8,
                }
            },
            "input_batch_size":8
        },

    },
    use_cache=False,
)

Expected behaviour
Notice the commented parts in the code: if I use UltraFeedback with those parts enabled, for example right after the EvolInstruct module, it works very well. All I did was comment out the EvolInstruct module and connect the loader directly to UltraFeedback (because I already have the generated dataset), and it no longer works, failing with the prompt-length error above. I don't understand why.


plaguss commented 1 month ago

Hi @riccardo-unipg, I cannot see the examples in the distilabel_dataset.jsonl file. The error seems related to the length of the messages you have stored in the file. Maybe when you use EvolInstruct the length is controlled there; I will review it to see if that's the case. In the meantime, you could limit the length of the column you are using, for example by using your model's tokenizer to cap the number of tokens per example, so they fit in the context window.
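A minimal sketch of that truncation idea (all names below are hypothetical; in practice you would count tokens with your model's actual tokenizer, e.g. `AutoTokenizer` from `transformers`, rather than the naive whitespace split used as a stand-in here):

```python
# Sketch: cap each example's length before feeding it to UltraFeedback, so
# that instruction + generations fit in the model's context window. The
# whitespace split below is a stand-in for tokenizer.encode(text).

MAX_INPUT_TOKENS = 24288 - 1024  # context window minus max_new_tokens

def truncate_to_budget(text: str, budget: int) -> str:
    """Keep at most `budget` (approximate) tokens of `text`."""
    tokens = text.split()  # stand-in for real tokenization
    return " ".join(tokens[:budget])

def shrink_row(row: dict, budget: int = MAX_INPUT_TOKENS) -> dict:
    """Split the token budget between the instruction and its generations."""
    n_parts = 1 + len(row["generations"])
    per_part = budget // n_parts
    return {
        "instruction": truncate_to_budget(row["instruction"], per_part),
        "generations": [truncate_to_budget(g, per_part) for g in row["generations"]],
    }

# A deliberately oversized row: 50000 "tokens" in each field.
row = {"instruction": "word " * 50000, "generations": ["word " * 50000]}
short = shrink_row(row)
print(len(short["instruction"].split()))  # 11632, i.e. (24288 - 1024) // 2
```

You would run something like this over every row of the .jsonl file before loading it into the pipeline, so no single batch can exceed the model's context window.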

Additionally, the UltraFeedback task expects the following format: https://distilabel.argilla.io/dev/components-gallery/tasks/ultrafeedback/#examples. Could you double-check that your dataset adheres to that format? (Though given the error, that doesn't seem to be the case.)

riccardo-unipg commented 1 month ago

This model's maximum context length is 24288 tokens

Hi, thanks for the reply. I got confused and marked this issue as completed by accident :(. Anyway, regarding the format that UltraFeedback wants, I made sure it was the right one through the line `input_mappings={"instruction": "evolved_instructions", "generations": "answers"}`; in fact, it doesn't give me any error of that type.

Regarding the length of the context, what amazes me is that the model's maximum context length is 24288 tokens, which is really big; it would never be filled by one evolved_instructions and one answers. That's what I can't understand. I also tried reducing the batch size to 1, but it always gives me the same error.

gabrielmbmb commented 1 month ago

Hi @riccardo-unipg, can you try removing the loader.load() line of code? As you're not using LoadDataFromFileSystem as a standalone component, you shouldn't call load manually; it will be called automatically by the Pipeline. Also, check the content of the batches being yielded by the LoadDataFromFileSystem step when used in combination with your ./dataset/distilabel_dataset.jsonl file. It could be that it's not reading the data correctly and is returning a super long text because it's the whole content of the file.
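A quick way to sanity-check what the loader would read, sketched against a tiny stand-in file (the real path from the report, ./dataset/distilabel_dataset.jsonl, would go in its place):

```python
import json
import os
import tempfile

# Stand-in for ./dataset/distilabel_dataset.jsonl: one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "distilabel_dataset.jsonl")
with open(path, "w") as f:
    f.write(json.dumps({"evolved_instructions": "Q?", "answers": ["A1", "A2"]}) + "\n")

# Read it back the way a JSONL loader would: line by line, one row per line.
with open(path) as f:
    rows = [json.loads(line) for line in f if line.strip()]

# Each row should be a dict, and the column mapped to "generations" should be
# a list of strings; one giant string here would explain the huge prompt.
print(type(rows[0]).__name__, type(rows[0]["answers"]).__name__)
```

If the file parses into one row instead of many, or a mapped column comes back as one long string, that would match the symptom of the whole file being sent as a single prompt.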

riccardo-unipg commented 1 month ago

Hi @riccardo-unipg, can you try removing the loader.load() line of code? As you're not using LoadDataFromFileSystem as a standalone component, you shouldn't call load manually; it will be called automatically by the Pipeline. Also, check the content of the batches yielded by the LoadDataFromFileSystem step when used in combination with your ./dataset/distilabel_dataset.jsonl file. It could be that it's not reading the data correctly and is returning a super long text because it's the whole content of the file.

When I remove `loader.load()` I get: `ValueError: Dataset not loaded yet, you must call load method first.`

riccardo-unipg commented 1 month ago

My question is: does UltraFeedback only work if I use it in a pipeline with an evolving module like EvolInstruct or EvolQuality? Or can I use it as in this setup, with the dataset already prepared?

riccardo-unipg commented 1 month ago

I also tried with LoadDataFromHub() and removed loader.load(), but it gives me the same context length error.

plaguss commented 1 month ago

When I remove `loader.load()` I get: `ValueError: Dataset not loaded yet, you must call load method first.`

Hi @riccardo-unipg, this seems to be a bug; it should be solved in the develop branch, but I will check it. UPDATE -> the bug is solved in develop; using LoadDataFromFileSystem should work normally.

plaguss commented 1 month ago

My question is: does UltraFeedback only work if I use it in a pipeline with an evolving module like EvolInstruct or EvolQuality? Or can I use it as in this setup, with the dataset already prepared?

The UltraFeedback task can work without using the EvolInstruct.

Can you test it outside of the pipeline with a sample row of your dataset, as in this example from the documentation? Update it to the LLM you are using and let's see if it works.

riccardo-unipg commented 1 month ago

When I remove `loader.load()` I get: `ValueError: Dataset not loaded yet, you must call load method first.`

Hi @riccardo-unipg, this seems to be a bug; it should be solved in the develop branch, but I will check it. UPDATE -> the bug is solved in develop; using LoadDataFromFileSystem should work normally.

It still doesn't work; it always gives the same error. I also removed the loader.load() instruction, which no longer gives any errors, but the main error remains, i.e. the context length.

riccardo-unipg commented 1 month ago

My question is: does UltraFeedback only work if I use it in a pipeline with an evolving module like EvolInstruct or EvolQuality? Or can I use it as in this setup, with the dataset already prepared?

The UltraFeedback task can work without using the EvolInstruct.

Can you test it outside of the pipeline with a sample row of your dataset, as in this example from the documentation? Update it to the LLM you are using and let's see if it works.

Yeah, it works.

riccardo-unipg commented 1 month ago

My question is: does UltraFeedback only work if I use it in a pipeline with an evolving module like EvolInstruct or EvolQuality? Or can I use it as in this setup, with the dataset already prepared?

The UltraFeedback task can work without using the EvolInstruct. Can you test it outside of the pipeline with a sample row of your dataset, as in this example from the documentation? Update it to the LLM you are using and let's see if it works.

Yeah, it works.

I tried to run the whole dataset outside the pipeline, passing input_batch_size, but it returns the same context length error. I think it's as if the input_batch_size parameter doesn't work: it loads all the data in one batch, which is too long to process.

plaguss commented 3 weeks ago

Can you do a dry_run and check that the pipeline works with a single example? The examples are working, so I would think something is happening with the data. If the dry_run works, you should take a look at your dataset, because there may be long examples, and UltraFeedback needs to feed several of them to the LLM at the same time.

riccardo-unipg commented 3 weeks ago

Can you do a dry_run and check that the pipeline works with a single example? The examples are working, so I would think something is happening with the data. If the dry_run works, you should take a look at your dataset, because there may be long examples, and UltraFeedback needs to feed several of them to the LLM at the same time.

Same error with dry_run and small batches

riccardo-unipg commented 3 weeks ago

[08/19/24 09:59:33] INFO ['distilabel.pipeline'] 🌵 Dry run mode (base.py:377)
INFO ['distilabel.pipeline'] 📝 Pipeline data will be written to '/home/jovyan/.cache/distilabel/pipelines/DEITA/3be6bdb6e7dc323cf5a8ac7fe369bd0bcb7c6072/data/steps_outputs' (base.py:724)
INFO ['distilabel.pipeline'] ⌛ The steps of the pipeline will be loaded in stages: (base.py:733)

riccardo-unipg commented 3 weeks ago

If this can help you, I have noticed the same error in the EvolInstruct module when setting store_evolutions = False. Is it possible that there is a problem when handling data that is not in a list but only a string?
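The string-vs-list suspicion is easy to illustrate with toy data (this shows only the shape of the `generations` field, not distilabel internals):

```python
# UltraFeedback expects "generations" to be a list of candidate answers.
# If a plain string ends up where a list is expected, code that iterates
# over "each generation" sees one element per *character*, which can blow
# up the number (and size) of messages sent to the LLM.

as_list = {"instruction": "Q?", "generations": ["answer one", "answer two"]}
as_string = {"instruction": "Q?", "generations": "answer one answer two"}

print(len(as_list["generations"]))    # 2 elements: two candidate answers
print(len(as_string["generations"]))  # 21 elements: characters, not answers
```

Iterating over a string yields characters rather than answers, which is consistent with a request growing far beyond what any single example should produce.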

plaguss commented 3 weeks ago

The dry_run takes a single example by default via input_batch_size. Can you check/share the first example in your distilabel_dataset.jsonl file?

riccardo-unipg commented 3 weeks ago

I share with you the HF link to the dataset: https://huggingface.co/datasets/Magic-Ric/prova

plaguss commented 3 weeks ago

The dataset is not following the format expected by UltraFeedback. The generations column should be a list with the strings you want to pass. You need to update the dataset accordingly:

from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub, LoadDataFromDicts
from distilabel.steps.tasks import UltraFeedback

import os
# os.environ["OPENAI_API_KEY"] = "EMPTY"

with Pipeline(name="DEITA") as pipeline:
    # loader = LoadDataFromHub(repo_id="Magic-Ric/prova")
    loader = LoadDataFromDicts(
        data=[
            {
                "instruction": "Quali sono le principali differenze tra ictus ischemico e ictus emorragico in termini di cause, sintomi e trattamenti?",
                "generations": [
                    """Le differenze fondamentali tra ictus ischemico e ictus emorragico risiedono nelle cause, nei sintomi e nei trattamenti. L'ictus ischemico, che rappresenta circa l'80% degli eventi cerebrovascolari acuti, è causato da un'occlusione dei vasi sanguigni cerebrali a causa di un trombo o di un embolo, che porta a una riduzione o interruzione del flusso sanguigno cerebrale. Al contrario, l'ictus emorragico è causato da una rottura della parete dei vasi sanguigni cerebrali, che porta a un'emorragia all'interno del cervello. I sintomi dell'ictus ischemico possono includere debolezza o paralisi di un lato del corpo, difficoltà di parola, problemi di visione e perdita di equilibrio. I sintomi dell'ictus emorragico possono essere più gravi e includono cefalea intensa, vomito, confusione e perdita di coscienza. Il trattamento dell'ictus ischemico prevede l'uso di farmaci anticoagulanti e trombolitici per sciogliere il trombo e ripristinare il flusso sanguigno cerebrale. Il trattamento dell'ictus emorragico prevede l'uso di farmaci per controllare la pressione sanguigna e prevenire ulteriori emorragie, nonché interventi chirurgici per rimuovere il sangue accumulato nel cervello. È importante notare che entrambi i tipi di ictus possono avere conseguenze gravi e permanenti, come la disabilità e la morte. Pertanto, è fondamentale riconoscere i sintomi dell'ictus e cercare immediatamente assistenza medica in caso di sospetto. Inoltre, la prevenzione delle malattie cerebrovascolari attraverso la gestione dei fattori di rischio, come l'ipertensione, il diabete e il fumo, può aiutare a ridurre il rischio di ictus.""",
                    """Le differenze principali tra ictus lacunare e ictus cardioembolico possono essere analizzate in base a fattori di rischio, manifestazioni cliniche, strategie di prevenzione e possibili complicazioni a lungo termine, tenendo conto delle diverse popolazioni di pazienti e delle variazioni nella presentazione dei sintomi. In termini di fattori di rischio, l'ictus lacunare è spesso associato a ipertensione, diabete mellito e ipercolesterolemia, mentre l'ictus cardioembolico è più frequentemente legato a patologie cardiache come la fibrillazione atriale, la valvulopatia mitralica e la cardiomiopatia. Inoltre, l'ictus lacunare tende a colpire persone più giovani e di sesso maschile, mentre l'ictus cardioembolico è più comune nelle persone anziane e di sesso femminile. Le manifestazioni cliniche dell'ictus lacunare sono spesso caratterizzate da sintomi focali, come la paralisi di un arto o la perdita della sensibilità, mentre l'ictus cardioembolico può presentarsi con sintomi più diffusi, come la confusione, la perdita di coscienza e la paralisi di più arti. Inoltre, l'ictus lacunare tende a essere più lieve e reversibile, mentre l'ictus cardioembolico può essere più grave e avere conseguenze a lungo termine più significative. Le strategie di prevenzione per l'ictus lacunare si concentrano sull'controllo dei fattori di rischio, come la pressione sanguigna e il diabete, mentre la prevenzione dell'ictus cardioembolico richiede un approccio più complesso, che include la gestione delle patologie cardiache sottostanti e l'uso di anticoagulanti per prevenire la formazione di trombi. In termini di complicazioni a lungo termine, l'ictus lacunare può portare a deficit cognitivi e motori persistenti, mentre l'ictus cardioembolico può aumentare il rischio di insufficienza cardiaca, ar"""
                ]
            }
        ]
    )
    llm = InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    )

    ultrafeedback = UltraFeedback(
        llm=llm,
        aspect="truthfulness",
        # input_mappings={"instruction": "evolved_instructions", "generations": "evolved_responses"},
        # output_mappings={"model_name": "ultrafeedback_model"},
        input_batch_size=8
    )

    loader >> ultrafeedback

if __name__ == "__main__":
    distiset = pipeline.dry_run()
    print(distiset)

riccardo-unipg commented 3 weeks ago

Okay, it works. Thank you all!