Closed riccardo-unipg closed 3 weeks ago
Hi @riccardo-unipg, I cannot see the examples in the `distilabel_dataset.jsonl` file. The error seems related to the length of the messages you have stored in the file. Maybe when you use `EvolInstruct` the error is controlled there; I will review it to see if that's the case. In the meantime, you could limit the length of the column you are using, for example with your model's tokenizer, to cap the number of tokens per example so they fit in the context window.
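A minimal sketch of that idea (the function names and the 2048-token cap are made up for illustration; tokens are approximated here by whitespace splitting so the snippet stays self-contained — in practice you would use your model's own tokenizer, e.g. via `transformers.AutoTokenizer`):

```python
import json

def truncate_to_max_tokens(text: str, max_tokens: int) -> str:
    """Keep roughly the first `max_tokens` tokens of `text`.

    Whitespace splitting is only an approximation of real tokenization.
    """
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[:max_tokens])

def shrink_dataset(in_path: str, out_path: str, max_tokens: int = 2048) -> None:
    """Rewrite a JSON-lines dataset, truncating the `instruction` column."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            row = json.loads(line)
            row["instruction"] = truncate_to_max_tokens(row["instruction"], max_tokens)
            fout.write(json.dumps(row) + "\n")
```
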
Additionally, the `UltraFeedback` task expects the following format: https://distilabel.argilla.io/dev/components-gallery/tasks/ultrafeedback/#examples. Could you double-check that your dataset adheres to that format? (Given the error, this doesn't seem to be the case.)
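For reference, a minimal example of the shape described in that page (the values here are invented): `instruction` is a single string, and `generations` is a list of strings, one entry per generation to rate.

```python
# Hypothetical row in the UltraFeedback input format: `instruction` is a
# string, `generations` is a LIST of candidate responses to rate.
row = {
    "instruction": "What is the capital of France?",
    "generations": [
        "The capital of France is Paris.",
        "France's capital city is Lyon.",
    ],
}

assert isinstance(row["instruction"], str)
assert isinstance(row["generations"], list)
assert all(isinstance(g, str) for g in row["generations"])
```
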
This model's maximum context length is 24288 tokens
Hi, thanks for the reply. I got confused and marked this issue as completed by accident :(. Anyway, regarding the format that UltraFeedback wants, I made sure it was the right one through the line `input_mappings={"instruction": "evolved_instructions", "generations": "answers"}`; in fact, it doesn't give me any error of that type.
Regarding the context length, what amazes me is that the model's maximum context length is 24288 tokens. It's really big, and it would never be filled by one `evolved_instructions` and one `answers` value; that's what I can't understand. I also tried reducing the batch size to 1, but it always gives me the same error.
Hi @riccardo-unipg, can you try removing the `loader.load()` line of code? As you're not using `LoadDataFromFileSystem` as a standalone component, you shouldn't be calling `load` manually; it will be called automatically by the `Pipeline`. Also, check the content of the batches yielded by the `LoadDataFromFileSystem` step when used in combination with your `./dataset/distilabel_dataset.jsonl` file. It may be that it's not reading the data correctly and is returning one super long text that is the whole content of the file.
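One quick way to run that check (a sketch assuming the file really is JSON lines; the helper name is made up): report the length of every field in the first few rows, which makes it obvious if one row accidentally holds the whole file as a single long string.

```python
import json

def field_lengths(path: str, max_rows: int = 3) -> list[dict]:
    """Return, for each of the first `max_rows` rows, a mapping of
    field name -> length of that field's string representation."""
    report = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= max_rows:
                break
            row = json.loads(line)
            report.append({key: len(str(value)) for key, value in row.items()})
    return report
```
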
When I remove `loader.load()` I get: `ValueError: Dataset not loaded yet, you must call load method first.`
My question is: does UltraFeedback only work in a pipeline with an evolving module like EvolInstruct or EvolQuality, or can I use it as in this case, with the dataset already ready?
I also tried with `LoadDataFromHub()` and removed `loader.load()`, but it gives me the same context length error.
> When I remove `loader.load()` I get: `ValueError: Dataset not loaded yet, you must call load method first.`
Hi @riccardo-unipg, this seems like a bug; it should be solved in the `develop` branch, but I will check it.
UPDATE -> the bug is solved in `develop`; using `LoadDataFromFileSystem` should work normally.
> My question is: does UltraFeedback only work in a pipeline with an evolving module like EvolInstruct or EvolQuality, or can I use it with the dataset already ready?

The `UltraFeedback` task can work without `EvolInstruct`. Can you test it outside of the pipeline with a sample row of your dataset, as in this example from the documentation? Update it to the LLM you are using and let's see if it works.
> Hi @riccardo-unipg, this seems like a bug [...] UPDATE -> the bug is solved in `develop`; using `LoadDataFromFileSystem` should work normally.
It still doesn't work; it always gives the same error. I also removed the `loader.load()` instruction, which now doesn't give any more errors, but the main error remains, that is, the context length.
> The `UltraFeedback` task can work without `EvolInstruct`. Can you test it outside of the pipeline with a sample row of your dataset, as in this example from the documentation?
Yeah, it works.
I tried to process the whole dataset outside the pipeline, passing `input_batch_size`, but it returns the same context length error. I think the `input_batch_size` parameter isn't working: it loads all the data in one batch, which is too long to process.
Can you do a `dry_run` and check that the pipeline works with a single example? The examples are working, and I would think there's something happening with the data. If the `dry_run` works, you should take a look at your dataset, because there can be long examples, and `UltraFeedback` needs to feed several of them to the LLM at the same time.
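One way to look for those long examples (a sketch assuming a JSON-lines file; the helper name is made up): rank rows by total character length, since UltraFeedback puts the instruction plus every generation into one prompt, so a handful of very long rows can blow past the context window.

```python
import json

def longest_rows(path: str, top: int = 5) -> list[tuple[int, int]]:
    """Return (total_length, row_index) pairs for the `top` longest rows,
    where total_length sums the string lengths of all fields in the row."""
    totals = []
    with open(path) as f:
        for i, line in enumerate(f):
            row = json.loads(line)
            totals.append((sum(len(str(v)) for v in row.values()), i))
    return sorted(totals, reverse=True)[:top]
```
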
Same error with `dry_run` and small batches:
```
[08/19/24 09:59:33] INFO ['distilabel.pipeline'] 🌵 Dry run mode (base.py:377)
INFO ['distilabel.pipeline'] 📝 Pipeline data will be written to '/home/jovyan/.cache/distilabel/pipelines/DEITA/3be6bdb6e7dc323cf5a8ac7fe369bd0bcb7c6072/data/steps_outputs' (base.py:724)
INFO ['distilabel.pipeline'] ⌛ The steps of the pipeline will be loaded in stages: 'expand_evolved_instructions' replicas: 1/1 (base.py:733)
INFO ['distilabel.pipeline'] ✅ All the steps from stage 0 have been loaded! (base.py:994)
INFO ['distilabel.step.load_data_from_file_system_0'] 🧬 Starting yielding batches from generator step 'load_data_from_file_system_0'. Offset: 0 (step_wrapper.py:167)
INFO ['distilabel.step.load_data_from_file_system_0'] 📨 Step 'load_data_from_file_system_0' sending batch 0 to output queue (step_wrapper.py:274)
INFO ['distilabel.step.load_data_from_file_system_0'] 🏁 Finished running step 'load_data_from_file_system_0' (replica ID: 0) (step_wrapper.py:127)
INFO ['distilabel.step.evol_instruction_complexity'] 📦 Processing batch 0 in 'evol_instruction_complexity' (replica ID: 0) (step_wrapper.py:217)
INFO ['distilabel.step.evol_instruction_complexity'] 📨 Step 'evol_instruction_complexity' sending batch 0 to output queue (step_wrapper.py:274)
INFO ['distilabel.step.evol_instruction_complexity'] 🏁 Finished running step 'evol_instruction_complexity' (replica ID: 0) (step_wrapper.py:127)
INFO ['distilabel.step.ultrafeedback'] 📦 Processing batch 0 in 'ultrafeedback' (replica ID: 0) (step_wrapper.py:217)
WARNING ['distilabel.step.ultrafeedback'] ⚠️ Processing batch 0 with step 'ultrafeedback' failed. Sending empty batch filled with `None`s... (step_wrapper.py:240)
WARNING ['distilabel.step.ultrafeedback'] Subprocess traceback: (step_wrapper.py:244)
Traceback (most recent call last):
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/pipeline/step_wrapper.py", line 228, in _non_generator_process_loop
    result = next(step.process_applying_mappings(*batch.data))
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/steps/base.py", line 638, in process_applying_mappings
    for output_rows in generator:
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/steps/tasks/base.py", line 267, in process
    outputs = self.llm.generate(
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/llms/base.py", line 357, in generate
    return self.event_loop.run_until_complete(
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/nest_asyncio.py", line 98, in run_until_complete
    return f.result()
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/asyncio/tasks.py", line 279, in __step
    result = coro.throw(exc)
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/llms/base.py", line 327, in _agenerate
    return await asyncio.gather(*tasks)
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/asyncio/tasks.py", line 349, in __wakeup
    future.result()
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/asyncio/tasks.py", line 277, in __step
    result = coro.send(None)
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/llms/openai.py", line 268, in agenerate
    completion = await self._aclient.chat.completions.create(**kwargs)  # type: ignore
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/openai/resources/chat/completions.py", line 1295, in create
    return await self._post(
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/openai/_base_client.py", line 1826, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/openai/_base_client.py", line 1519, in request
    return await self._request(
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/openai/_base_client.py", line 1620, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 24288 tokens. However, you requested 175968 tokens (175456 in the messages, 512 in the completion). Please reduce the length of the messages or completion.", 'type': 'BadRequestError', 'param': None, 'code': 400}
INFO ['distilabel.step.ultrafeedback'] 📨 Step 'ultrafeedback' sending batch 0 to output queue (step_wrapper.py:274)
INFO ['distilabel.step.ultrafeedback'] 🏁 Finished running step 'ultrafeedback' (replica ID: 0) (step_wrapper.py:127)
INFO ['distilabel.step.expand_evolved_instructions'] 📦 Processing batch 0 in 'expand_evolved_instructions' (replica ID: 0) (step_wrapper.py:217)
WARNING ['distilabel.step.expand_evolved_instructions'] ⚠️ Processing batch 0 with step 'expand_evolved_instructions' failed. Sending empty batch filled with `None`s... (step_wrapper.py:240)
[08/19/24 09:59:36] WARNING ['distilabel.step.expand_evolved_instructions'] Subprocess traceback: (step_wrapper.py:244)
Traceback (most recent call last):
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/pipeline/step_wrapper.py", line 228, in _non_generator_process_loop
    result = next(step.process_applying_mappings(*batch.data))
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/steps/base.py", line 638, in process_applying_mappings
    for output_rows in generator:
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/steps/columns/expand.py", line 111, in process
    yield [row for input in inputs for row in self._expand_columns(input)]
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/steps/columns/expand.py", line 111, in <listcomp>
    yield [row for input in inputs for row in self._expand_columns(input)]
  File "/home/jovyan/synthetic_data_workspace/synthetic_env/lib/python3.11/site-packages/distilabel/steps/columns/expand.py", line 126, in _expand_columns
    for item, expanded in zip_longest(*[data, expanded_rows], fillvalue=input):
TypeError: 'NoneType' object is not iterable
INFO ['distilabel.step.expand_evolved_instructions'] 📨 Step 'expand_evolved_instructions' sending batch 0 to output queue (step_wrapper.py:274)
INFO ['distilabel.step.expand_evolved_instructions'] 🏁 Finished running step 'expand_evolved_instructions' (replica ID: 0) (step_wrapper.py:127)
Generating train split: 1/0 [00:00<00:00, 122.03 examples/s]
```
If this can help you: I have noticed the same error in the `EvolInstruct` module when setting `store_evolutions = False`. Is it possible there is a problem when handling data that is not in a list but only a string?
The `dry_run` takes a single example by default via `input_batch_size`. Can you check/share the first example in your `distilabel_dataset.jsonl` file?
I'll share the HF link to the dataset with you: https://huggingface.co/datasets/Magic-Ric/prova
The dataset is not following the format expected by `UltraFeedback`: `generations` should be a list with the strings you want to pass. You need to update the dataset accordingly:
```python
from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub, LoadDataFromDicts
from distilabel.steps.tasks import UltraFeedback
import os

# os.environ["OPENAI_API_KEY"] = "EMPTY"

with Pipeline(name="DEITA") as pipeline:
    # loader = LoadDataFromHub(repo_id="Magic-Ric/prova")
    loader = LoadDataFromDicts(
        data=[
            {
                "instruction": "Quali sono le principali differenze tra ictus ischemico e ictus emorragico in termini di cause, sintomi e trattamenti?",
                "generations": [
                    """Le differenze fondamentali tra ictus ischemico e ictus emorragico risiedono nelle cause, nei sintomi e nei trattamenti. L'ictus ischemico, che rappresenta circa l'80% degli eventi cerebrovascolari acuti, è causato da un'occlusione dei vasi sanguigni cerebrali a causa di un trombo o di un embolo, che porta a una riduzione o interruzione del flusso sanguigno cerebrale. Al contrario, l'ictus emorragico è causato da una rottura della parete dei vasi sanguigni cerebrali, che porta a un'emorragia all'interno del cervello. I sintomi dell'ictus ischemico possono includere debolezza o paralisi di un lato del corpo, difficoltà di parola, problemi di visione e perdita di equilibrio. I sintomi dell'ictus emorragico possono essere più gravi e includono cefalea intensa, vomito, confusione e perdita di coscienza. Il trattamento dell'ictus ischemico prevede l'uso di farmaci anticoagulanti e trombolitici per sciogliere il trombo e ripristinare il flusso sanguigno cerebrale. Il trattamento dell'ictus emorragico prevede l'uso di farmaci per controllare la pressione sanguigna e prevenire ulteriori emorragie, nonché interventi chirurgici per rimuovere il sangue accumulato nel cervello. È importante notare che entrambi i tipi di ictus possono avere conseguenze gravi e permanenti, come la disabilità e la morte. Pertanto, è fondamentale riconoscere i sintomi dell'ictus e cercare immediatamente assistenza medica in caso di sospetto. Inoltre, la prevenzione delle malattie cerebrovascolari attraverso la gestione dei fattori di rischio, come l'ipertensione, il diabete e il fumo, può aiutare a ridurre il rischio di ictus.""",
                    """Le differenze principali tra ictus lacunare e ictus cardioembolico possono essere analizzate in base a fattori di rischio, manifestazioni cliniche, strategie di prevenzione e possibili complicazioni a lungo termine, tenendo conto delle diverse popolazioni di pazienti e delle variazioni nella presentazione dei sintomi. In termini di fattori di rischio, l'ictus lacunare è spesso associato a ipertensione, diabete mellito e ipercolesterolemia, mentre l'ictus cardioembolico è più frequentemente legato a patologie cardiache come la fibrillazione atriale, la valvulopatia mitralica e la cardiomiopatia. Inoltre, l'ictus lacunare tende a colpire persone più giovani e di sesso maschile, mentre l'ictus cardioembolico è più comune nelle persone anziane e di sesso femminile. Le manifestazioni cliniche dell'ictus lacunare sono spesso caratterizzate da sintomi focali, come la paralisi di un arto o la perdita della sensibilità, mentre l'ictus cardioembolico può presentarsi con sintomi più diffusi, come la confusione, la perdita di coscienza e la paralisi di più arti. Inoltre, l'ictus lacunare tende a essere più lieve e reversibile, mentre l'ictus cardioembolico può essere più grave e avere conseguenze a lungo termine più significative. Le strategie di prevenzione per l'ictus lacunare si concentrano sull'controllo dei fattori di rischio, come la pressione sanguigna e il diabete, mentre la prevenzione dell'ictus cardioembolico richiede un approccio più complesso, che include la gestione delle patologie cardiache sottostanti e l'uso di anticoagulanti per prevenire la formazione di trombi. In termini di complicazioni a lungo termine, l'ictus lacunare può portare a deficit cognitivi e motori persistenti, mentre l'ictus cardioembolico può aumentare il rischio di insufficienza cardiaca, ar"""
                ],
            }
        ]
    )
    llm = InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    )
    ultrafeedback = UltraFeedback(
        llm=llm,
        aspect="truthfulness",
        # input_mappings={"instruction": "evolved_instructions", "generations": "evolved_responses"},
        # output_mappings={"model_name": "ultrafeedback_model"},
        input_batch_size=8,
    )
    loader >> ultrafeedback

if __name__ == "__main__":
    distiset = pipeline.dry_run()
    print(distiset)
```
Okay, it works. Thank you all!
Describe the bug
I want to use the UltraFeedback task in a pipeline, but I already have the dataset, so the pipeline only loads the dataset and then passes it to the UltraFeedback module. Doing this gives the following error: WARNING ['distilabel.step.ultrafeedback'] ⚠️ Processing batch 12 with step 'ultrafeedback' failed. Sending empty batch filled with `None`s... (step_wrapper.py:240) WARNING ['distilabel.step.ultrafeedback'] Subprocess traceback: (step_wrapper.py:244)
To Reproduce Code to reproduce
Expected behaviour
If you notice, there are some commented parts in the code. If I use UltraFeedback with those commented parts, for example after the EvolInstruct module, it works very well. All I did was comment out the EvolInstruct module and connect the loader directly to UltraFeedback (because I already have the generated dataset), and it doesn't work anymore, giving a problem with the prompt length. I don't understand why.