argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0
1.45k stars 111 forks source link

[FEATURE] move `model_name` to `distilabel_metadata` dictionary with step name suffix #722

Open davidberenstein1957 opened 3 months ago

davidberenstein1957 commented 3 months ago

Is your feature request related to a problem? Please describe. Along with the addition of raw_output_<step_name> and the proposed raw_input_<step_name> (https://github.com/argilla-io/distilabel/issues/698). I think it would be nice to align this a bit more also move the model_name to the distilabel_metadata and add a suffix of the step name, restulting in model_name_<step_name>.

Describe the solution you'd like CURRENT

# [
#     {
#         "instruction": "What's the capital of Spain?",
#         "generation": "The capital of Spain is Madrid.",
#         "model_name": "gpt-4",
#         "distilabel_metadata": {
#             "raw_output_text-generation": "The capital of Spain is Madrid"
#         }
#     }
# ]

PROPOSED UPDATE

# [
#     {
#         "instruction": "What's the capital of Spain?",
#         "generation": "The capital of Spain is Madrid.",
#         "distilabel_metadata": {
#             "raw_output_text-generation": "The capital of Spain is Madrid",
#             "model_name-generation": "gpt-4"
#         }
#     }
# ]

Describe alternatives you've considered N.A:

Additional context N.A.

alvarobartt commented 3 months ago

Hi here! I don't think we should move that to metadata, we can indeed ALSO move it to metadata, but not to be the only place, since some tasks do expect the model or some sort of identifier for e.g. the generations, and we're already actively using this along the codebase. Maybe what we need to do is to improve the wording and how we communicate about the model_name being automatically injected and in pipelines with more than 1 sequential task needing a rename / output mapping.

cc @gabrielmbmb @plaguss to share their opinion too (if any)

davidberenstein1957 commented 3 months ago

Hi @alvarobartt, I did not it was being used internally. How are we using it exactly? I think moving it ALSO would help with some tracing intermediary steps, results etc, likely with similar / mostly the same reasoning as for the raw_output and raw_input usages.

alvarobartt commented 3 months ago

Hi @alvarobartt, I did not it was being used internally. How are we using it exactly? I think moving it ALSO would help with some tracing intermediary steps, results etc, likely with similar / mostly the same reasoning as for the raw_output and raw_input usages.

Oops sorry, maybe my wording was not the best! I meant we use that in some examples, tutorials, etc. as well as optional inputs in the preference / rating / ranking steps, so that would lead to potential conflicts with existing pipelines that do use that model_name or whatever rename is applied to that property. But I agree that adding that within the metadata makes a lot of sense, I was just advocating for not removing the current model_name but just copying it over to the metadata aka. raw outputs.

plaguss commented 3 months ago

Hi! related to this change, we initially moved the model_name to the distilabel_metadata column but decided to keep it in its own column because it can be used easily with CombineColumns for example, what would yield a dataset similar to the UltraFeedback format for example. Also, having the model_name in it's own column makes it easier to update the column name if you want to store the column (just renaming the column):

text_generation = TextGeneration(
    llm=...,
    output_mappings={"model_name": "generation_model"},
)

or for multiple models, if one just let's the column be overridden and removes the model_name column at the end, less work to do.