Open davidberenstein1957 opened 3 months ago
Hi here! I don't think we should move that to metadata, we can indeed ALSO move it to metadata, but not to be the only place, since some tasks do expect the model or some sort of identifier for e.g. the generations, and we're already actively using this along the codebase. Maybe what we need to do is to improve the wording and how we communicate about the model_name
being automatically injected and in pipelines with more than 1 sequential task needing a rename / output mapping.
cc @gabrielmbmb @plaguss to share their opinion too (if any)
Hi @alvarobartt, I did not it was being used internally. How are we using it exactly? I think moving it ALSO would help with some tracing intermediary steps, results etc, likely with similar / mostly the same reasoning as for the raw_output
and raw_input
usages.
Hi @alvarobartt, I did not it was being used internally. How are we using it exactly? I think moving it ALSO would help with some tracing intermediary steps, results etc, likely with similar / mostly the same reasoning as for the raw_output and raw_input usages.
Oops sorry, maybe my wording was not the best! I meant we use that in some examples, tutorials, etc. as well as optional inputs in the preference / rating / ranking steps, so that would lead to potential conflicts with existing pipelines that do use that model_name
or whatever rename is applied to that property. But I agree that adding that within the metadata
makes a lot of sense, I was just advocating for not removing the current model_name
but just copying it over to the metadata
aka. raw outputs.
Hi! related to this change, we initially moved the model_name
to the distilabel_metadata
column but decided to keep it in its own column because it can be used easily with CombineColumns
for example, what would yield a dataset similar to the UltraFeedback
format for example. Also, having the model_name
in it's own column makes it easier to update the column name if you want to store the column (just renaming the column):
text_generation = TextGeneration(
llm=...,
output_mappings={"model_name": "generation_model"},
)
or for multiple models, if one just let's the column be overridden and removes the model_name
column at the end, less work to do.
Is your feature request related to a problem? Please describe. Along with the addition of
raw_output_<step_name>
and the proposedraw_input_<step_name>
(https://github.com/argilla-io/distilabel/issues/698). I think it would be nice to align this a bit more also move themodel_name
to thedistilabel_metadata
and add a suffix of the step name, restulting inmodel_name_<step_name>
.Describe the solution you'd like CURRENT
PROPOSED UPDATE
Describe alternatives you've considered N.A:
Additional context N.A.