argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

[DOCS] backtranslation docs #613

Closed: elaaaf closed this issue 6 months ago

elaaaf commented 6 months ago

Which page or section is this issue related to?

https://distilabel.argilla.io/latest/sections/papers/instruction_backtranslation

What are you documenting, or what change are you making in the documentation?

Hello, thank you for your great work. I came across this page, which implements the backtranslation paper, and found that the prompt is used as an input where it should be the completion. Is this a bug? Or was the code only meant to validate the prompt? If so, it would be best if the documentation stated that it is only for validation.

plaguss commented 6 months ago

cc @alvarobartt

alvarobartt commented 6 months ago

Hi here @elaaaf! So we are indeed implementing the "self-curation" step defined in https://arxiv.org/pdf/2308.06259, which evaluates the quality of an instruction-completion pair as seen in the table from the paper shown below.

[Image: table from the paper with the prompt used to score the quality of instruction-completion pairs during self-curation]

alvarobartt commented 6 months ago

Also note that the inputs of InstructionBacktranslation are both the instruction and the generation, which are substituted into the prompt template shown above as <generated_instruction> and output, respectively.
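
For reference, the task can also be run standalone, outside of a pipeline. This is only a minimal sketch, assuming the distilabel 1.x task API (load() plus process()) and an OpenAI API key in the environment; the example row is made up:

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import InstructionBacktranslation

task = InstructionBacktranslation(
    name="instruction_backtranslation",
    llm=OpenAILLM(model="gpt-4"),
)
task.load()

# "instruction" and "generation" are the two required inputs of the task
result = next(
    task.process(
        [{"instruction": "What is 2 + 2?", "generation": "2 + 2 equals 4."}]
    )
)
# each row now also includes "score", "reason" and "model_name"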

alvarobartt commented 6 months ago

Here's the code from the documentation, but with code comments showing the inputs and outputs at each stage, in case this makes it clearer 👍🏻

from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import InstructionBacktranslation, TextGeneration

with Pipeline(name="self-alignment-with-instruction-backtranslation") as pipeline:
    # inputs: none
    # outputs: instruction
    load_hub_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    # inputs: instruction
    # outputs: generation, generation_model
    text_generation = TextGeneration(
        name="text_generation",
        llm=InferenceEndpointsLLM(
            base_url="<INFERENCE_ENDPOINT_URL>",
            tokenizer_id="argilla/notus-7b-v1",
            model_display_name="argilla/notus-7b-v1",
        ),
        input_batch_size=10,
        output_mappings={"model_name": "generation_model"},
    )
    load_hub_dataset.connect(text_generation)

    # inputs: instruction, generation
    # outputs: score, reason, scoring_model
    instruction_backtranslation = InstructionBacktranslation(
        name="instruction_backtranslation",
        llm=OpenAILLM(model="gpt-4"),
        input_batch_size=10,
        output_mappings={"model_name": "scoring_model"},
    )
    text_generation.connect(instruction_backtranslation)
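
For completeness, running the pipeline would then look roughly like the following. This is a sketch rather than code from the docs: the repo_id is a placeholder for any dataset with a "prompt" column, the generation_kwargs values are arbitrary, and it assumes distilabel 1.x's runtime-parameters API, where parameters are keyed by step name:

distiset = pipeline.run(
    parameters={
        "load_dataset": {
            "repo_id": "<HF_DATASET_REPO_ID>",  # placeholder: a dataset with a "prompt" column
            "split": "train",
        },
        "text_generation": {
            # arbitrary sampling settings for the generation step
            "llm": {"generation_kwargs": {"max_new_tokens": 1024, "temperature": 0.7}},
        },
    }
)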

alvarobartt commented 6 months ago

We are also aware that there are parts that are not covered in the distilabel implementation, so what you are asking is whether we could implement those? Is there anything in particular you're interested in? Just let us know and we can clarify and extend the current implementation, as yes, we're only implementing the self-curation step from this specific paper.

e.g. we implement step 2 (self-curation), but step 1 is missing as of the docs:

1. Self-augment: Generate instructions for unlabelled data, i.e. the web corpus, to produce candidate training data of (instruction, output) pairs for instruction tuning.
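
To make the missing step concrete, here is a hypothetical sketch of what self-augmentation amounts to: prompting an LLM to backtranslate an unlabelled document into the instruction it would answer. The prompt wording and the document are made up, and it only assumes distilabel's generic LLM.generate API, not a dedicated task:

from distilabel.llms import OpenAILLM

llm = OpenAILLM(model="gpt-4")
llm.load()

# an unlabelled document from the web corpus (made-up example)
document = "Mitochondria generate most of the cell's supply of ATP."

# ask the model to write the instruction this document would answer
prompt = (
    "Below is the output of an instruction-following model. Write the "
    f"instruction it most plausibly answers.\n\nOutput:\n{document}\n\nInstruction:"
)
generations = llm.generate(inputs=[[{"role": "user", "content": prompt}]])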

elaaaf commented 6 months ago

We are also aware that there are parts that are not covered in the distilabel implementation, so what you are asking is whether we could implement those? Is there anything in particular you're interested in? Just let us know and we can clarify and extend the current implementation, as yes, we're only implementing the self-curation step from this specific paper.

Thank you so much for the clarification @alvarobartt! I don't actually need any additional parts implemented at this time; it's just that I spent a couple of hours understanding the code. It would be really helpful if you could update the docs to explicitly state that the current implementation covers only the self-curation step from the paper. It would definitely help anyone reading the page.

alvarobartt commented 6 months ago

Fair @elaaaf! We'll do so, as well as implement the remaining step before the next release, since it can also add value! Thanks for opening the issue; we'll close it once the docs are updated and the integration is extended to cover the whole paper!

alvarobartt commented 6 months ago

I've just fixed the documentation to explain that we only implement the self-curation part, @elaaaf. I'll try to find some time in the upcoming weeks to add the full reproduction, but for the moment the docs are clarified! Thanks :)