deep-diver / llamaduo

This project showcases an LLMOps pipeline that fine-tunes a small LLM so it can serve as a fallback during outages of a service LLM.
https://huggingface.co/papers/2408.13467
Apache License 2.0

Initial questions #1

Closed by sayakpaul 5 months ago

sayakpaul commented 7 months ago

I think we need to make the scope of the coverage dataset a little more well-defined. From what I understand, the coverage dataset should be used during evaluation. Or will it also be used to seed the generation of the synthetic dataset? In that case, aren't we contaminating the data by introducing some sort of leakage?

sayakpaul commented 7 months ago

What could make sense is that we first tackle each of the steps defined here as standalone modules and then collate them later.

deep-diver commented 7 months ago

Here is the summary; I'll also add some more explanation about the coverage dataset to the README.

In summary:
- the coverage dataset is split into seed/valid datasets
- the seed dataset is used to generate more similarly styled synthetic data
- the valid dataset is used to evaluate a fine-tuned LLM
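The split above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the function name, the 80/20 fraction, and the `prompt-N` placeholder items are all assumptions. The key property it demonstrates is that the seed and valid subsets stay disjoint, which is what avoids the leakage concern raised earlier.

```python
import random

def split_coverage(coverage, valid_fraction=0.2, seed=42):
    """Split a coverage dataset into disjoint (seed, valid) subsets.

    Hypothetical sketch: the seed subset feeds synthetic-data generation,
    while the valid subset is held out to evaluate the fine-tuned LLM.
    Keeping them disjoint prevents evaluation data from leaking into
    the generated training data.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = coverage[:]             # copy so the input stays untouched
    rng.shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]  # (seed_set, valid_set)

# Hypothetical coverage prompts, purely for illustration.
coverage = [f"prompt-{i}" for i in range(10)]
seed_set, valid_set = split_coverage(coverage)
```

With 10 items and `valid_fraction=0.2`, this yields 8 seed items and 2 held-out valid items, with no overlap between the two.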

Like you said, let's break down the whole thing into smaller pieces! I made a TODO list — what do you think about it? Should we break it down even further?

sayakpaul commented 7 months ago

Sounds good to me.

We need to come up with the coverage dataset first. Then the dataset generation process itself shouldn't take long and we will be able to move forward fairly quickly I reckon.

Let's continue to discuss the coverage dataset.