IBM / Dromedary

Dromedary: towards helpful, ethical and reliable LLMs.
GNU General Public License v3.0

About vicuna_dummy_data.json lacking 'example_id' #14

Open Harry-mic opened 12 months ago

Harry-mic commented 12 months ago

Hi! I encountered a bug in step 3 (Principle Engraving). I used self_align_merged.json, which is created from "self_align32shards*.jsonl" and "vicuna_dummy_data.json", to fine-tune the base model.

However, I find that the items in the vicuna_dummy_data.json file do not have an 'example_id' field. This causes a bug when the function "extract_dromedary_dataset" is executed:

def extract_dromedary_dataset(example, meta_prompts):
    assert "example_id" in example
    total_meta_prompt = len(meta_prompts)
    meta_prompt = meta_prompts[int(example["example_id"]) % total_meta_prompt]
    if example.get("input", "") != "":
        prompt_format = DROMEDARY_PROMPT_DICT["prompt_input"]
    else:
        prompt_format = DROMEDARY_PROMPT_DICT["prompt_no_input"]
    return {
        "input": prompt_format.format(meta_prompt=meta_prompt, **example),
        "output": "\n" + example["output"],
    }

The vicuna_dummy_data items all have "example_id" set to None, which causes an error when int() is applied to it.
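
For reference, a minimal sketch of the failure (the field values here are made up; only "example_id": None matters):

example = {"instruction": "What is your name?", "input": "", "output": "...", "example_id": None}
int(example["example_id"])  # raises TypeError because example_id is None, not an integer or numeric string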

Therefore, I wonder how to deal with this issue and correctly obtain the example_ids for the vicuna_dummy_data. Thanks a lot for your reply!

Edward-Sun commented 12 months ago

Hi,

For now, please try the following code as a replacement for the line meta_prompt = meta_prompts[int(example["example_id"]) % total_meta_prompt]. We will add a commit to fix the issue soon.

# Fall back to example_id = 0 when the field is missing or None
# (as in vicuna_dummy_data.json).
example_id = 0
try:
    example_id = int(example["example_id"])
except (KeyError, TypeError, ValueError):
    pass
meta_prompt = meta_prompts[example_id % total_meta_prompt]
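
Alternatively, here is a sketch (not part of the official pipeline; it assumes the file is a JSON list of example dicts, and the output file name is just an example) that assigns integer example_ids to the dummy data before merging, so the original line works unchanged:

import json

# Load the dummy data, give each item a distinct integer example_id,
# and write it back out for the merge step.
with open("vicuna_dummy_data.json") as f:
    dummy_data = json.load(f)

for i, item in enumerate(dummy_data):
    item["example_id"] = i

with open("vicuna_dummy_data_with_ids.json", "w") as f:
    json.dump(dummy_data, f, indent=2)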

Best, Zhiqing

Harry-mic commented 12 months ago

Thanks a lot for your reply and quick revision!

So with this fix, all the unlabeled vicuna_dummy_data items are tagged with example_id = 0? I wonder what the point is of tagging all the vicuna_dummy_data with the same example_id while the self_align data items are tagged with different example_ids. Also, I notice the vicuna_dummy_data are nearly all short conversations, so there seems to be a significant difference in quality between the vicuna_dummy_data and the self_align data.

By the way, do you run inference with llama-2-70b but fine-tune with llama-2-70b-hf? I notice a difference in how the model is loaded for inference versus fine-tuning.

I'd appreciate your help!

Edward-Sun commented 12 months ago

Hi Harry,

In our codebase, "example_id" only determines which prompt template is used, so it should not affect performance much.

Also, if you inspect the data, you will find that the vicuna_dummy_data only contain identity questions, so that the model generates correct outputs when asked about its name or developers. In this case, it should not hurt the model's performance.
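
To make this concrete, here is a toy sketch (with placeholder strings, not the actual Dromedary meta prompts) of how example_id only selects which template is used:

meta_prompts = ["<meta prompt 0>", "<meta prompt 1>", "<meta prompt 2>"]
total_meta_prompt = len(meta_prompts)

# self_align items carry distinct example_ids, so they cycle through the templates.
for example_id in range(5):
    print(example_id, meta_prompts[example_id % total_meta_prompt])

# vicuna_dummy_data items fall back to example_id = 0 and always get the first template.
print(meta_prompts[0 % total_meta_prompt])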

By the way, do you run inference with llama-2-70b but fine-tune with llama-2-70b-hf?

We use the original llama checkpoint (i.e., llama-2-70b) for model-parallel inference (with the original llama codebase). For fine-tuning, llama-2-70b-hf is used, since we rely on DeepSpeed (in Dromedary-1) or QLoRA (in Dromedary-2).
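
For reference, a minimal sketch (not our exact training script; the LoRA hyperparameters and target modules below are just placeholders) of loading llama-2-70b-hf in 4-bit for QLoRA-style fine-tuning with transformers + peft:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the Hugging Face checkpoint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach LoRA adapters to the attention projections; only these are trained.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)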

Harry-mic commented 12 months ago

Thanks a lot for your explanation!

Did you choose the llama-2-70b checkpoint code rather than the Hugging Face code because its inference is faster? The past_key_values cache in the Hugging Face code is also a problem.

Edward-Sun commented 11 months ago

Yes, when we developed this project around March/April, the faster inference frameworks for llama (e.g., TGI and vLLM) had not been released yet, so we did our best with a customized llama implementation with native model parallelism to improve generation throughput.