IBM / Dromedary

Dromedary: towards helpful, ethical and reliable LLMs.
GNU General Public License v3.0

About the way to generate 99,121 synthetic prompts from TGRT Self-Instruct #13

Open Harry-mic opened 10 months ago

Harry-mic commented 10 months ago

Hello, thanks for your awesome work and code!

However, I am confused about how you generated the TGRT Self-Instruct data. The article says that you first handwrote 20 instruction types and then generated topics from these types. Finally, instructions were generated from each "instruction type - topic" pair.

Therefore, my first question is: how many topics did you generate for each instruction type? I see in Appendix G that your prompt generates 10 topics for each instruction type.

My second question is: how many instructions are generated for each "instruction type - topic" pair? You ended up with 99,121 synthetic prompts from TGRT Self-Instruct, so if every "instruction type - topic" pair generates only one instruction, does that mean you generated at least 99,121 topics?

Thanks a lot for your help!

Edward-Sun commented 10 months ago

Hi Harryis,

Yes. We generated around 120k synthetic topics (after filtering on the topics) from TGRT Self-Instruct, generated the corresponding 120k prompts, and did some filtering on the prompts to get the final 99k prompts.

As can be seen from the code, we randomly sample from the existing topics to construct new topics. So ideally we would get the same number of topics for each instruction type, but the actual numbers differ because of filtering.
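As a rough sanity check on the numbers in this thread, the arithmetic can be sketched as below. It treats "around 120k" as exactly 120,000 and assumes an even split across the 20 instruction types; both are simplifying assumptions, not figures taken from the repository.

```python
# Back-of-the-envelope check of the counts quoted in this thread.
NUM_INSTRUCTION_TYPES = 20   # handwritten instruction types
APPROX_TOPICS = 120_000      # "around 120k" topics, rounded (assumption)
FINAL_PROMPTS = 99_121       # prompts left after filtering

# Ideal number of topics per instruction type if the split were even.
topics_per_type = APPROX_TOPICS / NUM_INSTRUCTION_TYPES

# Approximate fraction of prompts that survive prompt filtering,
# given one prompt per "instruction type - topic" pair.
retention = FINAL_PROMPTS / APPROX_TOPICS

print(f"topics per type (ideal): {topics_per_type:.0f}")
print(f"approximate prompt retention: {retention:.1%}")
```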

Harry-mic commented 10 months ago

Thanks a lot for your reply!

It's clear to me that you use 20 instruction types to generate 120k synthetic topics, which means each instruction type generates about 120k / 20 = 6k topics. However, how do you "randomly sample topics from existing topics to construct new topics"? As far as I know, the topic generation prompt below doesn't involve selecting existing topics; it only involves one of the 20 instruction types:


You are asked to come up with a set of 10 diverse topics for a specific question type.

Here are the requirements:

1. Try not to repeat the words for each topic to maximize diversity.
2. Each topic should contain up to three words.
3. Each topic should be a noun phrase, its first word should be capitalized.
4. The topics should be closely related to the given question type: {}.

List of 10 topics:

Or do you insert existing topics as in-context (ICL) examples under "List of 10 topics:", so that the topic examples are updated iteratively?

Thanks a lot for your help!

Edward-Sun commented 10 months ago

Hi Harryis,

We generate the topics in several rounds (called generation_epoch in the code). In each round, we sample from all the topics produced in the previous rounds and use them as seeds to produce new topics.
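The round-based procedure described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `build_prompt`, `generate_topics`, the `complete` callback (a stand-in for the language model call), and the `num_seeds` parameter are all hypothetical names, and placing the sampled seeds as numbered items under "List of 10 topics:" is an assumption about how they enter the prompt.

```python
import random

# Topic generation prompt quoted earlier in this thread.
TOPIC_PROMPT = (
    "You are asked to come up with a set of 10 diverse topics "
    "for a specific question type.\n\n"
    "Here are the requirements:\n\n"
    "1. Try not to repeat the words for each topic to maximize diversity.\n"
    "2. Each topic should contain up to three words.\n"
    "3. Each topic should be a noun phrase, its first word should be capitalized.\n"
    "4. The topics should be closely related to the given question type: {}.\n\n"
    "List of 10 topics:\n"
)


def build_prompt(instruction_type, seed_topics):
    """Fill in the question type and list the seed topics as examples (assumed layout)."""
    prompt = TOPIC_PROMPT.format(instruction_type)
    for i, topic in enumerate(seed_topics, start=1):
        prompt += f"{i}. {topic}\n"
    # Leave the next list number open for the model to continue.
    prompt += f"{len(seed_topics) + 1}."
    return prompt


def generate_topics(instruction_type, complete, generation_epochs=3,
                    num_seeds=3, rng=None):
    """Grow a topic pool over several rounds, seeding each round from earlier ones.

    `complete` is a hypothetical callable: prompt string -> list of topic strings.
    """
    rng = rng or random.Random(0)
    topics = []
    for _ in range(generation_epochs):
        # Sample seeds from everything produced in previous rounds
        # (the first round has no seeds).
        seeds = rng.sample(topics, min(num_seeds, len(topics)))
        prompt = build_prompt(instruction_type, seeds)
        for topic in complete(prompt):
            if topic not in topics:  # simple exact-match deduplication
                topics.append(topic)
    return topics
```

In this sketch the first round runs with no seeds, so the prompt matches the one quoted above; later rounds differ only in the sampled examples appended to the list. Any filtering of the topics would happen as a separate step afterwards.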

Harry-mic commented 10 months ago

Oh, I get it! Thanks!