e-p-armstrong / augmentoolkit

Convert Compute And Books Into Instruct-Tuning Datasets! Makes: QA, RP, Classifiers.
MIT License
981 stars 135 forks

Generating more questions per chunk #70

Closed johnr14 closed 1 hour ago

johnr14 commented 1 day ago

Hi, I am having trouble getting my head around the code...

I find that 4 questions per chunk can miss important points in high-quality, condensed academic content.

Suppose I would like to generate more questions per chunk by specifying a second prompt file, such as qatuples_gen_specialized.yaml, and having a variable that either re-runs the generation or swaps a {special_instruction} placeholder in the prompt. How should that be done?

The Q-A pairs would have to be de-duplicated later.
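For the de-duplication, something like the following might be enough as a first pass. This is only a sketch: it assumes each Q-A pair is a dict with "question" and "answer" keys (the "answer" key and both helper names are my assumptions, not augmentoolkit code), and it only catches duplicates that differ in case, punctuation, or spacing, not paraphrases.

```python
import re


def normalize_question(question: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so that
    near-identical questions compare equal."""
    text = re.sub(r"[^\w\s]", "", question.lower())
    return " ".join(text.split())


def dedupe_qa_pairs(qa_pairs: list[dict]) -> list[dict]:
    """Keep the first Q-A pair seen for each normalized question."""
    seen = set()
    unique = []
    for pair in qa_pairs:
        key = normalize_question(pair["question"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


# Hypothetical sample data, only to show the behaviour.
pairs = [
    {"question": "What is entropy?", "answer": "A measure of disorder."},
    {"question": "what is  entropy", "answer": "Disorder, roughly."},
    {"question": "What is enthalpy?", "answer": "Heat content."},
]
unique_pairs = dedupe_qa_pairs(pairs)
print(len(unique_pairs))
```

Catching reworded duplicates would need something stronger, such as embedding similarity, which is outside this sketch.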

The other way around would be to have the LLM generate a list of all key points, main ideas, explanations, descriptions, or other relevant information in the text, then generate questions that can be answered by those points. That's how I did it in the past.
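Roughly, that two-stage approach could be sketched like this. The `llm` argument is a hypothetical callable that takes a prompt string and returns the completion text; the prompt wording and function names are placeholders of mine, and `fake_llm` only exists so the sketch runs without a model.

```python
from typing import Callable


def extract_key_points(text: str, llm: Callable[[str], str]) -> list[str]:
    """Stage 1: ask the model for key points, one per line, and parse them."""
    prompt = "List the key points of the following text, one per line:\n\n" + text
    lines = llm(prompt).splitlines()
    # Strip leading bullet characters and skip empty lines.
    return [line.lstrip("-* ").strip() for line in lines if line.strip()]


def questions_from_key_points(points: list[str], llm: Callable[[str], str]) -> list[dict]:
    """Stage 2: ask for one question per key point."""
    results = []
    for point in points:
        prompt = "Write one question that is answered by this point:\n" + point
        results.append({"key_point": point, "question": llm(prompt).strip()})
    return results


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, returning canned text."""
    if prompt.startswith("List the key points"):
        return "- Water boils at 100 C at sea level\n- Boiling point drops with altitude\n"
    return "What does this point explain?"


points = extract_key_points("Some source paragraph.", fake_llm)
questions = questions_from_key_points(points, fake_llm)
print(points)
```

Keeping the key point next to each question makes it possible to check later whether every point ended up covered.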

So should I go and create a tasks2 list in processing.py like this? But then how do I pass qatuples_gen_specialized.yaml?

    # Hypothetical second pass: one question-generation task per worthy paragraph
    tasks2 = [
        steps.generate_qadicts_from_para(
            idx,
            para,
            engine_wrapper_large=engine_wrapper_large,
            generated_qa_dicts=generated_qa_dicts,
        )
        for idx, para in enumerate(filtered_worthy_for_questions)
    ]
    # Wrap the new task list (tasks2, not tasks) with the concurrency limiter
    limited_tasks_qgen = [run_task_with_limit(task) for task in tasks2]
    for future in tqdmasyncio.tqdm.as_completed(limited_tasks_qgen):
        await future

Or should I create a duplicate class in steps.py?

class QuestionGenerationStepMore(PipelineStep): # like before, but with the new system. Override the read and save.
    def __init__(self):
        super().__init__(
            prompt_folder=PROMPTS_DIR,
            default_prompt_folder=DEFAULT_PROMPTS,
            prompt_path=prompt_path_qatuples_gen_specialized,
            regex=qatuples_gen_regex,
            sampling_params={
                "max_tokens": 2000,
                "stop": [
                    "### Response",
                    "\n\n\n\n\n",
                    "</s>",
                    "# Input:",
                    "[INST]",
                    "### Instruction",
                    "[INST",
                    "<|eot_id|>",
                    "<|start_header_id|>",
                    "<|end_header_id|>",
                ],
                "temperature": 0.8,
                # top_k=-1,
                "top_p": 1,
                # min_p=0.5,
            },
            output_dir=OUTPUT_DIR,
            output_subdir="question_generation_generations",
            output_processor=extract_questions_from_response,
            use_stop=USE_STOP,
            intermediate_output_path="question_generation_generations",
            completion_mode=COMPLETION_MODE,
            save_path="raw_qatuples_saved",
            result_key="not_used",
        )
        ....

But then I would have to duplicate the class for each new prompt? And would running it twice give more Q-A pairs? I would like to run it once with a special instruction, like one from this list:

* Focus on what you identified during your analysis of the text as important and:
    * Prioritize questions in one of the domains of knowledge that was identified in the text.
    * Prioritize questions related to concepts presented in the text.
    * Prioritize questions that are related to technical terms or jargon that are presented in the text.
    * Prioritize questions that are essential to help understand the overall message the author aims to convey.
    * Prioritize questions that can be answered by the text's essence and its main subjects and key points.
    * Prioritize questions in any specialized or advanced field of study like science, mathematics, chemistry, pharmaceutical, management ...
    * Prioritize questions that help to understand more about something.
    * Prioritize questions that can explain something.
    * Prioritize questions that are useful in expanding one's knowledge about the world or a field of study or lessons about life.
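To make the idea concrete, here is a minimal sketch of swapping the special instruction into a prompt without duplicating anything. It assumes the prompt text contains a literal {special_instruction} placeholder next to the existing {paragraph} one; the template and `build_prompts` are illustrative and not how augmentoolkit loads its YAML prompts.

```python
# Two of the instructions from the list above; any subset could be used.
SPECIAL_INSTRUCTIONS = [
    "Prioritize questions related to concepts presented in the text.",
    "Prioritize questions that are related to technical terms or jargon "
    "that are presented in the text.",
]

# Hypothetical, shortened prompt template with both placeholders.
PROMPT_TEMPLATE = (
    "You must:\n"
    "* Create detailed educational questions based on some information.\n"
    "* {special_instruction}\n\n"
    "Text to make questions from:\n"
    "{paragraph}\n"
)


def build_prompts(template: str, paragraph: str, instructions: list[str]) -> list[str]:
    """Return one filled-in prompt per special instruction.

    str.replace is used instead of str.format so that any other braces
    in the template are left untouched.
    """
    prompts = []
    for instruction in instructions:
        prompt = template.replace("{special_instruction}", instruction)
        prompts.append(prompt.replace("{paragraph}", paragraph))
    return prompts


prompts = build_prompts(PROMPT_TEMPLATE, "Some chunk of source text.", SPECIAL_INSTRUCTIONS)
print(len(prompts))
```

Each chunk would then be sent once per instruction, which is also why the resulting Q-A pairs would need de-duplicating afterwards.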

Thanks

johnr14 commented 1 day ago

I was looking at a test run, and some questions will have to be dropped because they are unrelated to the knowledge I want to extract from a PDF. I am using cat master_list.jsonl | jq | grep \"question\" | sort to get the list of questions, and I find some like:

Are there any career options available for individuals interested in working in ... # publicity
How do the chances of winning the contest depend on the number of eligible entries received? # publicity for contest ?
Who designs the layout and cover of [...] # copyright ?
Who typically applies the 10th product (described in the tenth bullet point)? # what what ??
Why might reproducing content without permission be problematic? # not copyright again ?
How many pages long is the book [...] # like that's more important than what it's about...
Is the fifth product suitable for clients # what again ?
What happens after the demands and internal references are reviewed by [...] # study information ?

But I got some nice ones, like:

Can you summarize a study by [...]
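The jq/grep pipeline I used to list the questions could also be written in Python, which makes later filtering easier. This is a sketch that assumes only that the questions are stored as strings under a "question" key somewhere in each JSON line; I have not relied on any other detail of the master_list.jsonl layout, and the sample lines below are made up.

```python
import json


def collect_questions(obj) -> list[str]:
    """Recursively gather every string stored under a "question" key."""
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "question" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_questions(value))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(collect_questions(item))
    return found


def sorted_questions(jsonl_lines) -> list[str]:
    """Parse an iterable of JSONL lines and return all questions, sorted."""
    questions = []
    for line in jsonl_lines:
        if line.strip():
            questions.extend(collect_questions(json.loads(line)))
    return sorted(questions)


# Made-up sample lines, one flat and one nested.
sample_lines = [
    '{"question": "Why is the sky blue?", "answer": "Rayleigh scattering."}',
    '{"conversations": [{"question": "How do magnets work?", "answer": "Aligned domains."}]}',
]
listed = sorted_questions(sample_lines)
print(listed)
```

For the real file, `sorted_questions(open("master_list.jsonl", encoding="utf-8"))` would do the same job as the shell pipeline.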

johnr14 commented 1 day ago

I rewrote the qatuples_gen_no_filenames.yaml prompt and added 10 Q-A templates as well as more detailed guidance on which information should be prioritized. The normal run left 356 valid questions out of 556 generated (with a few bad ones like those above). The new prompt generated 921 questions, of which 505 remained after validation.

So for a single PDF, I went from 356 to 505 questions just by asking for 10 questions per chunk.
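Putting the numbers from the two runs side by side: the new prompt has a lower pass rate, but the total of valid questions still goes up. A quick calculation using only the counts above (the function name is just for illustration):

```python
def pass_rate(valid: int, generated: int) -> float:
    """Fraction of generated questions that survived validation."""
    return valid / generated


baseline_rate = pass_rate(356, 556)  # original prompt, 4 questions per chunk
new_rate = pass_rate(505, 921)       # rewritten prompt, 10 questions per chunk
gain = 505 / 356                     # ratio of valid questions, new vs. original

print(f"baseline pass rate: {baseline_rate:.1%}")  # 64.0%
print(f"new pass rate:      {new_rate:.1%}")       # 54.8%
print(f"valid questions:    {gain:.2f}x")          # 1.42x
```

So roughly 42% more valid questions, at the cost of generating and validating about 66% more candidates (921 versus 556).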

Browsing the raw questions, it's a major upgrade in quality!

I think there are still major gains available from reworking the pipeline and prompts, so much so that a re-run of your army LLM might see a major uplift!

I still got some out-of-context questions, like:

What is the ISSN number of the journal?
What annual budget does [...] allocate between repaying her mortgage [...]
What return is used in the simulations presented for the three scenarios of acquiring a first home?
What [...] event took place at [...]
What is the preferred contraceptive method chosen by [FIRST NAME] ?
What is the unique aspect of the work-life balance offered by the city of [...]
What are the benefits offered to employees by [...]
What are the choices for the winning prize?

They may be somewhat informative, but they are unrelated to what I want to train the LLM on and come from publicity within the source material.

Looking at together.ai/pricing, training an 8B LLM for 1 epoch on 1000 tokens costs $5. For a single PDF, having poor questions is not so bad, but for a few years' worth of magazine PDFs, it could lead to significant wasted cost. That's why I think Q-A grouping and summarization should also be implemented.

I did have one \n\n**ANSWERS:** that was not parsed correctly and ended up at the end of the previous question. I also have 5 occurrences of "According to the text," that were generated and should not exist. As for the rest, I'm very impressed by my new results.
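Both of these leftovers are easy to flag automatically after generation. A small sketch, assuming each Q-A pair is a dict with "question" and "answer" string fields (an assumption on my part, as are the function name and the exact patterns):

```python
import re

# Phrases that reveal the answer was written with the source text in view.
TEXT_REFERENCE = re.compile(r"\baccording to the text\b", re.IGNORECASE)
# Format markers that should have been consumed by the parser.
STRAY_MARKER = re.compile(r"\*\*(QUESTION|ANSWER)S?:\*\*")


def find_problems(pair: dict) -> list[str]:
    """Return a description of each problem found in one Q-A pair."""
    problems = []
    for field in ("question", "answer"):
        value = pair.get(field, "")
        if TEXT_REFERENCE.search(value):
            problems.append(f"{field} refers to the text")
        if STRAY_MARKER.search(value):
            problems.append(f"{field} contains a stray marker")
    return problems


# Made-up examples of one bad and one clean pair.
bad_pair = {
    "question": "What is X?\n\n**ANSWERS:**",
    "answer": "According to the text, X is Y.",
}
clean_pair = {"question": "What is X?", "answer": "X is Y."}

print(find_problems(bad_pair))
print(find_problems(clean_pair))
```

Pairs with a non-empty result could be dropped or sent back for regeneration before training.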

Can't wait to have multiple prompts per chunk to extract all pertinent information, or to use information extraction before generating questions...

Here, let me share my prompt:

Content of qatuples_gen_no_filenames.yaml:

```
- role: system
  content: |
    You are creating a logically-consistent series of questions about different domains, based on provided information. Given some information about something specific (it could be anything, from a README to a book excerpt to sales copy) you will create suitable questions based on the text, and *only* based on the text. You are focusing on understanding, application, analysis, and synthesis of ideas (cognitive levels). The questions will test comprehension of real information that would be worthy to teach in order for people to understand more about the specific material. The questions you create will lean towards longer, more difficult questions that require some thought to solve — but can still be solved given the paragraphs provided. After each question, you will also write its answer. Write the questions and the answers in the same language that the text was written in.

    Analyze the text's content and identify:
    Summary: Distill the text's essence and its main subjects and key points
    Overall Message: Determine the central idea or main point the author aims to convey.
    Domain of Knowledge: Determine the subject area or field of study referenced.
    Concepts Explained: List the key concepts presented or explained.
    Technical Terms: Identify and define any specialized terminology, jargon, or technical vocabulary used.

    **You Must:**

    * Create detailed educational questions based on some information.
    * Focus on what you identified during your analysis of the text as important and:
        * Prioritize questions in one of the domains of knowledge that was identified in the text.
        * Prioritize questions related to concepts presented in the text.
        * Prioritize questions that are related to technical terms or jargon that are presented in the text.
        * Prioritize questions that are essential to help understand the overall message the author aims to convey.
        * Prioritize questions that can be answered by the text's essence and its main subjects and key points.
        * Prioritize questions in any specialized or advanced field of study like science, mathematics, chemistry, pharmaceutical, management ...
        * Prioritize questions that help to understand more about something.
        * Prioritize questions that can explain something.
        * Prioritize questions that are useful in expanding one's knowledge about the world or a field of study or lessons about life.
    * Do not mention the text, or any other reference, in either the questions or answers. Just ask about the facts or information itself.
    * Create as many or as few questions as are needed to adequately cover the material in the snippet of the site.
    * Ensure a logical sequence of questions that build on each other.
    * Keep in mind the timestamp of any solution (some provided information may be out of date). The current year is: 2024.
    * Use markdown formatting (such as code blocks and snippets) in your questions if applicable.
    * Focus on important information that is relevant to understanding the subject. So you may ignore arbitrary metadata such as authors or timestamps -- do not make those the subjects of questions.
    * Keep in mind that the provided text snippet may cut off abruptly. Work around this.
    * Include the information from the given paragraph alongside a simpler explanation of some underlying concepts, if possible.
    * IF INFORMATION IN THE TEXT/PROVIDED INFORMATION CONTRADICTS WHAT YOU KNOW, FAVOR THE TEXT.
    * The answerer should take on any opinions of the author. If it would make sense to ask the author's opinion, the question should ask what the answerer ("you") thinks.

    **Note:**

    * Documents may be out of date, and technology described as being in development has likely been released already. THEREFORE, BE AMBIGUOUS ABOUT RELEASES, using language like "This technology will do XYZ" or by focusing on what the tech "can" do, rather than what it "does" or "will" do.
    * You will always take a positive opinion of the provided information and try to promote it through education.
    * Do NOT provide anchor links to content in your answers; since the provided text to make questions from is from a website, those links will be broken when used outside of said website. So you should NEVER have any content like [some text](#!/some/path) in your answer! External links should not be given either. NO LINKS.

    The sequence of the questions matters. They should build on each other. While questions should build on each other, they still MUST make sense if read by themselves, without any reference materials on hand.

    Do not explicitly mention the paragraphs in the questions themselves — just ask about the concepts related to the questions. BE CAREFUL NOT TO ASK QUESTIONS ABOUT THINGS THAT DO NOT APPEAR IN THE TEXT.

    You will not mention the text explicitly in any questions you think of, since the questions you generate are intended to test people's knowledge of the information — when given the questions, they WILL NOT HAVE THE TEXT ON HAND, and so if you mention the author they won't have a clue what you're talking about.

    Write the questions and answers in the same language that the text is in.

    You must strictly adhere to the format:

    **QUESTION:** Write the first question here as a single paragraph.
    **ANSWER:** Write the first answer here as a single paragraph.

    **QUESTION:** Write the second question here as a single paragraph.
    **ANSWER:** Write the second answer here as a single paragraph.

    **QUESTION:** Write the third question here as a single paragraph.
    **ANSWER:** Write the third answer here as a single paragraph.

    **QUESTION:** Write the fourth question here as a single paragraph.
    **ANSWER:** Write the fourth answer here as a single paragraph.

    **QUESTION:** Write the fifth question here as a single paragraph.
    **ANSWER:** Write the fifth answer here as a single paragraph.

    **QUESTION:** Write the sixth question here as a single paragraph.
    **ANSWER:** Write the sixth answer here as a single paragraph.

    **QUESTION:** Write the seventh question here as a single paragraph.
    **ANSWER:** Write the seventh answer here as a single paragraph.

    **QUESTION:** Write the eighth question here as a single paragraph.
    **ANSWER:** Write the eighth answer here as a single paragraph.

    **QUESTION:** Write the ninth question here as a single paragraph.
    **ANSWER:** Write the ninth answer here as a single paragraph.

    **QUESTION:** Write the tenth question here as a single paragraph.
    **ANSWER:** Write the tenth answer here as a single paragraph.
- role: user
  content: |
    Text to make questions from:
    """
    {paragraph}
    """
    -----------
    Reminder: do not mention the text, the provided information, the paragraphs, the work, or the author. Any questions about the author should be changed to be about the answerer ("you")
```

johnr14 commented 1 hour ago

Closing. I'm building my own pipeline, as I had a hard time getting my head around the original.