e-p-armstrong / augmentoolkit

Convert Compute And Books Into Instruct-Tuning Datasets! Makes: QA, RP, Classifiers.
MIT License

YAML prompts and validation #69

Closed johnr14 closed 3 weeks ago

johnr14 commented 3 weeks ago

Hi, I was going to write an app from scratch to do what you already did, so I am trying out your great app. EDIT: Sorry for the wall of text; it's mostly ideas and where I'm going with your app, which may be helpful for you or others.

I find it cumbersome that code in steps.py needs to be modified for a specific use case... Also, I've seen many projects use JSON for structured output, which could simplify the prompting, save tokens, and allow directives in the .yaml on how to process the resulting JSON without having to touch the Python code. Not sure how widely JSON output is supported by LLMs...

I was thinking that having a single processing function that takes both the prompts and the validation from a config file would be much better.

So for the YAML file:

That way, prompts could all be processed with a single pipeline and:

This kind of pipeline could be possible:

    prompt_domain_knowledge.yaml
        -> fast_prompt_general_domain_related_questions.yaml -> relevant_check.yaml
               \> multi_shot_complex_prompt_on_high_fail.yaml (on a high fail rate)
        -> specific_domain_questions_maker.yaml
        -> expert_question_maker_70b.yaml
        -> unanswered_questions_tool_fetch.yaml
               ... -> validator.yaml -> fast_validation.yaml
                                     /> question_difficulty.yaml
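
A rough sketch of what a generic, config-driven runner for such a chain could look like (the messages/validation keys, the file layout, and the run_llm placeholder are my assumptions, not augmentoolkit's actual API):

    # Hypothetical config-driven pipeline step: prompt + validation both live in the
    # YAML, so nothing in steps.py has to change per use case.
    import yaml

    def run_llm(messages):
        # placeholder for whatever backend is in use (ollama, aphrodite, openai, ...)
        raise NotImplementedError

    def run_step(config_path, variables):
        with open(config_path) as f:
            step = yaml.safe_load(f)
        # substitute stored variables like {text} into each message
        messages = [
            {"role": m["role"], "content": m["content"].format(**variables)}
            for m in step["messages"]
        ]
        output = run_llm(messages)
        # a validation block in the same YAML decides pass/fail
        required = step.get("validation", {}).get("must_contain", [])
        passed = all(keyword in output for keyword in required)
        return passed, output

    def run_pipeline(step_paths, variables):
        output = None
        for path in step_paths:
            passed, output = run_step(path, variables)
            if not passed:
                return False, output
            variables["previous_output"] = output
        return True, output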

I was looking at your prompts, and they are quite long. I have Ollama stop responding once in a while, and I must kill run_augmentoolkit.py to restart the run. EDIT: Seems better after raising the context from 8k to 16k.

I have had success with short prompts, and that could save lots of tokens. Maybe some questions would miss, but if we get a high rate of rejection, we could switch to question-generation prompts that are more advanced/cost more. Also, I am thinking of specifying the knowledge domain of the questions from a list, so I don't get questions like "Which French book focuses on rethinking the concept of" that are just a waste of tokens.
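
A tiny sketch of that escalate-on-rejection idea, reusing the file names from the pipeline above (the threshold is only illustrative):

    # Hypothetical escalation: start with a cheap/short prompt and only switch to
    # the longer multi-shot prompt once the rejection rate gets too high.
    def pick_question_prompt(rejected, total, threshold=0.4):
        if total and rejected / total > threshold:
            return "multi_shot_complex_prompt_on_high_fail.yaml"
        return "fast_prompt_general_domain_related_questions.yaml"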

Sidetrack notes: Also thinking of some pre-/post-processing using a MICRO LLM (2-3B) for a few checks, like validating that it's knowledge I want to extract. A LARGE LLM could also be used to generate a summary of a specific idea in the text, which would be trained using questions like "Give me a general explanation about ..." and "What does "this concept" relate to, and how could you explain it to me in a {simple|expert} way?". Then mix some (most? random?) related concepts together and have a 405B try to make sense of them in a big, coherent way, to explain expert knowledge so non-experts can learn it or ask questions about what they don't understand... like good teaching material. It would require some way to accumulate related knowledge; add a field to the dataset? (not there yet). End sidetrack notes

Anyway, what got me here is that I got a false reject using dolphin 8b:

## Final Judgment on Answer Relevance:
#### Relevance Assessment: The answer is highly relevant to the question asked.
#### Explanation of Judgment: The answer accurately summarizes Karine's completed certificate and her current pursuit, making it a direct and accurate response to the question.
Answer relevancy validation failed! Tossing

EDIT: Found that it's in def parse_answer_relevancy_validation_step(thought_process). While processing some data, I grep the Explanation of Judgment and look for keywords to add, like:

    elif (
        "relevant" in determination
        or "Relevant" in determination
        or "answer provides" in determination  # new
        or "provides a clear" in determination  # new
        or "comprehensive explanation" in determination  # new
        or "no reason to consider it irrelevant" in determination  # new... waiting for a longer run to get logs for more
    ):

Will change from dolphin to hermes 8b. EDIT: hermes does a much better job and appends Relevant or Irrelevant like 90%+ of the time, while dolphin was around 25%!

I think that instead of looking for all sorts of keywords, a MICRO LLM could parse the Relevance Assessment and Explanation of Judgment and return JSON with "Assessment": "True" or "Assessment": "False" to prevent bad identification. That would be cheap, as it's only a few tokens at a low price on such a small LLM; it could even run locally with 4 GB of VRAM...
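
For example, with a small local model behind Ollama's JSON mode (the model tag and the prompt wording here are just guesses):

    # Hypothetical micro-LLM judge: feed it only the "Relevance Assessment" and
    # "Explanation of Judgment" lines and force a tiny JSON verdict back.
    import json
    import requests

    def micro_llm_assessment(judgment_text, model="qwen2.5:3b"):
        prompt = (
            "Read this relevance judgment and answer ONLY with JSON of the form "
            '{"assessment": true} or {"assessment": false}.\n\n' + judgment_text
        )
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "format": "json", "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        return bool(json.loads(resp.json()["response"]).get("assessment", False))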

I was looking to fix this, then started thinking about where the best place for it would be; having it directly in the YAML seems like the best way.

EDIT: The faults are related to the LLM. Currently using hermes/dolphin/mistral/qwen in different quants to play around with a few PDFs. Having the assessment validated by a micro LLM seems like a good approach that could mitigate LLMs that don't follow the directive to append Relevant or Irrelevant.

So while thinking about it, some brainstorming:

SOME MAJOR SIDETRACK: prompts could be ordered by a numeric prefix, e.g. 01_check..., 02_get...

Prompts could be sequential, meaning a prompt requires a certain prior prompt to have passed before it runs. So prompts should be able to set variables...

    - role: store
      store_variablename: "text"
    # example
    - role: store
      store_expertise: "medical knowledge"
    - role: store
      store_expertise_level: ["Specialized knowledge", "Advanced knowledge", "Domain expertise", "Cutting-edge knowledge"]
    - role: store
      store_credibility: ["academic paper", "endorsed by medical association", "reviewed by peers"]

and another prompt could:

    - role: require
      store_expertise: "medical knowledge"
    - role: system
      content: |
        ...
    - role: user
      content: |
        Text: """{text}"""
        Question: ....
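
A rough sketch of how a runner could interpret these extra store/require roles before sending the remaining messages to the model (purely hypothetical, not existing augmentoolkit behaviour):

    # Hypothetical handling of "store" / "require" roles: stores write into a shared
    # context dict, requires skip the prompt if a stored value doesn't match.
    def apply_meta_roles(messages, context):
        llm_messages = []
        for msg in messages:
            role = msg["role"]
            if role == "store":
                # every key other than "role" becomes a stored variable
                context.update({k: v for k, v in msg.items() if k != "role"})
            elif role == "require":
                for key, expected in msg.items():
                    if key != "role" and context.get(key) != expected:
                        return None  # requirement not met: skip this prompt
            else:
                # normal system/user messages, with stored variables filled in
                llm_messages.append(
                    {"role": role, "content": msg["content"].format(**context)}
                )
        return llm_messages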

Some quick examples that would need more work:

    - role: require
      store_expertise_level: "Specialized knowledge"
    - role: system
      content: |
        ...
    - role: user
      content: |
        Text: """{text}"""
        Task: Define the credibility of this text and why it would be credible. Answer as a bullet list with short sentences, using the format:
        """
        CREDIBILITY_LEVEL:
            - level of credibility (possible answers = high, medium, low)
        CREDIBILITY_REASONS:
            - First reason of credibility
            - Second reason of credibility
            - Third reason of credibility
        """
    - role: store
      store_credibility_level: "high"
      store_credibility_reasons: ["academic paper", "endorsed by medical association", "reviewed by peers"]
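
If the model actually follows that output format, pulling the values back into store_* variables could be something as simple as this (again just a sketch):

    # Hypothetical parser for the CREDIBILITY_LEVEL / CREDIBILITY_REASONS format above.
    def parse_credibility(output):
        level, reasons, current = None, [], None
        for line in output.splitlines():
            stripped = line.strip()
            if stripped.startswith("CREDIBILITY_LEVEL"):
                current = "level"
            elif stripped.startswith("CREDIBILITY_REASONS"):
                current = "reasons"
            elif stripped.startswith("-"):
                value = stripped.lstrip("- ").strip()
                if current == "level":
                    level = value
                elif current == "reasons":
                    reasons.append(value)
        return {"store_credibility_level": level, "store_credibility_reasons": reasons}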

This kind of thing would make it more flexible. Whether YAML should be used that way, I'm not sure; it may not be the best solution.

I hope this helps.

P.S. My idea is to gather data and concepts from a similar knowledge domain across multiple publications (using vector RAG?) to try to generate coherent, valid, deeper knowledge on a subject from publications/notes/books that may not be available online... That knowledge must be explainable, hence it must be able to cite the source material, author...

Ok, back to debugging to fix my "Answer relevancy validation failed!" tests.

johnr14 commented 3 weeks ago

While YAML "scripting" of the pipeline may still be a goal, I think I figured out how to do it. Will open this issue again if needed, or submit the code as a pull request if I get it to work.

Also, using JSON is so much better: I can have a JSON template as well as JSON with values for fail or pass, and support multiple languages by changing the JSON file or translating it on the fly.
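
For instance, the pass/fail markers could live in a per-language JSON file instead of being hard-coded (the file layout here is hypothetical):

    # Hypothetical per-language validation terms, e.g. validation_en.json:
    # {"pass": ["Relevant"], "fail": ["Irrelevant"]}
    import json

    def load_validation_terms(language="en"):
        with open(f"validation_{language}.json") as f:
            return json.load(f)

    def judge(output, terms):
        if any(word in output for word in terms["fail"]):
            return False
        if any(word in output for word in terms["pass"]):
            return True
        return None  # undecided: maybe escalate to a micro-LLM judge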