e-p-armstrong / augmentoolkit

Convert Compute And Books Into Instruct-Tuning Datasets! Makes: QA, RP, Classifiers.
MIT License

First input file is the only one processed #61

Open mjh624 opened 2 weeks ago

mjh624 commented 2 weeks ago

Our input folder contains 11 files. All appear to be read in:

Successfully read file: ./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/innovationqkb.WordPress.2024-07-26.xml.md
JSON file saved successfully.
Successfully read file: ./input/ipcomkb.WordPress.2024-07-26.xml.md
JSON file saved successfully.
Successfully read file: ./input/iqideaskb.WordPress.2024-07-26.xml.md
JSON file saved successfully.
Successfully read file: ./input/ipcomkb.faq.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/priorartdatabasekb.WordPress.2024-07-26.xml.md
JSON file saved successfully.
Successfully read file: ./input/iqideaskb.faq.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/innovationqkb.faq.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/priorartdatabasekb.glossary.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/iqideaskb.glossary.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/priorartdatabasekb.faq.WordPress.2024-07-27.xml.md
Pretraining set created.

However, question/answer pairs are produced only for the first file: ./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md

The augmentoolkit output messages do not appear to give an indication as to whether there is an issue.

COMPLETED PHASE 0
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/a0db9260-500e>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/a7bc5e7d-950c>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/bd42e735-6bba>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/c6ddcde3-8678>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/15a96815-222f>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/91488b07-8c85>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/f8913bd2-afb5>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/8ef1f903-2906>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/ee46ce5b-8461>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/86b05937-c2a0>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/1b6b185c-b3fd>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/0adb9229-3210>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/d51cf6a9-1745>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/921636fe-4076>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/456d0805-b8c7>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/f09a3fd9-3fcf>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/bf45b67b-81da>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/01b56723-abdf>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/b89ac365-1286>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/1f0d3583-f9e7>
COMPLETED PHASE 1

Each file written in phase 1 appears to correspond to questions/answers related to a paragraph in the document: ./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md
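Since the log mixes "Output written to" and "FAILED TO GENERATE QUESTIONS!" messages, a quick tally can show how many paragraphs reached question generation and how many failed. The following is only a minimal sketch: it assumes the console output has been captured as a string, and the function name `tally_phase_log` and the sample paths are made up for illustration, not part of augmentoolkit.

```python
import re
from collections import Counter


def tally_phase_log(log_text: str) -> Counter:
    """Count 'Output written to' and 'FAILED TO GENERATE QUESTIONS!' messages."""
    counts = Counter()
    counts["written"] = len(re.findall(r"Output written to ", log_text))
    counts["failed"] = len(re.findall(r"FAILED TO GENERATE QUESTIONS!", log_text))
    return counts


# Hypothetical sample in the same shape as the log above (paths are invented).
sample = (
    "COMPLETED PHASE 0 "
    "Output written to /tmp/example/a> "
    "Output written to /tmp/example/b> FAILED TO GENERATE QUESTIONS! "
    "COMPLETED PHASE 1"
)

if __name__ == "__main__":
    print(tally_phase_log(sample))
```

Counted by hand, the log quoted above appears to contain 20 "Output written to" lines and 7 failures, so roughly 20 paragraphs seem to have reached question generation in this run.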

What are some possible reasons that files in the input folder are skipped?

mjh624 commented 2 weeks ago

A check of the metadata field in the qatuples_filtered folder shows that only the first file led to question/answer pairs:

grep -r metadata qatuples_filtered qatuples_filtered/para_6_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_7_q_5.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_0_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_19_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_18_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_19_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_8_q_6.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_6_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_4_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_15_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_3_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_12_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_12_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_6_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_7_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_8_q_5.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_3_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_2_q_0.json: "metadata": 
"./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_8_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_7_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_18_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_3_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_8_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_1_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_0_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_6_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_9_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_19_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_18_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_4_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_12_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_8_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_9_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_7_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_1_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_3_q_4.json: "metadata": 
"./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_0_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_15_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_7_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_19_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_9_q_6.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_2_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_2_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_18_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_2_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_8_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_15_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_8_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_1_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_7_q_6.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_9_q_5.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_9_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_9_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_9_q_2.json: "metadata": 
"./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_1_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_6_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_0_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_1_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_7_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_0_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_3_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_4_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_15_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
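The same check can be summarized per source file instead of scanning the grep listing by eye. This is a rough sketch, not part of augmentoolkit: it assumes each `.json` file in `qatuples_filtered` is a JSON object with a top-level `"metadata"` string, as the grep output suggests, and the helper name `count_pairs_by_source` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_pairs_by_source(folder: str) -> Counter:
    """Count QA JSON files per source document, keyed by their 'metadata' value."""
    counts = Counter()
    for path in Path(folder).glob("*.json"):
        try:
            with path.open(encoding="utf-8") as fh:
                record = json.load(fh)
        except (OSError, json.JSONDecodeError):
            # Keep going if a file is unreadable or not valid JSON.
            counts["<unreadable>"] += 1
            continue
        # Assumption: 'metadata' is a top-level key holding the input file path.
        if isinstance(record, dict):
            counts[record.get("metadata", "<missing>")] += 1
        else:
            counts["<missing>"] += 1
    return counts


if __name__ == "__main__":
    for source, n in count_pairs_by_source("qatuples_filtered").most_common():
        print(f"{n:5d}  {source}")
```

If only one source path shows up in the result, that would confirm the observation above that every surviving question/answer pair comes from the first input file.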

e-p-armstrong commented 2 weeks ago

Hmm that's strange. I'm not able to reproduce this using the toy example the repo starts with, so that leaves a few possibilities:

1) we're running into an edge case with the code that isn't triggered with the three default input files
2) somehow everything sourced from the other files is failing validation and never gets to question generation
3) All questions made from those files get dropped for some reason
4) something else

Would you be against sharing your input files and maybe your config so I can try to repro it on my end, or is that stuff confidential?

mjh624 commented 1 week ago

Here is the config file:

API:
  API_KEY: xxxx
  BASE_URL: http://localhost:11434/v1
  LARGE_LOGICAL_MODEL: llama3.1:70b
  LOGICAL_MODEL: llama3.1:70b
HUGGINGFACE:
  HUB_PATH: < our info here >
  PRIVATE: False
  PUSH_TO_HUB: False
PATH:
  DEFAULT_PROMPTS: ./prompts
  INPUT: ./input
  OUTPUT: ./output
  PROMPTS: ./prompts
PHASE:
  PHASE_INDEX: 3
  WORK_IN_PHASES: False
SKIP:
  ANSWER_RELEVANCY_CHECK: False
  FILTER_CHUNKS: True
  QUESTION_CHECK: False
SYSTEM:
  CHUNK_SIZE: 1900
  COMPLETION_MODE: False
  CONCURRENCY_LIMIT: 3
  CONVERSATION_INSTRUCTIONS: For this conversation, you are generating a chat between a generalist, generic AI assistant, and a human.
  DOUBLE_CHECK_COUNTER: 1
  DO_NOT_USE_SYSTEM_PROMPTS: True
  FINAL_ASSISTANT_PROMPT_NO_RAG: 'You are a helpful AI assistant.

    '
  FINAL_ASSISTANT_PROMPT_RAG: 'You are a helpful AI assistant.

    Context information is below:

    ----------------------

    {data}

    '
  MODE: api
  STOP: True
  SUBSET_SIZE: 15
  USE_FILENAMES: False
  USE_SUBSET: False

Unfortunately, I cannot share our input files.

I found the config file that was used to process army training manuals:

https://github.com/e-p-armstrong/augmentoolkit/blob/master/original/config_overrides/army_model/config.yaml

I modified that config to use our input files and model, and it appears that most, if not all, of the input files are now being processed. However, processing started on 10/1/2024 and is still running after 4 days. I would like to understand which settings may have allowed the other files to be processed, and why it is taking so much longer.
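One way to narrow down which settings changed between the two configs is to diff them key by key. The sketch below assumes both files have already been loaded into nested dictionaries; the names `flatten` and `diff_configs` are hypothetical helpers, and the example values for the second config are invented for illustration rather than taken from the army_model config.

```python
from typing import Any, Dict, Tuple

ABSENT = "<absent>"  # Marker for a key that exists in only one config.


def flatten(config: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested sections into dotted keys, e.g. {'SKIP': {'X': 1}} -> {'SKIP.X': 1}."""
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


def diff_configs(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {dotted_key: (old_value, new_value)} for every setting that differs."""
    a, b = flatten(old), flatten(new)
    return {
        key: (a.get(key, ABSENT), b.get(key, ABSENT))
        for key in sorted(set(a) | set(b))
        if a.get(key, ABSENT) != b.get(key, ABSENT)
    }


# 'original' mirrors a few values from the config posted above;
# 'modified' is a made-up variant, NOT the real army_model config.
original = {"SKIP": {"FILTER_CHUNKS": True}, "SYSTEM": {"CHUNK_SIZE": 1900, "USE_SUBSET": False}}
modified = {"SKIP": {"FILTER_CHUNKS": False}, "SYSTEM": {"CHUNK_SIZE": 1900, "USE_SUBSET": False}}

if __name__ == "__main__":
    for key, (before, after) in diff_configs(original, modified).items():
        print(f"{key}: {before!r} -> {after!r}")
```

In practice the two dictionaries would come from the YAML files, for example via PyYAML's `yaml.safe_load`; that step is left out here so the sketch runs with the standard library alone.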