Open mjh624 opened 2 weeks ago
A check of the metadata field in the qatuples_filtered folder shows that only the first file led to question/answer pairs:
grep -r metadata qatuples_filtered qatuples_filtered/para_6_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_7_q_5.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_0_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_19_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_18_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_19_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_8_q_6.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_6_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_4_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_15_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_3_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_12_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_12_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_6_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_7_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_8_q_5.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_3_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_2_q_0.json: "metadata": 
"./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_8_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_7_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_18_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_3_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_8_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_1_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_0_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_6_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_9_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_19_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_18_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_4_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_12_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_8_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_9_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_7_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_1_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_3_q_4.json: "metadata": 
"./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_0_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_15_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_7_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_19_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_9_q_6.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_2_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_2_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_18_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_2_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_8_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_15_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_8_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_1_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_7_q_6.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_9_q_5.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_9_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_9_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_9_q_2.json: "metadata": 
"./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_1_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_6_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_0_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_1_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_7_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_0_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_3_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_4_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md", qatuples_filtered/para_15_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
Hmm, that's strange. I'm not able to reproduce this using the toy example the repo starts with, so that leaves a few possibilities:
1) we're running into an edge case in the code that isn't triggered by the three default input files
2) somehow everything sourced from the other files is failing validation and never gets to question generation
3) all questions made from those files get dropped for some reason
4) something else
Would you be against sharing your input files and maybe your config so I can try to repro it on my end, or is that stuff confidential?
Here is the config file:
API:
  API_KEY: xxxx
  BASE_URL: http://localhost:11434/v1
  LARGE_LOGICAL_MODEL: llama3.1:70b
  LOGICAL_MODEL: llama3.1:70b
HUGGINGFACE:
  HUB_PATH: < our info here >
  PRIVATE: False
  PUSH_TO_HUB: False
PATH:
  DEFAULT_PROMPTS: ./prompts
  INPUT: ./input
  OUTPUT: ./output
  PROMPTS: ./prompts
PHASE:
  PHASE_INDEX: 3
  WORK_IN_PHASES: False
SKIP:
  ANSWER_RELEVANCY_CHECK: False
  FILTER_CHUNKS: True
  QUESTION_CHECK: False
SYSTEM:
  CHUNK_SIZE: 1900
  COMPLETION_MODE: False
  CONCURRENCY_LIMIT: 3
  CONVERSATION_INSTRUCTIONS: For this conversation, you are generating a chat between
    a generalist, generic AI assistant, and a human.
  DOUBLE_CHECK_COUNTER: 1
  DO_NOT_USE_SYSTEM_PROMPTS: True
  FINAL_ASSISTANT_PROMPT_NO_RAG: 'You are a helpful AI assistant.
    '
  FINAL_ASSISTANT_PROMPT_RAG: 'You are a helpful AI assistant.
    Context information is below:
    ----------------------
    {data}
    '
  MODE: api
  STOP: True
  SUBSET_SIZE: 15
  USE_FILENAMES: False
  USE_SUBSET: False
Unfortunately, I cannot share our input files.
I found the config file that was used to process army training manuals:
I modified that config to use our input files and model, and it appears that most, if not all, of the input files are now being processed. However, processing started on 10/1/2024 and, after 4 days, it is still running. I would like to understand which settings may have allowed the other files to be processed, and why it is taking so much longer.
Our input folder contains 11 files. All appear to be read in:
Successfully read file: ./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/innovationqkb.WordPress.2024-07-26.xml.md
JSON file saved successfully.
Successfully read file: ./input/ipcomkb.WordPress.2024-07-26.xml.md
JSON file saved successfully.
Successfully read file: ./input/iqideaskb.WordPress.2024-07-26.xml.md
JSON file saved successfully.
Successfully read file: ./input/ipcomkb.faq.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/priorartdatabasekb.WordPress.2024-07-26.xml.md
JSON file saved successfully.
Successfully read file: ./input/iqideaskb.faq.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/innovationqkb.faq.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/priorartdatabasekb.glossary.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/iqideaskb.glossary.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/priorartdatabasekb.faq.WordPress.2024-07-27.xml.md
Pretraining set created.
However, only the first file, ./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md, has question/answer pairs produced.
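For anyone wanting to check this without reading the grep output by eye, here is a minimal sketch that tallies the "metadata" values in qatuples_filtered and lists the input files that were read but never show up as a source. It is not part of augmentoolkit: the helper names (find_metadata, count_sources, files_without_pairs) are made up for illustration, the layout of the JSON files is an assumption (the script simply looks for any "metadata" key holding a string, at any depth), and it assumes the paths in the log and in the metadata field are written identically, as they are in the output above.

```python
import json
import re
from collections import Counter
from pathlib import Path

# Matches the "Successfully read file: <path>" lines printed at startup.
READ_LINE = re.compile(r"Successfully read file:\s*(\S+)")


def find_metadata(node):
    """Yield every string stored under a "metadata" key, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "metadata" and isinstance(value, str):
                yield value
            else:
                yield from find_metadata(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_metadata(item)


def count_sources(folder):
    """Count QA files per source document named in their "metadata" field."""
    counts = Counter()
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Count each source once per file, even if the key repeats inside it.
        for source in set(find_metadata(data)):
            counts[source] += 1
    return counts


def files_without_pairs(log_text, counts):
    """Return the input files that were read but have no QA file pointing at them."""
    return [path for path in READ_LINE.findall(log_text) if counts.get(path, 0) == 0]


if __name__ == "__main__":
    counts = count_sources("qatuples_filtered")
    for source, n in counts.most_common():
        print(f"{n:4d}  {source}")
    # Paste the startup log into run.log (hypothetical file name) to see the gaps.
    log_file = Path("run.log")
    if log_file.exists():
        for missing in files_without_pairs(log_file.read_text(encoding="utf-8"), counts):
            print(f"no QA pairs: {missing}")
```

Run from the output directory, this prints one line per source file with the number of QA files that reference it, which makes it easy to see whether a later run still has gaps.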
The augmentoolkit output messages do not appear to give an indication as to whether there is an issue.
COMPLETED PHASE 0
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/a0db9260-500e>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/a7bc5e7d-950c>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/bd42e735-6bba>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/c6ddcde3-8678>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/15a96815-222f>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/91488b07-8c85>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/f8913bd2-afb5>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/8ef1f903-2906>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/ee46ce5b-8461>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/86b05937-c2a0>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/1b6b185c-b3fd>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/0adb9229-3210>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/d51cf6a9-1745>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/921636fe-4076>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/456d0805-b8c7>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/f09a3fd9-3fcf>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/bf45b67b-81da>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/01b56723-abdf>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/b89ac365-1286>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/1f0d3583-f9e7>
COMPLETED PHASE 1
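Since the log is long and repetitive, a small sketch like the following can summarize it instead of scanning it by hand. It only counts the three message types visible in the excerpt above ("Output written to", "FAILED TO GENERATE QUESTIONS!" and "COMPLETED PHASE n"); the function name summarize_log is hypothetical, and the script makes no assumption about which output a given failure message belongs to. Counted by hand, the excerpt above contains 20 "Output written to" lines and 7 failure messages.

```python
import re


def summarize_log(log_text):
    """Count the message types seen in an augmentoolkit console log."""
    return {
        "outputs_written": log_text.count("Output written to"),
        "generation_failures": log_text.count("FAILED TO GENERATE QUESTIONS!"),
        "phases_completed": len(re.findall(r"COMPLETED PHASE \d+", log_text)),
    }


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py < console_output.txt
    for name, value in summarize_log(sys.stdin.read()).items():
        print(f"{name}: {value}")
```

Comparing outputs_written with the number of chunks expected from all 11 input files would show whether the later files ever reached question generation.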
Each file written in phase 1 appears to correspond to questions/answers related to a paragraph in the document: ./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md
What are some possible reasons that files in the input folder are skipped?