e-p-armstrong / augmentoolkit

Convert Compute And Books Into Instruct-Tuning Datasets! Makes: QA, RP, Classifiers.
MIT License
980 stars · 135 forks

index out of range (between phase 1 and phase 2) #51

Open mjh624 opened 1 month ago

mjh624 commented 1 month ago

I am using the latest Augmentoolkit. Phase 1 appears to complete without errors. I am including the config.yaml and the output that was written to the screen. Are there parameters that are missing?

Intermediate files are present:

```
$ ls -l ../outFiles/
total 6684
drwxr-xr-x 4 root root    4096 Sep 18 14:50 judge_paragraph_generations
-rw-r--r-- 1 root root 4776753 Sep 18 15:38 judge_paragraph_generations_DATAGEN_OUTPUT.jsonl
-rw-r--r-- 1 root root 2058739 Sep 18 14:50 pretraining.json
```

Here is the config.yaml:

```yaml
API:
  API_KEY: xxx
  BASE_URL: http:// xxx /
  LARGE_LOGICAL_MODEL: llama3.1
  LOGICAL_MODEL: llama3.1
HUGGINGFACE:
  HUB_PATH: Heralax/test-atk-dataset-do-not-use-3
  PRIVATE: False
  PUSH_TO_HUB: False
PATH:
  DEFAULT_PROMPTS: ./prompts
  INPUT: ../../trainingFiles
  OUTPUT: ../../outFiles
  PROMPTS: ./prompts
PHASE:
  PHASE_INDEX: 3
  WORK_IN_PHASES: False
SKIP:
  ANSWER_RELEVANCY_CHECK: False
  FILTER_CHUNKS: False
  QUESTION_CHECK: False
SYSTEM:
  CHUNK_SIZE: 1900
  COMPLETION_MODE: False
  CONCURRENCY_LIMIT: 10
  CONVERSATION_INSTRUCTIONS: For this conversation, you are generating a chat between a generalist, generic AI assistant, and a human.
  DOUBLE_CHECK_COUNTER: 1
  DO_NOT_USE_SYSTEM_PROMPTS: True
  FINAL_ASSISTANT_PROMPT_NO_RAG: 'You are a helpful AI assistant.

    '
  FINAL_ASSISTANT_PROMPT_RAG: 'You are a helpful AI assistant.

    Context information is below:

    ----------------------

    {data}

    '
  MODE: api
  STOP: True
  SUBSET_SIZE: 15
  USE_FILENAMES: False
  USE_SUBSET: False
```

Here is the output just before the error:

```
{'paragraph': None, 'metadata': '../../trainingFiles/iqideaskb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/innovationqkb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/innovationqkb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/iqideaskb.WordPress.2024-07-26.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/priorartdatabasekb.glossary.WordPress.2024-07-27.xml.md'}
{'paragraph': None, 'metadata': '../../trainingFiles/innovationqkb.WordPress.2024-07-26.xml.md'}
Converting generations to training data
entering saving mode
Converting ../../outFiles/judge_paragraph_generations/intermediate_generations to a dataset
...Converted successfully (we think)
Traceback (most recent call last):
  File "/tmp/augmentoolkit-master/original/processing.py", line 374, in <module>
    asyncio.run(main())
  File "/usr/local/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/tmp/augmentoolkit-master/original/processing.py", line 222, in main
    print(filtered_worthy_for_questions[0])
IndexError: list index out of range
```

e-p-armstrong commented 1 month ago

Hmm, it looks like a lot of your paragraphs are being judged as "unworthy for questions" by the paragraph judgement step -- either it thinks they're all metadata, or the model is messing up a lot for some reason. One solution might be to turn off filtering entirely by setting SKIP/FILTER_CHUNKS to True.
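For clarity, here is what that change would look like in the SKIP section of the config.yaml posted above (the other keys stay as they were):

```yaml
SKIP:
  ANSWER_RELEVANCY_CHECK: False
  FILTER_CHUNKS: True   # True means the paragraph-judgement filter is skipped
  QUESTION_CHECK: False
```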

Thanks for bringing this up btw -- in a recent push I've added a clearer error message:

```python
if len(filtered_worthy_for_questions) == 0:
    print("No paragraphs were judged worthy for questions. Either the judgement step thinks everything you added is metadata or has no factual information, or your input path is wrong, or the model is being stupid. Check your input directory path, your model, and your input data. The intermediate outputs at the end of each file in ./output/judge_paragraph_generations/intermediate_generations/ may help you diagnose the problem.")
    sys.exit(1)
```

Let me know if turning off filter chunks solves it. Also, I'd be curious to see some of the intermediate outputs in ./output/judge_paragraph_generations/intermediate_generations/, because if there is factual information in your files then it shouldn't be dropping all of them.
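If it helps, here is a small standalone helper (hypothetical, not part of Augmentoolkit) for skimming the tail of each intermediate generation file; the directory path is the one mentioned above, so adjust it to wherever your OUTPUT is actually pointing:

```python
# Hypothetical diagnostic helper: print the last few lines of every
# intermediate judgement file so you can see why paragraphs were rejected.
from pathlib import Path


def tail_files(directory, n_lines=10, pattern="*.yaml"):
    """Return {filename: last n_lines of the file} for files matching pattern."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        results[path.name] = lines[-n_lines:]
    return results


if __name__ == "__main__":
    target = "./output/judge_paragraph_generations/intermediate_generations/"
    for name, tail in tail_files(target).items():
        print(f"=== {name} ===")
        print("\n".join(tail))
```

The judgement verdict usually matters most at the end of each file, which is why only the tail is printed.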

mjh624 commented 1 month ago

I am running the most recent release of augmentoolkit in a Docker container that is running Python 3.11:

```
root@e6e8132f1ba5:/tmp/augmentoolkit# python --version
Python 3.11.10
```

I turned off filter chunks in the config.yaml in the /original folder:

```yaml
SKIP:
  ANSWER_RELEVANCY_CHECK: False
  FILTER_CHUNKS: True
  QUESTION_CHECK: False
```

augmentoolkit appears to complete phase 1 and then fails with a different "index out of range" error:

```
FAILED TO GENERATE QUESTIONS! Output written to /tmp/outFiles/question_generation_generations/question_generation_generations/6d4fd93e-3a9d-458b-bde1-5ff938bdd97b.yaml
FAILED TO GENERATE QUESTIONS! Output written to /tmp/outFiles/question_generation_generations/question_generation_generations/7f40d689-700b-4400-b215-0295b077fd90.yaml
FAILED TO GENERATE QUESTIONS! Output written to /tmp/outFiles/question_generation_generations/question_generation_generations/4a970d6c-4d1b-4569-9ceb-7f7da41645e0.yaml
COMPLETED PHASE 1
    asyncio.run(main())
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/tmp/augmentoolkit/original/processing.py", line 264, in main
    print(generated_qa_dicts[0])
IndexError: list index out of range
Augmentoolkit is starting to run! If this is your first time running this it might take a few moments to start due to imports and such.
root@e6e8132f1ba5:/tmp/augmentoolkit#
```

I do not see the directory ./output/judge_paragraph_generations/intermediate_generations/:

```
root@e6e8132f1ba5:/tmp/augmentoolkit# ls outFiles/
pretraining.json
root@e6e8132f1ba5:/tmp/augmentoolkit# ls ../outFiles/
pretraining.json                 qatuples_filtered/               question_generation_generations/
root@e6e8132f1ba5:/tmp/augmentoolkit#
```