e-p-armstrong / augmentoolkit

Convert Compute And Books Into Instruct-Tuning Datasets! Makes: QA, RP, Classifiers.
MIT License
1.03k stars, 142 forks

list Index out of range issue #77

Open mjh624 opened 1 day ago

mjh624 commented 1 day ago

I am processing a doc with augmentoolkit and encountering an exception, "list index out of range", in generation_functions/engine_wrapper_class.py:


async def submit_completion( ... ...
...
    async for chunk in stream:
        try:
            if chunk.choices[0].delta.content:
                completion = completion + chunk.choices[0].delta.content
        except Exception as e:
            print("\n\n------------CAUGHT EXCEPTION DURING GENERATION")
            print("chunk: ", chunk)
            print("completion: ", completion)
            print(e)
            timed_out = True
            print("\n\n-----/------")
...

I am printing "chunk" and see that chunk.choices is an empty array. However, I don't know why it is empty or whether that is OK.

-----/------
Output written to ./outputIpcom/check_question_generations/2f9a1af1-0383-4490-b6e3-23f1a4111a0c--subquestion--82d10b0d-3200-418e-af27-936337e88ea8--check--6203cd6a-b8e4-4d9d-afb5-0fea6f70ce69.yaml
2024-11-15 21:22:02,682 - INFO - HTTP Request: POST http://localhost:9000/v1/chat/completions "HTTP/1.1 200 OK"

------------CAUGHT EXCEPTION DURING GENERATION
chunk: ChatCompletionChunk(id='chat-67e86151a823496ab7d12db8eefe6689', choices=[], created=1731705722, model='mistralai/mistral-large', object='chat.completion.chunk', service_tier=None, system_fingerprint='fp_example', usage=None)
completion: ## Reasoning and thought process:

Text Analysis:

Identify Key Information: The text provides a step-by-step guide on how to view current and historical invoices.

Categorize Information Type: The information is procedural, outlining specific actions to be taken on a platform.

Answer Breakdown:

Dissect the Answer: The answer describes the steps to view invoices, mentioning the "Subscription" tab and the "Billing History" section.

Identify Answer Type: The statement is a procedural guide, reflecting the steps outlined in the text.

Accuracy Check:

Direct Comparison for Factual Accuracy:

Final Judgment:

Comprehensive Assessment: The answer accurately reflects the steps described in the text for viewing invoices.

Overall Accuracy Determination: Accurate.

list index out of range

I have attached the config.yaml.

Augmentoolkit finishes processing the documents and produces Q/A pairs:

drwxr-xr-x 2 root root 28672 Nov 15 21:24 check_answer_accuracy_generations
drwxr-xr-x 2 root root 28672 Nov 15 21:24 check_question_generations
drwxr-xr-x 4 root root 4096 Nov 15 21:14 judge_paragraph_generations
-rw-r--r-- 1 root root 81303 Nov 15 21:14 judge_paragraph_generations_DATAGEN_OUTPUT.jsonl
-rw-r--r-- 1 root root 246317 Nov 15 21:29 master_list.jsonl
-rw-r--r-- 1 root root 38109 Nov 15 21:29 plain_qa_list.jsonl
-rw-r--r-- 1 root root 27008 Nov 15 21:13 pretraining.jsonl
drwxr-xr-x 2 root root 4096 Nov 15 21:24 qatuples_filtered
drwxr-xr-x 4 root root 4096 Nov 15 21:24 question_context_revision_generations
drwxr-xr-x 4 root root 4096 Nov 15 21:14 question_generation_generations
-rw-r--r-- 1 root root 106407 Nov 15 21:29 questions_generation_dataset.jsonl
root@0797d8d75562:/tmp/augmentoolkit#

Is the exception problematic and, if so, do you have suggestions how to fix it?

Here is the config.yaml:

API:
  API_KEY: xxxxx
  BASE_URL: http://localhost:9000/v1
  LARGE_LOGICAL_MODEL: mistralai/mistral-large
  LOGICAL_MODEL: mistralai/mistral-large
HUGGINGFACE:
  HUB_PATH: Heralax/test-atk-dataset-do-not-use-3
  PRIVATE: False
  PUSH_TO_HUB: False
PATH:
  DEFAULT_PROMPTS: ./prompts
  INPUT: /tmp/augmentoolkit/original/inputIpcom
  OUTPUT: ./outputIpcom
  PROMPTS: ./prompts
PHASE:
  PHASE_INDEX: 3
  WORK_IN_PHASES: False
SKIP:
  ANSWER_RELEVANCY_CHECK: True
  FILTER_CHUNKS: False
  QUESTION_CHECK: False
  CONVERSATION_GENERATION: True
  REPAIR_QA_TUPLES: True
SYSTEM:
  CHUNK_SIZE: 1900
  COMPLETION_MODE: False
  CONCURRENCY_LIMIT: 3
  CONVERSATION_INSTRUCTIONS: For this conversation, you are generating a chat between a generic user, and an assistant.
  DOUBLE_CHECK_COUNTER: 1
  DO_NOT_USE_SYSTEM_PROMPTS: True
  FINAL_ASSISTANT_PROMPT_NO_RAG: 'You are a helpful assistant.

    '
  FINAL_ASSISTANT_PROMPT_RAG: 'You are a helpful assistant.

    Context information is below:

    ----------------------

    {data}

    '
  MODE: api
  STOP: True
  SUBSET_SIZE: 5000
  USE_FILENAMES: False
  USE_SUBSET: True
SCRAPING:
  USE_GUTENBERG: False
  START_URL: "https://www.gutenberg.org/ebooks/bookshelf/57"
  MAX_BOOKS: 5
  MAX_FAILURES: 5

e-p-armstrong commented 22 hours ago

Hey Mark! I believe I got back to you over email on this, but to answer your question -- if Augmentoolkit keeps running, it should not be a critical problem. This particular one just means that for one particular attempt on one particular chunk, the model messed up the output format. I should really make the error message less scary...

But you should be good to go!
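
For anyone who wants to silence the exception rather than ignore it, here is a minimal sketch of one possible guard. It assumes that a chunk with an empty `choices` list carries no text and is safe to skip, which this thread does not confirm. The names `make_chunk`, `fake_stream`, and `collect_completion` are hypothetical stand-ins for illustration; this is not Augmentoolkit's actual code, and the fake chunks only mimic the attributes used in the loop quoted above.

```python
import asyncio
from types import SimpleNamespace


def make_chunk(text=None, empty=False):
    """Build a stand-in for a streamed chunk (hypothetical helper, illustration only)."""
    if empty:
        # Mirrors the chunk in the log above: choices=[]
        return SimpleNamespace(choices=[])
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


async def fake_stream():
    """Hypothetical stream that includes one chunk with an empty `choices` list."""
    chunks = (
        make_chunk("Overall "),
        make_chunk(empty=True),
        make_chunk("Accurate."),
        make_chunk(None),  # a chunk whose delta has no content
    )
    for chunk in chunks:
        yield chunk


async def collect_completion(stream):
    """Concatenate streamed text, skipping chunks that have no choices."""
    completion = ""
    async for chunk in stream:
        if not chunk.choices:
            # Assumption: nothing to read here, so skip instead of
            # indexing choices[0] and raising IndexError.
            continue
        content = chunk.choices[0].delta.content
        if content:
            completion += content
    return completion


if __name__ == "__main__":
    print(asyncio.run(collect_completion(fake_stream())))
```

With this guard the empty chunk is dropped silently and the rest of the stream is still collected. Whether skipping is the right behavior depends on why the server sends such chunks in the first place.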