Closed johnr14 closed 1 hour ago
I was looking at a test run, and some questions will have to be dropped as they are not related to the knowledge I wish to extract from the PDF.
I am using `cat master_list.jsonl | jq | grep "question" | sort`
to get the list of questions, and I find some like:
Are there any career options available for individuals interested in working in ... # publicity
How do the chances of winning the contest depend on the number of eligible entries received? # publicity for contest ?
Who designs the layout and cover of [...] # copyright ?
Who typically applies the 10th product (described in the tenth bullet point)? # what what ??
Why might reproducing content without permission be problematic? # not copyright again ?
How many pages long is the book [...] # like that's more important than what it's about...
Is the fifth product suitable for clients # what again ?
What happens after the demands and internal references are reviewed by [...] # study information ?
But I also got some nice ones, like:
Can you summarize a study by [...]
I rewrote the `qatuples_gen_no_filenames.yaml` prompt, adding 10 Q-A templates as well as more detailed guidance on what information should be prioritized. The normal run left 356 valid questions out of 556 generated (with a few bad ones like the above); the new prompt generated 921 questions, of which 505 remained after validation.
So for a single PDF, I went from 356 to 505 questions just by asking for 10 questions per chunk.
I think there are still major gains available by reworking the pipeline and prompts, so much so that a re-run of your LLM army may see a major uplift!
I still got some out-of-context questions, like:
What is the ISSN number of the journal?
What annual budget does [...] allocate between repaying her mortgage [...]
What return is used in the simulations presented for the three scenarios of acquiring a first home?
What [...] event took place at [...]
What is the preferred contraceptive method chosen by [FIRST NAME] ?
What is the unique aspect of the work-life balance offered by the city of [...]
What are the benefits offered to employees by [...]
What are the choices for the winning prize?
They may be somewhat informative, but they are unrelated to what I want to train the LLM on, and they come from advertising within the source material.
Looking at together.ai/pricing, training an 8B LLM for 1 epoch on 1000 tokens costs $5. For a single PDF, having poor questions is not so bad, but with a few years' worth of PDF magazine issues it could lead to significant wasted cost. That's why I think Q-A grouping and summarization should also be implemented.
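Using the pricing figure quoted above, a back-of-the-envelope waste estimate might look like this. The average Q-A token length here is an illustrative assumption, not a measured value:

```python
# Rough training-cost waste estimate. COST_PER_1K_TOKENS is the figure
# quoted above; AVG_TOKENS_PER_QA is an illustrative assumption.
COST_PER_1K_TOKENS = 5.0   # $5 per 1000 tokens for 1 epoch
AVG_TOKENS_PER_QA = 150    # assumed average length of one Q-A pair

def wasted_cost(total_questions, bad_fraction):
    """Dollars spent training on questions that should have been dropped."""
    bad_tokens = total_questions * bad_fraction * AVG_TOKENS_PER_QA
    return bad_tokens / 1000 * COST_PER_1K_TOKENS

# e.g. 505 questions with 10% publicity noise -> about $37.88 wasted
```

Even a modest bad-question rate compounds quickly across many PDFs, which is the argument for filtering before training rather than after.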
I did have one `\n\n**ANSWERS:**`
that was not parsed well and ended up appended to the end of a previous question.
I also have 5 answers beginning with "According to the text,"
that were generated and should not exist.
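Both artifacts could be caught in a small post-processing pass. Only the marker string and the leading phrase come from my runs above; the per-pair data shape is an assumption:

```python
import re

# Stray answers-section marker that leaked into a question string.
ANSWERS_MARKER = re.compile(r"\n\n\*\*ANSWERS:\*\*\s*$")
BAD_PREFIX = "According to the text,"

def clean_qa(question, answer):
    """Strip a stray **ANSWERS:** marker and de-template the answer opener."""
    question = ANSWERS_MARKER.sub("", question)
    # Drop the boilerplate opener rather than discarding the whole pair.
    if answer.startswith(BAD_PREFIX):
        answer = answer[len(BAD_PREFIX):].lstrip()
        if answer:
            answer = answer[0].upper() + answer[1:]
    return question, answer
```

Running this over the master list before training would salvage those 5 pairs instead of dropping them.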
For the rest, I'm very impressed by my new results.
Can't wait to have multiple prompts per chunk to extract all pertinent information, or to use information extraction before generating questions...
Here, let me share my prompt:
Closing; I'm building my own pipeline. I had a hard time getting my head around the original.
Hi, I am having trouble getting my head around the code...
I find that 4 questions per chunk can miss important points in high-quality, condensed academic content.
Suppose I would like to generate more questions per chunk and specify a second file like
`qatuples_gen_specialized.yaml`,
with a variable to re-run the generation or swap a `{special_instruction}` into the prompt. How should that be done? The Q-A pairs would have to be de-duplicated later...
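One way the swap could work, sketched outside the pipeline: keep the `{special_instruction}` slot in the prompt text and iterate over a list of instructions instead of duplicating generation code. The template wording and instruction list below are made up for illustration; only the placeholder name comes from my question:

```python
# Hypothetical prompt template; "{special_instruction}" is the slot discussed
# above, the rest of the wording is invented for this sketch.
PROMPT_TEMPLATE = (
    "You are generating question-answer pairs from the text below.\n"
    "{special_instruction}\n"
    "Text: {chunk}\n"
)

SPECIAL_INSTRUCTIONS = [
    "Focus on definitions and key terminology.",
    "Focus on causal explanations and mechanisms.",
]

def build_prompts(chunk):
    # One prompt per special instruction: a "second run" becomes another
    # entry in this list rather than a duplicated class or yaml file.
    return [
        PROMPT_TEMPLATE.format(special_instruction=instr, chunk=chunk)
        for instr in SPECIAL_INSTRUCTIONS
    ]
```

The same chunk then yields one generation pass per instruction, which is where the later de-duplication step would come in.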
The other way around would be to have the LLM generate a list of all key points, main ideas, explanations, descriptions, or other relevant information in the text, then generate questions that can be answered by those points... That's how I did it in the past.
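That two-pass idea could look roughly like this, where `ask_llm` is a hypothetical stand-in for whatever completion call the pipeline uses:

```python
def extract_then_question(chunk, ask_llm):
    """Pass 1: list the chunk's key points. Pass 2: one question per point.

    ask_llm is any callable mapping a prompt string to a completion string.
    """
    points_raw = ask_llm(
        "List every key point, main idea, or explanation in this text, "
        "one per line:\n" + chunk
    )
    points = [p.strip("- ").strip() for p in points_raw.splitlines() if p.strip()]
    qa_pairs = []
    for point in points:
        question = ask_llm(
            "Write one question answerable from this point alone:\n" + point
        )
        qa_pairs.append((question.strip(), point))
    return qa_pairs
```

The number of Q-A pairs then scales with the density of the chunk rather than being fixed at 4 (or 10) per chunk.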
So should I go and create a task2 in processing.py like this? But then how do I pass
`qatuples_gen_specialized.yaml`?
Or should I create a duplicate class in steps.py?
But then I would have to duplicate the class for each new prompt. And would running it twice give more Q-A pairs? I would like to run it once with a special instruction like one from this list:
Thanks