johnr14 opened 1 month ago
This is how I see a better pipeline, but some form of flow control must enable retries and multi-generation. EDIT: updated diagram
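To be clearer about what I mean by flow control, something along these lines is what I picture; the function names and retry policy here are placeholders I made up, not anything from the existing code:

```python
# Hypothetical sketch of flow control: retry a step until its output validates,
# and generate several candidates per input. All names here are placeholders.
def run_step_with_retries(generate, validate, item, max_retries: int = 3):
    for attempt in range(max_retries):
        output = generate(item)
        if validate(output):
            return output
    return None  # give up after max_retries; the pipeline can drop or flag the item

def multi_generate(generate, validate, item, n: int = 3) -> list:
    """Keep every candidate that passes validation (multi-generation)."""
    candidates = (run_step_with_retries(generate, validate, item) for _ in range(n))
    return [c for c in candidates if c is not None]
```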
Alignment can act as a safety check to make sure it's not sensitive information (private data like your phone # or bank account?), dangerous information, or any censoring someone may want to apply. It could also be bypassed...
I think short questions like: Explain in detail what ______ is when it's in the context of ______.
are great, because the model can produce a broad summary of everything it is while still being very domain-specific.
Then get more in depth with very precise questions: Explain the role of ______ when ______ happens.
or: How can you prevent ______ when ______?
Those are the kind of questions that should make an LLM smarter (personal opinion, not yet verified).
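To make the idea concrete, here is a rough sketch of how such blank-filling templates could be expanded into prompts; the template strings, field names, and example fillers are purely illustrative:

```python
# Hypothetical sketch: expand question templates into prompts for QA generation.
# The templates and example fillers below are illustrative only.
QUESTION_TEMPLATES = [
    "Explain in detail what {term} is when it's in the context of {domain}.",
    "Explain the role of {term} when {event} happens.",
    "How can you prevent {event} when {condition}?",
]

def build_questions(term: str, domain: str, event: str, condition: str) -> list[str]:
    """Fill the blanks of each template with terms extracted from a chunk."""
    return [
        QUESTION_TEMPLATES[0].format(term=term, domain=domain),
        QUESTION_TEMPLATES[1].format(term=term, event=event),
        QUESTION_TEMPLATES[2].format(event=event, condition=condition),
    ]

if __name__ == "__main__":
    for q in build_questions("backpropagation", "neural networks",
                             "vanishing gradients", "training very deep networks"):
        print(q)
```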
Ok, so I made a test to check if a single-pass pre-evaluation was possible with an 8b model. I used a public paper under the Creative Commons Attribution 4.0 International License and parsed it with a prompt to extract a sort of overall fingerprint metadata for the file.
This fingerprint will be used to determine how much work (i.e. tokens) should be spent extracting data from it and how many tokens should be produced. (In short, this is choosing the pipeline effort and output size.)
I think this could be used to parse at least the first page of a document, more if it fits in context.
This is the prompt:
This shows that a single API call can generate lots of different elements that would otherwise be distributed over many API calls. Since it outputs only JSON, it's kinda cheap token-wise.
Pretty impressed with an 8b model! I think it's almost better than 405b for some keys. While not perfect, it's pretty good. So if KnowledgeType is "Scientific Research" and DepthOfKnowledge > 6, try to extract more technical terms and more content!
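For illustration, this is roughly how I imagine the fingerprint driving the pipeline effort; apart from KnowledgeType and DepthOfKnowledge, the fields, thresholds, and effort levels below are invented for the example:

```python
import json

# Hypothetical routing sketch: pick pipeline effort from the fingerprint JSON.
# Only KnowledgeType and DepthOfKnowledge come from my actual output; the
# thresholds and effort settings are made up for illustration.
def choose_effort(fingerprint_json: str) -> dict:
    fp = json.loads(fingerprint_json)
    effort = {"chunk_size": 2000, "max_output_tokens": 512, "extract_terms": False}
    if fp.get("KnowledgeType") == "Scientific Research" and fp.get("DepthOfKnowledge", 0) > 6:
        # Dense technical material: smaller chunks, bigger outputs, term extraction on.
        effort.update(chunk_size=1000, max_output_tokens=1024, extract_terms=True)
    return effort

example = '{"KnowledgeType": "Scientific Research", "DepthOfKnowledge": 8}'
print(choose_effort(example))
```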
I find that re-running multiple calls to 8b with slightly different prompts (and smaller chunks) is more efficient than a single prompt to a larger model. Data extraction works better with smaller chunks and a small LLM. 3b will need to be evaluated as a MICRO model and 1.5b as a NANO model; what kind of results can be expected from them (for data extraction at least)? They are cheap to run tokens through, be it locally or remotely.
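A rough sketch of what I mean by many small calls over small chunks; `call_llm` and the model name are stand-ins for whatever client the pipeline actually uses, not real functions from the project:

```python
# Hypothetical sketch: many cheap 8b calls over small chunks instead of one
# large-model call over the whole document. call_llm() and the model name are placeholders.
def split_into_chunks(text: str, max_chars: int = 2000) -> list[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def extract_per_chunk(text: str, prompt: str, call_llm) -> list[str]:
    results = []
    for chunk in split_into_chunks(text):
        # Each chunk gets its own small, focused request.
        results.append(call_llm(model="llama-3.1-8b", prompt=f"{prompt}\n\n{chunk}"))
    return results
```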
Technical term extraction would work best on smaller chunks than a whole article like this, but it still did a good job. Perhaps a dictionary-based comparison of the JSON output with all terms found in the text using regex, then asking the LLM only for the missing terms, would be the optimal way to go instead of calling it multiple times on small chunks.
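Something like this is what I have in mind for the dictionary/regex comparison; the term dictionary and helper names are made up for the example:

```python
import re

# Hypothetical sketch: compare the terms the LLM returned in its JSON against
# terms found in the raw text, then ask only for the ones that are missing.
def find_candidate_terms(text: str, term_dictionary: set[str]) -> set[str]:
    """Regex word match against a known dictionary of technical terms."""
    words = set(re.findall(r"[A-Za-z][A-Za-z0-9\-]+", text.lower()))
    return {t.lower() for t in term_dictionary if t.lower() in words}

def missing_terms(llm_terms: list[str], text: str, term_dictionary: set[str]) -> set[str]:
    found_in_text = find_candidate_terms(text, term_dictionary)
    return found_in_text - {t.lower() for t in llm_terms}

# The missing set could then go back to the LLM in a single follow-up request
# asking for definitions of only those terms.
```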
Forgot to specify to only look up definitions found in the text, but oh well:
From the following text, list ALL technical terms and be very exhaustive ! Identify and define any specialized terminology, jargon, or technical vocabulary used. Output it to JSON.
EDIT: Just noticed there is an error in the JSON from llama 8b for technical term extraction, an extra `"`.
Hey, thanks for making this issue, looks like you're doing some really creative stuff! Very glad that someone's making good use of the boilerplate :)
If you have any specific questions as you develop it please let me know, and when you're done I would not be against merging it into the main project as an official pipeline option if you're open to it. Keep me posted! This looks really cool.
How would it be possible to have a summary made out of a chunk, with a prompt named `summary_gen.yaml`?
_EDIT: While I had trouble getting my head around the code in `./original`, I started from scratch with `BOILERPLATE_TO_MAKE_YOUR_OWN_PIPELINE`, so I kinda figured out how to do this on my own._
Will close this in a bit when I get it working; for now, I am sharing some of my work in abstracting the pipeline, abusing the prompts, and optimizing the quality of the output while minimizing token use.
I have had some passable results by telling it to identify the main theme of the chunk and the domain of knowledge it's about, and that any information not related to it should be discarded, like advertising or irrelevant mixed-in text. That would help generate good content for continuous training. Different levels of shrinking it down could be tried: long summary, summary, short summary.
"Summarize the following text by keeping only what is consistent with the main idea, theme, or key points. Remove anything that is not relevant or seems off-topic."
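For reference, this is roughly how I picture wiring that prompt into a step; the YAML layout and the `call_llm` helper are assumptions on my part, not the project's actual API:

```python
import yaml  # assumes pyyaml is installed

# Hypothetical sketch of a summary step driven by a summary_gen.yaml prompt file.
# The YAML structure, call_llm(), and the model name are placeholders.
SUMMARY_PROMPT_YAML = """
system: |
  Summarize the following text by keeping only what is consistent with the main
  idea, theme, or key points. Remove anything that is not relevant or seems off-topic.
"""

def summarize_chunk(chunk: str, call_llm) -> str:
    prompt = yaml.safe_load(SUMMARY_PROMPT_YAML)["system"]
    return call_llm(model="llama-3.1-8b", prompt=f"{prompt}\n\n{chunk}")
```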
Also, if this works well, it could be possible to preprocess the chunk for later extraction of QA data, validating it against the original chunk. Small 1.5b and 3b models could be used to pump out quick and cheap Q-A pairs that could be classified, verified, grouped by something common, and reworded into a complex Q-A that conveys more information.
I would have to get it done and compare the normal pipeline with a summary-based pipeline to see if there is any difference in dataset quality or generation speed.
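A rough sketch of the validation idea, where `call_llm`, the model names, and the Q:/A: format are placeholders for whatever the pipeline actually uses:

```python
# Hypothetical sketch: generate cheap Q-A pairs with a small model, then have a
# second pass check each answer against the original chunk. call_llm() is a placeholder.
def generate_qa(chunk: str, call_llm, n: int = 5) -> list[tuple[str, str]]:
    raw = call_llm(model="small-3b", prompt=f"Write {n} question-answer pairs about:\n{chunk}")
    pairs = []
    for block in raw.split("\n\n"):
        if "Q:" in block and "A:" in block:
            q = block.split("Q:", 1)[1].split("A:", 1)[0].strip()
            a = block.split("A:", 1)[1].strip()
            pairs.append((q, a))
    return pairs

def validate_qa(pairs, chunk: str, call_llm) -> list[tuple[str, str]]:
    kept = []
    for q, a in pairs:
        verdict = call_llm(
            model="small-3b",
            prompt=f"Text:\n{chunk}\n\nQuestion: {q}\nAnswer: {a}\n"
                   "Is the answer fully supported by the text? Reply YES or NO.",
        )
        if verdict.strip().upper().startswith("YES"):
            kept.append((q, a))
    return kept
```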
Thanks