e-p-armstrong / augmentoolkit

Convert Compute And Books Into Instruct-Tuning Datasets! Makes: QA, RP, Classifiers.
MIT License
1.05k stars 144 forks

Create a chunk summary and discard irrelevant information. #71

Open johnr14 opened 1 month ago

johnr14 commented 1 month ago

How could a summary be generated from a chunk, for example with a prompt named summary_gen.yaml?

_EDIT: While I had trouble getting my head around the code in ./original, I started from scratch with BOILERPLATE_TO_MAKE_YOUR_OWN_PIPELINE. So I kinda figured out how to do this on my own._

I will close this in a bit when I get it working. For now, I am sharing some of my work on abstracting the pipeline, abusing the prompts, and optimizing the quality of the output while minimizing token use.


I have had some passable results by telling the model to identify the main theme of the chunk and its domain of knowledge, and to discard any information unrelated to them, such as advertising or irrelevant mixed-in text. That would help generate good content for continuous training. Different levels of condensing could be tried: long summary, summary, short summary.

Summarize the following text by keeping only what is consistent with the main idea, theme, or key points. Remove anything that is not relevant or seems off-topic.
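A minimal sketch of how the three summary levels mentioned above could share one prompt template. The base instruction is the prompt quoted in this post; the `build_summary_prompt` helper, the `SUMMARY_LEVELS` table, and the word budgets are hypothetical and are not part of augmentoolkit.

```python
# Hypothetical helper: builds a summary prompt for one of three shrink levels.
# The word budgets are illustrative assumptions, not measured values.
SUMMARY_LEVELS = {
    "long summary": 300,
    "summary": 150,
    "short summary": 50,
}

BASE_INSTRUCTION = (
    "Summarize the following text by keeping only what is consistent with "
    "the main idea, theme, or key points. Remove anything that is not "
    "relevant or seems off-topic."
)


def build_summary_prompt(chunk: str, level: str = "summary") -> str:
    """Return a prompt asking for a summary of `chunk` at the given level."""
    if level not in SUMMARY_LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    max_words = SUMMARY_LEVELS[level]
    return (
        f"{BASE_INSTRUCTION}\n"
        f"Write a {level} of at most {max_words} words.\n\n"
        f"Text:\n{chunk}"
    )
```

Keeping the level as a parameter would make it easy to run the same chunk through all three levels and compare the resulting datasets.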

Also, if this works well, the chunk could be preprocessed for later extraction of QA data, with the results validated against the original chunk. Small 1.5b and 3b models could be used to pump out quick and cheap Q-A pairs that could then be classified, verified, grouped by something they have in common, and reworded into a complex Q-A that conveys more information.

I would have to get it done and compare the normal pipeline with a summary-based pipeline to see if there is any difference in dataset quality or generation speed.

Thanks

johnr14 commented 1 month ago

This is how I see a better pipeline, but some form of flow control must enable retries and multi-generation.

[Image: updated draw.io pipeline diagram]

EDIT: updated diagram

Alignment can act as a safeguard to make sure the output contains no sensitive information (private, like your phone # or bank account?), dangerous information, or anything else someone may want to censor. It could also be bypassed...

I think short questions like: Explain in detail what ______ is when it's in the context of _____. are great, because the answer can give a broad summary of the whole topic while still being very domain specific.

Then get more in depth with very precise questions: Explain the role of ______ when this happens ______. or How can you prevent _____ when _____. Those are the kind of questions that should make an LLM smarter (personal opinion, not yet verified).

johnr14 commented 1 month ago

Ok, so I ran a test to check whether a single-pass pre-evaluation was possible with an 8b model. I used a public paper under the Creative Commons Attribution 4.0 International License and parsed it with a prompt to extract some sort of overall fingerprint metadata for a file.

This fingerprint will be used to determine how much work (i.e. tokens) should be spent extracting data from the file and how many tokens should be produced. (In short, this is choosing the pipeline effort and output size.)

I think that this could be used to parse at least the first page of a document, more if it fits in context.

This is the prompt :

```
You are an expert educational AI with the capability to efficiently analyze and interpret text content. You can understand key elements and identify the purpose and main topic in a text. Your ability to engage with the text critically allows you to provide users with a deeper understanding of the content they are working with.

This is how you understanding the key elements and identify the purpose and main topic in a text :

To comprehend the essential components and determine the purpose and main topic of any text, start by examining its headline and summary, if available. The headline often provides a clear indication of the subject matter, while the summary offers a concise overview of the entire content, including its objectives, strategies, findings, outcomes, and significance.

When analyzing a chunk of text, whether it's a single paragraph or a larger section, begin by considering the context surrounding it. This can provide valuable clues about its purpose and central idea. Look for explicit statements that directly state the main topic or purpose, which can often be found at the beginning or end of the chunk, serving as introductions or conclusions to the main idea being discussed.

Within the text, pay attention to the beginning and ending parts. The beginning section frequently includes the aims, inquiries, hypotheses, or objectives the author is addressing, while the ending part discusses the results, their understanding, and implications, effectively summarizing the content. The principal subject or central idea is usually introduced by the author in the start and concluded or re-emphasized at the end, with the body or discussion of the content expanding on or elaborating this main subject through various arguments or proofs.

As you read, consciously identify the underlying themes and ideas that guide the author's narrative. Take note of explicit statements, recurring patterns, and frequently appearing terms, as these often signal the central theme or proposition. By recognizing these recurrent elements and following their development throughout the content, you'll gain insight into the author's primary focus and the main message they aim to convey.

Keep in mind that identifying the purpose and main topic of a text, especially in smaller chunks, requires critical reading and an understanding of the surrounding context. Consider the implications, nuances, and underlying context or perspectives presented in the text, as these can provide valuable information about the author's intentions and the main idea being communicated.

By actively engaging with the text, examining context, recognizing recurring themes and patterns, and employing these strategies, you can effectively identify the purpose, main topic, and key elements of any chunk of text, regardless of its length or origin.

Analyze the text's content and identify:

Author: Note the mention of one or more authors.
Concepts Explained: List the key concepts presented or explained.
Context: What is the context of this document? Is it specific to a part of the world, for a specific culture, for general use, for school teaching material, or any other that you may identify?
Credibility: By looking at the author, publisher, and source, evaluate its credibility on a score of 0-10. 0 is none and 10 is from a reputable author and reputable publisher or source.
Depth of Knowledge: What level of knowledge is this on a scale of 10? 0-1 for stupidity, comedy, trolling, 3 for fiction and imaginary worlds, 4-5 for general knowledge, 6-7-8 for intellectual knowledge that requires moderate to high academic comprehension, 9-10 for expert knowledge or knowledge in science, mathematics, medicine, pharmacology, psychology, or any other field of research.
Domain of Knowledge: Determine the subject area or field of study referenced.
Knowledge Type: What kind of knowledge is conveyed in this text?
Language: In what language is the text written?
Overall Message: Determine the central idea or main point the author aims to convey.
Publisher: Note if a publisher is mentioned for this work.
Source: Note if this is from a company, an organization, a government, or any other social entity.
Summary: Distill the text's essence and its main subjects and key points in a condensed few lines.
Technical Terms: Identify and define any specialized terminology, jargon, or technical vocabulary used.
Type of Material: Try to identify the provenance of the material. Is it from a book, a website, a magazine, a flyer, course notes, or any other type you may identify?
Usefulness: Is this conveying some useful information, on a scale of 0 to 10? 0 is nonsense and 10 is expertise on a subject.

Read, analyze, and interpret the text content but only relate to the text and what information it provides.

**Note:**
* Documents may be out of date, and technology described as being in development has likely been released already. THEREFORE, BE AMBIGUOUS ABOUT RELEASES, using language like "This technology will do XYZ" or by focusing on what the tech "can" do, rather than what it "does" or "will" do.
* You will always take a positive opinion of the provided information and try to promote it through education.
* Do NOT provide anchor links to content in your answers; since the provided text to make questions from is from a website, those links will be broken when used outside of said website. So you should NEVER have any content like [some text](#!/some/path) in your answer! External links should not be given either. NO LINKS.

You can only generate JSON objects that accurately represents the extracted information, following the established format and structure.

You must strictly adhere to the format :
"""JSON
{
  "Author": [],
  "ConceptsExplained": [],
  "Context": [],
  "Credibility": 0,
  "DepthOfKnowledge": 0,
  "DomainOfKnowledge": [],
  "KnowledgeType": [],
  "Language": [],
  "OverallMessage": [],
  "Publisher": [],
  "Source": "",
  "Summary": [],
  "TechnicalTerms": [],
  "TypeOfMaterial": "",
  "Usefulness": 0
}
"""
```
The output of llama8b:

```JSON
{
  "Author": ["Jay Shah"],
  "ConceptsExplained": ["Antimicrobial Resistance (AMR)", "Antibiotic Use", "Global Economic Impact", "Surgical Site Infections", "Methicillin-Resistant Staphylococcus aureus (MRSA)", "Tuberculosis (TB)", "Malaria", "HIV", "Escherichia coli", "Klebsiella pneumoniae"],
  "Context": ["Global Health", "Economics", "Medicine", "Infectious Diseases"],
  "Credibility": 8,
  "DepthOfKnowledge": 9,
  "DomainOfKnowledge": ["Microbiology", "Epidemiology", "Economics", "Public Health"],
  "KnowledgeType": ["Scientific Research", "Review Article"],
  "Language": ["English"],
  "OverallMessage": ["Antimicrobial Resistance is a growing global health concern with significant economic and human implications"],
  "Publisher": ["Centre for Disease Control and Prevention", "The Economist", "KPMG LLP"],
  "Source": ["Pre-print article", "Journal articles", "Reports"],
  "Summary": ["The article discusses the growing concern of Antimicrobial Resistance (AMR) and its impact on global health and economy. The authors highlight the need for educational, political, economic, medical, and social factors to address this issue."],
  "TechnicalTerms": ["Antimicrobial Resistance (AMR)", "Antibiotic Use", "Surgical Site Infections", "Methicillin-Resistant Staphylococcus aureus (MRSA)", "Tuberculosis (TB)", "Malaria", "HIV", "Escherichia coli", "Klebsiella pneumoniae"],
  "TypeOfMaterial": ["Research Article", "Report"],
  "Usefulness": 9
}
```
EDIT: I wanted to try 405b:

```JSON
{
  "Author": ["Jay Shah"],
  "ConceptsExplained": [
    "Antimicrobial resistance (AMR) and its impacts",
    "Factors contributing to AMR",
    "Global economic impacts of AMR",
    "Challenges in developing new antibiotics"
  ],
  "Context": ["Global health issue"],
  "Credibility": 8,
  "DepthOfKnowledge": 7,
  "DomainOfKnowledge": ["Healthcare", "Economics"],
  "KnowledgeType": ["Factual", "Analytical"],
  "Language": ["English"],
  "OverallMessage": ["AMR is a serious global health threat with significant economic impacts that requires coordinated efforts to address."],
  "Publisher": [],
  "Source": "Centre for Disease Control and Prevention",
  "Summary": ["This article discusses the growing threat of antimicrobial resistance (AMR), its causes, potential economic impacts, and challenges in developing new antibiotics. It highlights the need for global efforts to combat AMR."],
  "TechnicalTerms": [
    "Antimicrobial resistance (AMR)",
    "Staphylococcus aureus",
    "Escherichia coli",
    "Klebsiella pneumoniae",
    "Antibiotic consumption"
  ],
  "TypeOfMaterial": "Article",
  "Usefulness": 8
}
```

This shows that a single API call can generate many different elements that would otherwise be spread over many API calls. Since it outputs only JSON, it's fairly cheap token-wise.
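The two runs above do not return identical shapes: the 8b output gives `Source` and `TypeOfMaterial` as lists, while the format block in the prompt asks for strings. A minimal sketch of a normalizer that coerces a fingerprint back to the requested shape might look like this. The key groups are taken from the prompt's format block; the function name `normalize_fingerprint` and the coercion choices (joining lists with commas, clamping scores to 0-10) are assumptions.

```python
import json

# Key groups follow the format block of the fingerprint prompt.
LIST_KEYS = ["Author", "ConceptsExplained", "Context", "DomainOfKnowledge",
             "KnowledgeType", "Language", "OverallMessage", "Publisher",
             "Summary", "TechnicalTerms"]
STRING_KEYS = ["Source", "TypeOfMaterial"]
SCORE_KEYS = ["Credibility", "DepthOfKnowledge", "Usefulness"]


def normalize_fingerprint(raw: str) -> dict:
    """Parse a model reply and coerce every key to the requested type."""
    data = json.loads(raw)
    out = {}
    for key in LIST_KEYS:
        value = data.get(key, [])
        # Wrap a bare value in a list so downstream code can always iterate.
        out[key] = value if isinstance(value, list) else [value]
    for key in STRING_KEYS:
        value = data.get(key, "")
        # Some runs return a list here; join it so the type stays stable.
        if isinstance(value, list):
            out[key] = ", ".join(str(v) for v in value)
        else:
            out[key] = str(value)
    for key in SCORE_KEYS:
        # Clamp scores to the 0-10 range the prompt asks for.
        out[key] = max(0, min(10, int(data.get(key, 0))))
    return out
```

With a step like this, later pipeline stages would not need to special-case which model produced the fingerprint.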

I'm pretty impressed with an 8b model! I think it's almost better than 405b for some keys. While not perfect, it's pretty good. So if KnowledgeType is "Scientific Research" and "DepthOfKnowledge" > 6, try to extract more technical terms and more content!
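The rule above could be sketched as a small routing function. Only the condition itself (KnowledgeType contains "Scientific Research" and DepthOfKnowledge > 6) comes from this post; the function name `choose_effort`, the returned keys, and every numeric budget are hypothetical placeholders.

```python
def choose_effort(fingerprint: dict) -> dict:
    """Pick extraction settings from a fingerprint (illustrative values only)."""
    knowledge_types = fingerprint.get("KnowledgeType", [])
    depth = fingerprint.get("DepthOfKnowledge", 0)

    # Rule from the post: research material with high depth gets more effort.
    if "Scientific Research" in knowledge_types and depth > 6:
        return {"extract_terms": True, "passes": 3, "max_output_tokens": 2000}

    # Default: a single cheap pass with no dedicated term extraction.
    return {"extract_terms": False, "passes": 1, "max_output_tokens": 500}
```

For the 8b fingerprint shown earlier (KnowledgeType includes "Scientific Research", DepthOfKnowledge 9), this sketch would select the heavier settings.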

I find that making multiple calls to 8b with slightly different prompts (and smaller chunks) is more efficient than a single prompt to a larger model. Data extraction works better with smaller chunks and a small LLM. 3b will need to be evaluated as a MICRO model and 1.5b as a NANO model: what kind of results can be expected from them (for data extraction at least)? Those are cheap to run tokens through, be it locally or remotely.

Technical term extraction would work better on smaller chunks than on a whole article like this, but it still did a good job. Perhaps the optimal way to go would be a dictionary-based comparison of the JSON output against all terms found in the text using regex, then asking the LLM for the missing terms, instead of polling it multiple times on small chunks.

I forgot to specify that definitions should only be looked up in the text, but oh well:
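A rough sketch of the regex comparison idea, assuming that acronyms are a good-enough proxy for technical terms. The pattern, the function names `find_candidate_terms` and `missing_terms`, and the token-level matching are all assumptions for illustration; a real dictionary-based check would likely need a richer candidate list.

```python
import re

# Heuristic (an assumption): runs of 2+ capitals, optionally hyphenated,
# e.g. "AMR", "MRSA", "XDR-TB". All-caps ordinary words will also match.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}(?:-[A-Z0-9]+)*\b")
TOKEN_RE = re.compile(r"[A-Za-z0-9-]+")


def find_candidate_terms(text: str) -> set[str]:
    """Collect acronym-like candidates from the source text."""
    return set(ACRONYM_RE.findall(text))


def missing_terms(text: str, extracted: list[str]) -> list[str]:
    """Return candidates from the text that no extracted term mentions."""
    known_tokens = {
        token.lower()
        for term in extracted
        for token in TOKEN_RE.findall(term)
    }
    candidates = find_candidate_terms(text)
    return sorted(c for c in candidates if c.lower() not in known_tokens)
```

The returned list could then be sent back to the model in a follow-up prompt asking only for the definitions of the missing terms.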

From the following text, list ALL technical terms and be very exhaustive ! Identify and define any specialized terminology, jargon, or technical vocabulary used. Output it to JSON.

llama8b technical terms:

```JSON
{
  "terms": [
    { "term": "Antimicrobial Resistance (AMR)", "definition": "A phenomenon where bacteria and other pathogens develop the ability to evade or defeat the drugs designed to eliminate them." },
    { "term": "Antimicrobial use (AMU)", "definition": "The use of antimicrobial agents, such as antibiotics, to treat or prevent infections." },
    { "term": "Broad-spectrum antibiotics", "definition": "Antibiotics that are effective against a wide range of bacteria, including both Gram-positive and Gram-negative bacteria." },
    { "term": "Gram-positive bacteria", "definition": "Bacteria that have a thick peptidoglycan cell wall and a positive charge, examples include Staphylococcus aureus and Enterococcus faecalis." },
    { "term": "Gram-negative bacteria", "definition": "Bacteria that have a thinner peptidoglycan cell wall and a negative charge, examples include Escherichia coli and Pseudomonas aeruginosa." },
    { "term": "Methicillin-resistant Staphylococcus aureus (MRSA)", "definition": "A type of Staphylococcus aureus that is resistant to methicillin, a type of beta-lactam antibiotic." },
    { "term": "Carbapenem-resistant Enterobacteriaceae (CRE)", "definition": "A type of bacteria that is resistant to carbapenem, a broad-spectrum antibiotic." },
    { "term": "Extensively drug-resistant tuberculosis (XDR-TB)", "definition": "A type of tuberculosis that is resistant to multiple drugs." },
    { "term": "Intrinsic resistance", "definition": "A natural property of bacteria that makes them resistant to certain antibiotics due to the structure of their cell wall or other bacterial components." },
    { "term": "Acquired resistance", "definition": "Resistance that develops in bacteria due to exposure to an antibiotic, which can occur through genetic mutations or changes in gene expression." },
    { "term": "Beta-lactam antibiotics", "definition": "A class of antibiotics that interfere with the bacterial cell wall synthesis, examples include penicillin and cephalosporins." },
    { "term": "Quinolone antibiotics", "definition": "A class of antibiotics that inhibit DNA replication in bacteria, examples include ciprofloxacin and levofloxacin." },
    { "term": "Aminoglycoside antibiotics", "definition": "A class of antibiotics that inhibit protein synthesis in bacteria, examples include gentamicin and tobramycin." },
    { "term": "Macrolide antibiotics", "definition": "A class of antibiotics that inhibit protein synthesis in bacteria, examples include azithromycin and clarithromycin." },
    { "term": "Fluoroquinolone antibiotics", "definition": "A class of antibiotics that inhibit DNA replication in bacteria, examples include ciprofloxacin and levofloxacin." },
    { "term": "Gut microbiome", "definition": "The collection of microorganisms that live in the human gut, which play a crucial role in digestion and immune system function." },
    { "term": "Probiotics", "definition": "Live microorganisms that are similar to the beneficial bacteria found in the human gut, which can help to restore balance to the gut microbiome." },
    { "term": "Synthetic biology", "definition": "The design and construction of new biological systems, such as microbes, to perform specific functions, such as the production of biofuels or other chemicals." },
    { "term": "Precision medicine", "definition": "An approach to healthcare that involves tailoring medical treatment to the individual needs of each patient, based on their unique genetic and environmental characteristics." },
    { "term": "Pharmacogenomics", "definition": "The study of how people's genetic differences affect their response to different medications." },
    { "term": "Patent expiration", "definition": "The end of the time period during which a company has exclusive rights to produce and sell a particular product, such as an antibiotic." },
    { "term": "Patent pool", "definition": "A mechanism that allows multiple companies to work together to develop and share intellectual property, such as patents, related to a particular product or technology." },
    { "term": "WTO", "definition": "The World Trade Organization, an international organization that promotes free trade and sets rules for international trade." },
    { "term": "TRIPS", "definition": "The Agreement on Trade-Related Aspects of Intellectual Property Rights, an international agreement that sets rules for intellectual property protection and enforcement." },
    { "term": "WHO", "definition": "The World Health Organization, an international organization that promotes global health and sets standards for healthcare." },
    { "term": "Pneumonia", "definition": "An infection of the lungs that can be caused by bacteria, viruses, or other microorganisms." },
    { "term": "Resistance genes", "definition": "Genes that confer resistance to antibiotics, which can be transmitted horizontally between bacteria." " },
    { "term": "Horizontal gene transfer", "definition": "The transfer of genes between bacteria through mechanisms such as conjugation, transformation, or transduction." },
    { "term": "Transformation", "definition": "The direct uptake and incorporation of free DNA molecules from the environment into a bacterial cell." },
    { "term": "Transduction", "definition": "The transfer of DNA from one bacterium to another through a viral vector." },
    { "term": "Conjugation", "definition": "The transfer of DNA between bacteria through direct cell-to-cell contact." }
  ]
}
```
llama405b technical terms:

```JSON
{
  "technical_terms": [
    { "term": "antimicrobial resistance (AMR)", "definition": "a phenomenon wherein bacteria and other pathogens develop the ability to evade or defeat the drugs designed to eliminate them" },
    { "term": "resistant infections", "definition": "infections caused by pathogens that have developed resistance to one or more antimicrobial agents" },
    { "term": "post-operative infections", "definition": "infections that occur after a surgical procedure" },
    { "term": "extensively drug-resistant tuberculosis (XDR-TB)", "definition": "a rare disease that can become more common and pose a substantial mortality threat, characterized by resistance to at least four of the core anti-TB drugs" },
    { "term": "carbapenem-resistant Enterobacteriaceae (CRE)", "definition": "bacteria that are resistant to carbapenems, which are broad-spectrum antibiotics that are often reserved as a last resort for treating resistant infections" },
    { "term": "methicillin-resistant Staphylococcus aureus (MRSA)", "definition": "a rare disease that can become more common and pose a substantial mortality threat, caused by a strain of Staphylococcus aureus resistant to methicillin and other beta-lactam antibiotics" },
    { "term": "broad-spectrum antibiotics", "definition": "antibiotics that are effective against a wide range of bacteria" },
    { "term": "antibiotic prophylaxis", "definition": "the preventive use of antibiotics to reduce the risk of infection" },
    { "term": "intrapartum antibiotic prophylaxis", "definition": "the administration of antibiotics during labor to prevent sepsis in infants" },
    { "term": "intestinal microbiomes", "definition": "the community of microorganisms living in the gut" },
    { "term": "antimicrobial resistance genes", "definition": "genes that confer resistance to antimicrobial agents" },
    { "term": "prospective controlled cohort study", "definition": "a type of study that follows a group of people over time and compares those exposed to a particular factor with those who are not" },
    { "term": "low-and-middle income countries (LMIC)", "definition": "countries with a lower economic status based on their Gross National Income (GNI) per capita" },
    { "term": "antibiotic use (AMU)", "definition": "the consumption or utilization of antibiotic medications" },
    { "term": "intensive livestock production", "definition": "a system of raising animals in which large numbers of livestock are kept in confined spaces, often with the use of antibiotics to promote growth and prevent disease" },
    { "term": "World Organization for Animal Health (WOAH)", "definition": "an intergovernmental organization responsible for improving animal health worldwide" },
    { "term": "agri-food systems", "definition": "the interconnected processes involved in the production, processing, distribution, and consumption of food" },
    { "term": "medically important antimicrobials", "definition": "antimicrobial agents that are critical for human medicine and should be used judiciously in animals to minimize the risk of resistance development" },
    { "term": "human-animal interface", "definition": "the points of contact between humans and animals, where pathogens can be transmitted between species" },
    { "term": "point-of-care diagnostic kits", "definition": "devices that allow for rapid testing and diagnosis at the point of patient care, without the need for specialized laboratory equipment" },
    { "term": " GLOBAL GDP", "definition": "the total value of all goods and services produced worldwide in a given year" },
    { "term": "reserve antibiotics", "definition": "antibiotics that are held in reserve and used only for the most resistant infections, to slow the development of resistance" }
  ]
}
```

EDIT: I just noticed there is an error in the llama8b JSON for technical terms extraction: an extra `"`.
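An error like this stray quote is the kind of thing the retry step mentioned earlier in the thread could catch. A minimal sketch, assuming the model call is wrapped in a zero-argument function: `parse_with_retries` and its `generate` parameter are hypothetical names, while `json.loads` and `json.JSONDecodeError` are from the Python standard library.

```python
import json
from typing import Any, Callable


def parse_with_retries(generate: Callable[[], str], max_attempts: int = 3) -> Any:
    """Call `generate` until its output parses as JSON, up to `max_attempts`."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        raw = generate()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err
            # Report where the parser gave up so the bad output can be inspected.
            print(f"attempt {attempt}: invalid JSON at line {err.lineno}, "
                  f"column {err.colno}: {err.msg}")
    raise ValueError(f"no valid JSON after {max_attempts} attempts") from last_error
```

In a pipeline, `generate` would wrap the actual LLM request; regenerating is usually simpler than trying to repair malformed output by hand.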

e-p-armstrong commented 1 month ago

Hey, thanks for making this issue, looks like you're doing some really creative stuff! Very glad that someone's making good use of the boilerplate :)

If you have any specific questions as you develop it please let me know, and when you're done I would not be against merging it into the main project as an official pipeline option if you're open to it. Keep me posted! This looks really cool.