explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

[R-290] Failed to parse output. Returning None. - SimpleEvolution - TestsetGenerator #945

Closed JPonsa closed 1 month ago

JPonsa commented 6 months ago

I am getting a “Failed to parse output. Returning None” error. I have tried llama3 and mistral8x7b. I believe both models should be able to generate JSON-like outputs.

I need advice on how to solve this.

This could be related to https://github.com/explodinggradients/ragas/issues/859


    # Imports assumed for this snippet (typical ragas 0.1.x / langchain paths):
    from langchain_text_splitters import RecursiveJsonSplitter
    from ragas.testset.generator import TestsetGenerator
    from ragas.testset.evolutions import simple, reasoning, multi_context

    splitter = RecursiveJsonSplitter(max_chunk_size=2_000)
    docs = splitter.create_documents(texts=studies)

    generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)

    eval_ds = generator.generate_with_langchain_docs(
        docs,
        test_size=50,
        distributions={
            simple: 0.4,
            reasoning: 0.4,
            multi_context: 0.2,
        },
        raise_exceptions=True,
        is_async=True,  # as per https://github.com/explodinggradients/ragas/issues/709
    )
    eval_ds.to_pandas().to_csv(args.output)

Error (note: I had to trim the SimpleEvolution message below, as it was reporting the content of many documents and their embeddings):

Failed to parse output. Returning None.
Failed to parse output. Returning None.
max retries exceeded for SimpleEvolution(generator_llm=LangchainLLMWrapper(run_config=RunConfig(timeout=60, max_retries=15, max_wait=90, max_workers=16, exception_types=<class 'Exception'>)), docstore=InMemoryDocumentStore(splitter=<langchain_text_splitters.base.TokenTextSplitter object at 0x2b813d22f410>, nodes=[Node(page_content="NCT00000173: protocolSection: identificationModule: nctId: NCT00000173, organization: fullName: National Institute on Aging (NIA), class: NIH, briefTitle: Memory Impairment Study (Mild Cognitive Impairment Study), officialTitle: A Randomized, Double-Blind, Placebo-Controlled Trial of Vitamin E and Donepezil HCL (Aricept) to Delay Clinical Progression From Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD), statusModule: overallStatus: COMPLETED, sponsorCollaboratorsModule: leadSponsor: name: National Institute on Aging (NIA), class: NIH, descriptionModule: briefSummary: The National Institute on Aging (NIA) is launching a nationwide treatment study targeting individuals with mild cognitive impairment (MCI), a condition characterized by a memory deficit, but not dementia. An NIA-funded study recently confirmed that MCI is different from both dementia and normal age-related changes in memory. Accurate and early evaluation and treatment of MCI individuals might prevent further cognitive decline, including development of Alzheimer's disease (AD). The Memory Impairment Study is the first such AD prevention clinical trial carried out by NIH, and will be conducted at 65-80 medical research institutions located in the United States and Canada. This study will test the usefulness of two drugs to slow or stop the conversion from MCI to AD. The trial will evaluate placebo, vitamin E, and donepezil, an investigational agent approved by the Food and Drug Administration for another use. Vitamin E (alpha-tocopherol) is thought to have antioxidant properties, and was shown in a 1997 study to delay important dementia milestones, such as patients' institutionalization or progression to severe dementia, by about seven months.", metadata={'filename': 'NCT00000173'}, doc_id='4a458ce7-7dd9-41c2-98f4-ddc032b683b7'), Node(page_content="NCT00000173: protocolSection: conditionsModule: conditions: Alzheimer Disease; keywords: Mild cognitive impairment, Alzheimer's disease, Memory, Donepezil, Vitamin E, Antioxidants, Cholinergic agents, Cholinesterase inhibitors; designModule: studyType: INTERVENTIONAL, phases: PHASE3; designInfo: allocation: RANDOMIZED, interventionModel: PARALLEL, primaryPurpose: TREATMENT, maskingInfo: , armsInterventionsModule: interventions: type: DRUG, name: Donepezil, type: DRUG, name: Vitamin E; eligibilityModule: eligibilityCriteria: Inclusion Criteria: * Memory complaints and memory difficulties which are verified by an informant. * Abnormal memory function documented by scoring below the education adjusted cutoff on the Logical Memory II subscale (Delayed Paragraph Recall) from the Wechsler Memory Scale - Revised (the maximum score is 25): a) less than or equal to 8 for 16 or more years of education, b) less than or equal to 4 for 8-15 years of education, c) less than or equal to 2 for 0-7 years of education. * Mini-Mental Exam score between 24 and 30 (inclusive) (Exceptions may be made for subjects with less than 8 years of education at the discretion of the project director.). * Clinical Dementia Rating = 0.5. Memory Box score must be at least 0.5. 
* General cognition and functional performance sufficiently preserved such that a diagnosis of Alzheimer's disease cannot be made by the site physician at the time of the screening visit. * No significant cerebrovascular disease: Modified Hachinski score of less than or equal to 4. * Age between 55 and 90 (inclusive). * Permitted medications stable for at least 1 month prior to screening. In particular: a) Subjects may take stable doses of antidepressants lacking significant anticholinergic side effects (if they are not currently depressed and do not have a history of major depression within the past 2 years). b) Estrogen replacement therapy is permissible. c) Ginkgo biloba is permissible, but discouraged. * Hamilton Depression rating scale score of less than or equal to 12 on the 17-item scale. * Informant is available who has frequent contact with the subject (e.g. an average of 10 hours per week or more), agrees to monitor administration of study drug, observe for adverse events, and accompany the subject to all clinic visits for the duration of the protocol. * CT or MRI scans within 12 months prior to screening without evidence of infection, infarction, or other focal lesions and without clinical symptoms suggestive of intervening neurological disease. A lacune in a non-critical brain area which is not believed to contribute to the subject's cognitive impairment is permissible. * Adequate visual and auditory acuity to allow neuropsychological testing. * Good general health with no additional diseases expected to interfere with the study. * Normal B12, RPR, and Thyroid Function Tests or without any clinically significant abnormalities that would be expected to interfere with the study. * ECG without clinically significant abnormalities that would be expected to interfere with the study. * Subject is not pregnant, lactating, or of childbearing potential (i.e. women must be two years post-menopausal or surgically sterile). * Agreement not to take other vitamin supplements (including Vitamin E), multivitamins, other than those provided by the study. Exclusion Criteria: * Any significant neurologic disease other than suspected incipient Alzheimer's disease, such as Parkinson's disease, multi-infarct dementia, Huntington's disease, normal pressure hydrocephalus, brain tumor, progressive supranuclear palsy, seizure disorder, subdural hematoma, multiple sclerosis, or history of significant head trauma followed by persistent neurologic defaults or known structural brain abnormalities. * Major depression or another major psychiatric disorder as described in DSM IV within the past 2 years. * Psychotic features, agitation or behavioral problems within the last 3 months which could lead to difficulty complying with the protocol. * History of alcohol or substance abuse or dependence within the past 2 years (DSM IV criteria). * History of schizophrenia (DSM IV criteria). * Any significant systemic illness or unstable medical condition which could lead to difficulty complying with the protocol including: a) History of systemic cancer within the last 5 years (non-metastatic skin cancers are acceptable). b) History of myocardial infarction within the past year or unstable or severe cardiovascular disease including angina or CHF with symptoms at rest. c) Clinically significant obstructive pulmonary disease or asthma. d) Clinically significant and unstable gastrointestinal disorder such as ulcer disease or a history of active or occult gastrointestinal bleeding within two years. 
e) Clinically significant laboratory test abnormalities on the battery of screening tests (hematology, prothrombin time, chemistry, urinalysis, ECG). f) Insulin", metadata={'filename': 'NCT00000173'}, doc_id='6de53b72-0e35-4864-8a27-f3524bbe6a90', wins=2), [ mode documents]..., metadata={'filename': 'NCT00000938'}, doc_id='fa58d93c-8e58-4898-93de-59402c15503e')], node_embeddings_list=[[-0.034423138946294785, , 0.01802566833794117]], node_map={'4a458ce7-7dd9-41c2-98f4-ddc032b683b7': 

[...]

node_filter=NodeFilter(llm=LangchainLLMWrapper(run_config=RunConfig(timeout=60, max_retries=15, max_wait=90, max_workers=16, exception_types=<class 'Exception'>)), threshold=1.5, context_scoring_prompt=Prompt(name='score_context', instruction='\n    Given a context, perform the following task and output the answer in VALID JSON format: Assess the provided context and assign a numerical score of 1 (Low), 2 (Medium), or 3 (High) for each of the following criteria in your JSON response:\n\nclarity: Evaluate the precision and understandability of the information presented. High scores (3) are reserved for contexts that are both precise in their information and easy to understand. Low scores (1) are for contexts where the information is vague or hard to comprehend.\ndepth: Determine the level of detailed examination and the inclusion of innovative insights within the context. A high score indicates a comprehensive and insightful analysis, while a low score suggests a superficial treatment of the topic.\nstructure: Assess how well the content is organized and whether it flows logically. High scores are awarded to contexts that demonstrate coherent organization and logical progression, whereas low scores indicate a lack of structure or clarity in progression.\nrelevance: Judge the pertinence of the content to the main topic, awarding high scores to contexts tightly focused on the subject without unnecessary digressions, and low scores to those that are cluttered with irrelevant information.\nStructure your JSON output to reflect these criteria as keys with their corresponding scores as values\n    ', output_format_instruction='The output should be a well-formatted JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output JSON schema:\n```\n{"type": "object", "properties": {"clarity": {"title": "Clarity", "type": "integer"}, "depth": {"title": "Depth", "type": "integer"}, "structure": {"title": "Structure", "type": "integer"}, "relevance": {"title": "Relevance", "type": "integer"}}, "required": ["clarity", "depth", "structure", "relevance"]}\n```\n\nDo not return any preamble or explanations, return only a pure JSON string surrounded by triple backticks (```).', examples=[{'context': 'The Pythagorean theorem is a fundamental principle in geometry. It states that in a right-angled triangle, the square of the length of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the lengths of the other two sides. This can be written as a^2 + b^2 = c^2 where c represents the length of the hypotenuse, and a and b represent the lengths of the other two sides.', 'output': {'clarity': 3, 'depth': 1, 'structure': 3, 'relevance': 3}}, {'context': 'Albert Einstein (14 March 1879 - 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time.', 'output': {'clarity': 3, 'depth': 2, 'structure': 3, 'relevance': 3}}, {'context': "I love chocolate. It's really tasty. Oh, and by the way, the earth orbits the sun, not the other way around. 
Also, my favorite color is blue.", 'output': {'clarity': 2, 'depth': 1, 'structure': 1, 'relevance': 1}}], input_keys=['context'], output_key='output', output_type='json', language='english')), question_filter=QuestionFilter(llm=LangchainLLMWrapper(run_config=RunConfig(timeout=60, max_retries=15, max_wait=90, max_workers=16, exception_types=<class 'Exception'>)), filter_question_prompt=Prompt(name='filter_question', instruction='\nAsses the given question for clarity and answerability given enough domain knowledge, consider the following criteria:\n1.Independence: Can the question be understood and answered without needing additional context or access to external references not provided within the question itself? Questions should be self-contained, meaning they do not rely on specific documents, tables, or prior knowledge not shared within the question.\n2.Clear Intent: Is it clear what type of answer or information the question seeks? The question should convey its purpose without ambiguity, allowing for a direct and relevant response.\nBased on these criteria, assign a verdict of "1" if a question is specific, independent, and has a clear intent, making it understandable and answerable based on the details provided. Assign "0" if it fails to meet one or more of these criteria due to vagueness, reliance on external references, or ambiguity in intent.\nProvide feedback and a verdict in JSON format, including suggestions for improvement if the question is deemed unclear. Highlight aspects of the question that contribute to its clarity or lack thereof, and offer advice on how it could be reframed or detailed for better understanding and answerability.\n', output_format_instruction='The output should be a well-formatted JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output JSON schema:\n```\n{"type": "object", "properties": {"feedback": {"title": "Feedback", "type": "string"}, "verdict": {"title": "Verdict", "type": "integer"}}, "required": ["feedback", "verdict"]}\n```\n\nDo not return any preamble or explanations, return only a pure JSON string surrounded by triple backticks (```).', examples=[{'question': 'What is the discovery about space?', 'output': {'feedback': "The question is too vague and broad, asking for a 'discovery about space' without specifying any particular aspect, time frame, or context of interest. This could refer to a wide range of topics, from the discovery of new celestial bodies to advancements in space travel technology. To improve clarity and answerability, the question could specify the type of discovery (e.g., astronomical, technological), the time frame (e.g., recent, historical), or the context (e.g., within a specific research study or space mission).", 'verdict': 0}}, {'question': "How does ALMA-13B-R perform compared to other translation models in the WMT'23 study, based on the results in context1 and context2?", 'output': {'feedback': "This question asks for a comparison of the ALMA-13B-R model's performance against other translation models within the WMT'23 study, specifically referring to results in 'context1' and 'context2'. 
While it clearly specifies the model of interest (ALMA-13B-R) and the study (WMT'23), it assumes access to and understanding of 'context1' and 'context2' without explaining what these contexts entail. This makes the question unclear for those not familiar with the WMT'23 study or these specific contexts. To improve clarity and answerability for a broader audience, the question could benefit from defining or describing 'context1' and 'context2' or explaining the criteria used for comparison in these contexts.", 'verdict': 0}}, {'question': 'How do KIWI-XXL and XCOMET compare to the gold standard references in Table 1 in terms of evaluation scores, translation model performance, and success rate in surpassing the references?', 'output': {'feedback': "The question requests a comparison between KIWI-XXL and XCOMET models and gold standard references in 'Table 1', focusing on evaluation scores, translation model performance, and success rates in surpassing the references. It specifies the models and criteria for comparison, making the intent clear. However, the question assumes access to 'Table 1' without providing its content or context, making it unclear for those without direct access to the source material. To be clearer and more answerable for a general audience, the question could include a brief description of the content or key findings of 'Table 1', or alternatively, frame the question in a way that does not rely on specific, unpublished documents.", 'verdict': 0}}, {'question': 'What is the configuration of UL2 training objective in OpenMoE and why is it a better choice for pre-training?', 'output': {'feedback': 'The question asks for the configuration of the UL2 training objective within the OpenMoE framework and the rationale behind its suitability for pre-training. It is clear in specifying the topic of interest (UL2 training objective, OpenMoE) and seeks detailed information on both the configuration and the reasons for its effectiveness in pre-training. However, the question might be challenging for those unfamiliar with the specific terminology or the context of OpenMoE and UL2. For broader clarity and answerability, it would be helpful if the question included a brief explanation or context about OpenMoE and the UL2 training objective, or clarified the aspects of pre-training effectiveness it refers to (e.g., efficiency, accuracy, generalization).', 'verdict': 1}}, {'question': 'What is the detailed configuration of the UL2 training objective in OpenMoE, based on the provided context?', 'output': {'feedback': "The question seeks detailed information on the UL2 training objective's configuration within the OpenMoE framework, mentioning 'the provided context' without actually including or describing this context within the query. This makes the question unclear for those who do not have access to the unspecified context. For the question to be clear and answerable, it needs to either include the relevant context directly within the question or be framed in a way that does not require external information. Detailing the specific aspects of the configuration of interest (e.g., loss functions, data augmentation techniques) could also help clarify the query.", 'verdict': 0}}], input_keys=['question'], output_key='output', output_type='json', language='english')), question_answer_prompt=Prompt(name='answer_formulate', instruction="Answer the question using the information from the given context. 
Output verdict as '1' if answer is present '-1' if answer is not present in the context.", output_format_instruction='', examples=[{'context': 'Climate change is significantly influenced by human activities, notably the emission of greenhouse gases from burning fossil fuels. The increased greenhouse gas concentration in the atmosphere traps more heat, leading to global warming and changes in weather patterns.', 'question': 'How do human activities contribute to climate change?', 'answer': {'answer': 'Human activities contribute to climate change primarily through the emission of greenhouse gases from burning fossil fuels. These emissions increase the concentration of greenhouse gases in the atmosphere, which traps more heat and leads to global warming and altered weather patterns.', 'verdict': '1'}}, {'context': 'The concept of artificial intelligence (AI) has evolved over time, but it fundamentally refers to machines designed to mimic human cognitive functions. AI can learn, reason, perceive, and, in some instances, react like humans, making it pivotal in fields ranging from healthcare to autonomous vehicles.', 'question': 'What are the key capabilities of artificial intelligence?', 'answer': {'answer': 'Artificial intelligence is designed to mimic human cognitive functions, with key capabilities including learning, reasoning, perception, and reacting to the environment in a manner similar to humans. These capabilities make AI pivotal in various fields, including healthcare and autonomous driving.', 'verdict': '1'}}, {'context': 'The novel "Pride and Prejudice" by Jane Austen revolves around the character Elizabeth Bennet and her family. The story is set in the 19th century in rural England and deals with issues of marriage, morality, and misconceptions.', 'question': "What year was 'Pride and Prejudice' published?", 'answer': {'answer': 'The answer to given question is not present in context', 'verdict': '-1'}}], input_keys=['context', 'question'], output_key='answer', output_type='json', language='english'), find_relevant_context_prompt=Prompt(name='find_relevant_context', instruction='Given a question and set of contexts, find the most relevant contexts to answer the question.', output_format_instruction='', examples=[{'question': 'What is the capital of France?', 'contexts': ['1. France is a country in Western Europe. It has several cities, including Paris, Lyon, and Marseille. Paris is not only known for its cultural landmarks like the Eiffel Tower and the Louvre Museum but also as the administrative center.', '2. The capital of France is Paris. It is also the most populous city in France, with a population of over 2 million people. Paris is known for its cultural landmarks like the Eiffel Tower and the Louvre Museum.', '3. Paris is the capital of France. It is also the most populous city in France, with a population of over 2 million people. Paris is known for its cultural landmarks like the Eiffel Tower and the Louvre Museum.'], 'output': {'relevant_contexts': [1, 2]}}, {'question': 'How does caffeine affect the body and what are its common sources?', 'contexts': ['1. Caffeine is a central nervous system stimulant. It can temporarily ward off drowsiness and restore alertness. It primarily affects the brain, where it alters the function of neurotransmitters.', '2. Regular physical activity is essential for maintaining good health. It can help control weight, combat health conditions, boost energy, and promote better sleep.', '3. 
Common sources of caffeine include coffee, tea, cola, and energy drinks. These beverages are consumed worldwide and are known for providing a quick boost of energy.'], 'output': {'relevant_contexts': [1, 2]}}], input_keys=['question', 'contexts'], output_key='output', output_type='json', language='english'), rewrite_invalid_question_prompt=Prompt(name='rewrite_question', instruction='Given a context, question and feedback, rewrite the question to improve its clarity and answerability based on the feedback provided.', output_format_instruction='', examples=[{'context': "The Eiffel Tower was constructed using iron and was originally intended as a temporary exhibit for the 1889 World's Fair held in Paris. Despite its initial temporary purpose, the Eiffel Tower quickly became a symbol of Parisian ingenuity and an iconic landmark of the city, attracting millions of visitors each year. The tower's design, created by Gustave Eiffel, was initially met with criticism from some French artists and intellectuals, but it has since been celebrated as a masterpiece of structural engineering and architectural design.", 'question': 'Who created the design for the Tower?', 'feedback': "The question asks about the creator of the design for 'the Tower', but it does not specify which tower it refers to. There are many towers worldwide, and without specifying the exact tower, the question is unclear and unanswerable. To improve the question, it should include the name or a clear description of the specific tower in question.", 'output': 'Who created the design for the Eiffel Tower?'}, {'context': "'Exploring Zero-Shot Learning in Neural Networks' was published by Smith and Lee in 2021, focusing on the application of zero-shot learning techniques in artificial intelligence.", 'question': 'What datasets were used for the zero-shot evaluations in this study?', 'feedback': "The question asks about the datasets used for zero-shot evaluations in 'this study', without specifying or providing any details about the study in question. This makes the question unclear for those who do not have access to or knowledge of the specific study. To improve clarity and answerability, the question should specify the study it refers to, or provide enough context about the study for the question to be understood and answered independently.", 'output': 'What datasets were used for the zero-shot evaluations Exploring Zero-Shot Learning in Neural Networks paper?'}], input_keys=['context', 'question', 'feedback'], output_key='output', output_type='str', language='english'), max_tries=5, is_async=True, seed_question_prompt=Prompt(name='seed_question', instruction='Generate a question that can be fully answered from given context. The question should be formed using topic', output_format_instruction='', examples=[{'context': 'Photosynthesis in plants involves converting light energy into chemical energy, using chlorophyll and other pigments to absorb light. 
This process is crucial for plant growth and the production of oxygen.', 'keyphrase': 'Photosynthesis', 'question': 'What is the role of photosynthesis in plant growth?'}, {'context': 'The Industrial Revolution, starting in the 18th century, marked a major turning point in history as it led to the development of factories and urbanization.', 'keyphrase': 'Industrial Revolution', 'question': 'How did the Industrial Revolution mark a major turning point in history?'}, {'context': 'The process of evaporation plays a crucial role in the water cycle, converting water from liquid to vapor and allowing it to rise into the atmosphere.', 'keyphrase': 'Evaporation', 'question': 'Why is evaporation important in the water cycle?'}], input_keys=['context', 'keyphrase'], output_key='question', output_type='str', language='english'))

R-290

omkar-334 commented 6 months ago

Could the problem lie with distributions? As far as I know, the sum of all distributions must add up to 1.

JPonsa commented 6 months ago

@omkar-334, thanks for spotting that. This was a typo when copying the code into the ticket; I am passing the distributions as an argument to my script, and in my code they are correct and add up to 1. I have corrected the info in the ticket description.

ss7424Refar commented 5 months ago

Same issue. I think that if context_scoring_parser = RagasoutputParser(pydantic_object=ContextScoring) in output_parser.py returns None, it means the LLM could not generate a result for the given prompt, and that is what causes this exception.

Also, if the score is None, https://github.com/explodinggradients/ragas/blob/9bc6e6fd44180e658751e15da3a3829c957ee853/src/ragas/testset/filters.py#L60 will cause ZeroDivisionError: division by zero.

Please fix this.
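
For illustration, a minimal sketch of the suspected failure mode (hypothetical names, not the actual ragas code in filters.py):

    # Minimal sketch of the suspected failure mode (hypothetical names, not ragas's code).
    parsed = None  # what the output parser returns when the LLM response is not valid JSON

    score = parsed.dict() if parsed is not None else {}

    # Averaging the criterion scores over an empty dict divides by zero,
    # matching the ZeroDivisionError reported above.
    average = sum(score.values()) / len(score.values())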

sroecker commented 4 months ago

Same issue. I think that if context_scoring_parser = RagasoutputParser(pydantic_object=ContextScoring) in output_parser.py returns None, it means the LLM could not generate a result for the given prompt, and that is what causes this exception.

Also, if the score is None,

https://github.com/explodinggradients/ragas/blob/9bc6e6fd44180e658751e15da3a3829c957ee853/src/ragas/testset/filters.py#L60

will cause ZeroDivisionError: division by zero. Please fix this.

Exactly, this is where it fails. Models like Llama 3 or Mixtral 8x7B are able to output JSON but sometimes fail to do so with the context_scoring_prompt.

LDelPinoNT commented 4 months ago

I'm interested in getting a solution for this.

samiislam commented 4 months ago

On top of the above-mentioned problem, the RagasoutputParser delegates the JSON parsing to the underlying json.py -> parse_json_markdown from langchain, which in my case is unable to parse the JSON markdown section within the json_string. The problem is that the _json_markdown_re used matches across multiple triple backticks, so json_str = match.group(2) does not return the required context_scoring JSON object representation as a string.

It does not look like the '_json_markdown_re' defined in json.py from langchain core is correct. If I change the regex from:

_json_markdown_re = re.compile(r"```(json)?(.*)", re.DOTALL)

to

_json_markdown_re = re.compile(r"```json([\s\S]*?)```", re.DOTALL)

and use:

json_str = match.group(1)

then some processing happens, but in the end the testset returned by generator.generate_with_langchain_docs is empty.
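
For illustration, a minimal standalone snippet (using a made-up model response) shows the difference in behaviour between the two patterns:

    import re

    # Made-up model response: the scoring JSON plus trailing commentary and a second fence.
    response = (
        '```json\n{"clarity": 3, "depth": 2, "structure": 3, "relevance": 3}\n```\n'
        'Extra commentary\n```\nnot json\n```'
    )

    original_re = re.compile(r"```(json)?(.*)", re.DOTALL)
    proposed_re = re.compile(r"```json([\s\S]*?)```", re.DOTALL)

    # group(2) of the original pattern runs to the end of the string, so it still
    # contains the closing fence and the second block, and json.loads() on it fails.
    print(repr(original_re.search(response).group(2)))

    # The non-greedy pattern stops at the first closing fence and yields only the JSON object.
    print(repr(proposed_re.search(response).group(1)))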

jjmachan commented 3 months ago

@samiislam @ss7424Refar thank you so much for the suggestions for improvements. I will take this up next week and add these in to improve the JSONParser.

@LDelPinoNT @sroecker @JPonsa apologies for the delay, but I'll keep you all in the loop about this. I might need a small bit of help from your end, though: please share some examples of model outputs the parser is failing at. I'll write some tests around them too.

LDelPinoNT commented 3 months ago

On my end, the number of "Failed to parse output" warnings was reduced after upgrading to 0.1.11 (before that, I got this error for every row in my dataset). I think it is now related to LLM errors where the generated JSON sometimes isn't perfect.

jjmachan commented 3 months ago

@LDelPinoNT that is good to know, may I ask which models you are using?

matheusft commented 3 months ago

I'm getting the same error when doing

    from ragas import evaluate

    result = evaluate(
        dataset=ragas_eval_dataset,
        metrics=benchmarking_metrics_list,
        llm=critic_llm,
        embeddings=embedding_model,
        raise_exceptions=False
    )

However, I get many more “Failed to parse output. Returning None” warnings when my critic_llm is PaLM 2 (text-bison) compared to when my critic_llm is Gemini Pro (gemini-1.5-pro).

alanaziyasir commented 3 months ago

Using Llama 3.1, I found the error is related to large context size. With a small context size, it works fine and produces the metrics.
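
If context length is indeed the cause, one possible workaround (a sketch only, assuming langchain Document inputs and the generator/evolutions from the snippet above; the chunk sizes are illustrative) is to pre-split the documents into smaller chunks before generation:

    from langchain_text_splitters import TokenTextSplitter

    # Illustrative values; pick chunk_size well below the model's context window.
    splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=50)
    small_docs = splitter.split_documents(docs)

    eval_ds = generator.generate_with_langchain_docs(
        small_docs,
        test_size=50,
        distributions={simple: 0.4, reasoning: 0.4, multi_context: 0.2},
    )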

bdytx5 commented 1 month ago

any updates on this?

jjmachan commented 1 month ago

this was fixed in v0.2 - are you seeing the same, @bdytx5?

bdytx5 commented 1 month ago

yeah, with claude 3.5 sonnet

jjmachan commented 1 month ago

Is it every row or just some rows? Is it repeatedly failing for the same datapoints? Would it be possible to share the data?

I ran it locally with a dataset we have and it was working. The model I used:

from langchain_anthropic import ChatAnthropic
sonnet35 = ChatAnthropic(
    model="claude-3-5-sonnet-20240620",
    temperature=0
)
bdytx5 commented 1 month ago

Yeah, using

    metrics = [
        LLMContextRecall(),
        FactualCorrectness(),
        Faithfulness(),
        LLMContextPrecisionWithoutReference(),
        NoiseSensitivity(),
        ResponseRelevancy(),
        ContextEntityRecall(),
    ]
    results = evaluate(dataset=eval_dataset, metrics=metrics, llm=evaluator_llm, run_config=my_run_config)

with the LangchainLLMWrapper(ChatAnthropic(model="claude-3-5-sonnet-20240620")) model gives the error on some datapoints, but not all. Here's the data:
results_claude-3-5-sonnet-20240620_cohere.csv

jjmachan commented 1 month ago

@bdytx5 I checked the traces and saw that there were a few retries happening due to a max_tokens issue.

A few are due to small mistakes in the JSON that was output.

I increased the max_tokens output limit, and that actually fixed most of it:

from langchain_anthropic import ChatAnthropic
sonnet35 = ChatAnthropic(
    model="claude-3-5-sonnet-20240620",
    temperature=0,
    max_tokens=4096,  # larger completion budget so long JSON outputs are not truncated
)

So I'll take the following next steps:

  1. throw an error when the model exceeds the max_tokens limit so users understand the root cause (see the sketch below)
  2. improve the output parser - this is a bit more vague, but let me try a few straightforward experiments
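
A rough sketch of the kind of check step 1 describes (not ragas's actual implementation; it assumes langchain_anthropic surfaces Anthropic's stop_reason in response_metadata, which may vary across versions):

    from langchain_anthropic import ChatAnthropic

    llm = ChatAnthropic(model="claude-3-5-sonnet-20240620", temperature=0, max_tokens=4096)
    msg = llm.invoke("Return the scores as a JSON object.")

    # If the completion was cut off by the token budget, parsing will almost certainly
    # fail, so raise an explicit error instead of a generic "Failed to parse output".
    if msg.response_metadata.get("stop_reason") == "max_tokens":
        raise ValueError("LLM output truncated at max_tokens; increase the limit before parsing.")
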
bdytx5 commented 1 month ago

Nice! Yeah, a max_tokens error would be very helpful!

jjmachan commented 1 month ago

The original issue is now much improved with #1541. Closing this for now, but do test the latest release and see if it solves the issue.