confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/

SummarizationMetric changing score and reasoning methodology at random #937

Open VVinayak opened 1 month ago

VVinayak commented 1 month ago

Describe the bug

  1. Summarization scores change depending on:
    • the type of assessment questions being asked
    • the number of assessment questions provided
    • whether the developer provides the questions or they are generated by the evaluator LLM
  2. The scores stay constant as long as none of the above parameters change, which suggests some kind of unintended caching of LLM responses (see the sketch after this list), despite what is stated at the top of the DeepEval Summarization page: https://docs.confident-ai.com/docs/metrics-summarization
  3. The reasoning behind the scores is inconsistent: the metric sometimes assigns a score of 0 when there are no contradictions, unclear answers, or hallucinations, whereas the documentation implies a score of 1 (or at least a value greater than 0) in such cases
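
To separate genuine LLM non-determinism from the suspected caching, a minimal sketch along the lines below can be used. It assumes the same document and summary strings as in the reproduction further down and only uses the SummarizationMetric / LLMTestCase API shown there. If a second, freshly constructed metric reproduces the first score exactly on every rerun, that points at caching rather than ordinary judge-model variance.

from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase

# Placeholders: substitute the full document/summary strings from the
# reproduction below.
document = "The 'coverage score' is calculated as the percentage of ..."
summary = "The coverage score quantifies how well a summary captures ..."

test_case = LLMTestCase(input=document, actual_output=summary)

questions = [
    "Is the coverage score based on a percentage of 'yes' answers?",
    "Does a higher score mean a more comprehensive summary?",
]

# Two independent metric instances with identical configuration.
# Identical scores and reasons on every rerun would suggest cached results;
# some spread is expected from an LLM-as-judge metric.
for run in range(2):
    metric = SummarizationMetric(
        threshold=0.5,
        model="gpt-4",
        assessment_questions=questions,
    )
    metric.measure(test_case)
    print(run, metric.score, metric.reason)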

To Reproduce

Steps to reproduce the behavior:

from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase

# This is the original text to be summarized
input = """
The 'coverage score' is calculated as the percentage of assessment questions
for which both the summary and the original document provide a 'yes' answer. This
method ensures that the summary not only includes key information from the original
text but also accurately represents it. A higher coverage score indicates a
more comprehensive and faithful summary, signifying that the summary effectively
encapsulates the crucial points and details from the original content.
"""

# This is the summary, replace this with the actual output from your LLM application
actual_output="""
The coverage score quantifies how well a summary captures and
accurately represents key information from the original text,
with a higher score indicating greater comprehensiveness.
"""

assessment_questions_3 = [
    "Is the coverage score based on a percentage of 'yes' answers?",
    "Does the score ensure the summary's accuracy with the source?",
    "Does a higher score mean a more comprehensive summary?"
]

assessment_questions_5 = [
    "Is the coverage score calculated as a percentage of assessment questions?",
    "Does the coverage score reflect how well a summary represents the original text?",
    "Does a higher coverage score indicate a more comprehensive summary?",
    "Does the coverage score quantify how poorly a summary captures key information?",
    "Does a higher coverage score signify that a summary effectively encapsulates crucial points?"
]

# Case 1: 3 assessment questions provided by the developer
test_case_3questions = LLMTestCase(input=input, actual_output=actual_output)
metric_3q = SummarizationMetric(
    threshold=0.5,
    model="gpt-4",
    assessment_questions=assessment_questions_3
)

metric_3q.measure(test_case_3questions)
print(metric_3q.score)
print(metric_3q.reason)

# Case 2: 5 assessment questions provided by the developer
test_case_5questions = LLMTestCase(input=input, actual_output=actual_output)
metric_5q = SummarizationMetric(
    threshold=0.5,
    model="gpt-4",
    assessment_questions=assessment_questions_5
)

metric_5q.measure(test_case_5questions)
print(metric_5q.score)
print(metric_5q.reason)

# Case 3: 5 assessment questions generated by the evaluator LLM
test_case_5own = LLMTestCase(input=input, actual_output=actual_output)
metric_5ownq = SummarizationMetric(
    threshold=0.5,
    model="gpt-4",
    n=5
)

metric_5ownq.measure(test_case_5own)
print(metric_5ownq.score)
print(metric_5ownq.reason)

# Case 4: 3 assessment questions generated by the evaluator LLM
test_case_3own = LLMTestCase(input=input, actual_output=actual_output)
metric_3ownq = SummarizationMetric(
    threshold=0.5,
    model="gpt-4",
    n=3
)

metric_3ownq.measure(test_case_3own)
print(metric_3ownq.score)
print(metric_3ownq.reason)

Expected behavior

For a fixed input, summary, and metric configuration, the score and reason should be stable across runs (or vary only within normal LLM variance), and they should follow the documented coverage methodology, e.g. a score of 1 rather than 0 when no contradictions, unclear answers, or hallucinations are reported.
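
For reference, the coverage calculation quoted from the documentation in the 'input' text above (the percentage of assessment questions to which both the summary and the original document answer 'yes') reduces to a simple fraction. The helper below is only an illustration of that expectation, not deepeval's internal implementation:

from typing import List, Tuple

def expected_coverage_score(answers: List[Tuple[bool, bool]]) -> float:
    # answers[i] = (summary answers 'yes', original document answers 'yes')
    # for assessment question i, per the docs excerpt quoted in 'input' above.
    if not answers:
        return 0.0
    both_yes = sum(1 for summary_yes, source_yes in answers if summary_yes and source_yes)
    return both_yes / len(answers)

# Three questions, all answered 'yes' by both texts -> 1.0, which is why a
# returned score of 0 with no reported issues looks inconsistent.
print(expected_coverage_score([(True, True), (True, True), (True, True)]))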

Screenshots

(Screenshots omitted; their captions were:)

  • 0.67 is "bad"
  • 0.00 is "good"
  • 5 questions
  • 3 questions
  • Questions are provided by me
  • LLM produces assessment questions


sam-fletcher commented 3 weeks ago

I've encountered the same behaviour. The scores often contradict the explanation given. The feature is effectively unusable at the moment.