explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

[RFC] Executor: making Ragas faster and more reliable #394

Closed jjmachan closed 3 months ago

jjmachan commented 8 months ago

Problem - ragas is slow and unreliable

  1. Ragas is not exploiting the concurrency options provided by the ThreadPoolExecutor and asyncio modules. This is because ragas took a batching approach to evaluation, i.e. it evaluated metrics in batches (see the sketch after this list).
  2. Not every service has async support - we need options to stay synchronous, with no concurrency at all.
  3. We need these primitives for #380 and potentially others as well.
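
To make the contrast concrete, here is a minimal sketch (not part of the proposal) of per-row concurrency with asyncio, where `ascore` stands in for the per-row coroutine defined later in this RFC:

import asyncio
import typing as t

async def score_all(
    rows: t.List[dict],
    ascore: t.Callable[[dict], t.Awaitable[float]],
) -> t.List[float]:
    # One coroutine per row, all scheduled at once with asyncio.gather,
    # instead of evaluating fixed-size batches sequentially.
    return await asyncio.gather(*(ascore(row) for row in rows))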

Core Components

  1. BaseMetric - a metric that evaluates a single row, with both score() and ascore()
  2. RagasLLM that is based on langchain-core LLMs
    1. Prompt object with provisions for instructions and demonstrations, which converts to the messages or prompts supported by both langchain chat-based and completion-based models
    2. LLMResult object that supports both chat and text-based outputs
  3. Executor that runs BaseMetric. It should also be able to run testset generators, so this should be a common paradigm (see the sketch after this list)
  4. new evaluate() function that makes it easier to
    1. change the llm and embeddings - this is the new method where BaseMetric by default will have llm=None and will take the default llm from the evaluate() function. If metric.llm != None, then the LLM provided on the metric is used
    2. switch between async and threading
    3. support callbacks throughout
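
As a rough illustration of component 3, a minimal Executor sketch; the `submit`/`results` names are assumptions for this RFC discussion, not a final API:

import asyncio
import typing as t
from dataclasses import dataclass, field

@dataclass
class Executor:
    # Collects async jobs (metric scoring, testset generation, ...)
    # and runs them all concurrently in one event loop.
    jobs: t.List[t.Coroutine] = field(default_factory=list)
    raise_exceptions: bool = True

    def submit(self, coro: t.Coroutine) -> None:
        self.jobs.append(coro)

    def results(self) -> t.List[t.Any]:
        async def _run():
            return await asyncio.gather(
                *self.jobs, return_exceptions=not self.raise_exceptions
            )
        return asyncio.run(_run())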

Base classes

Metric

import typing as t
from abc import ABC, abstractmethod

from langchain_core.callbacks import Callbacks

class Metric(ABC):
    @abstractmethod
    def score(
        self,
        row: t.Dict,  # just 1 row
        callbacks: t.Optional[Callbacks] = None,
    ) -> float:
        ...

    @abstractmethod
    async def ascore(
        self,
        row: t.Dict,  # just 1 row
        callbacks: t.Optional[Callbacks] = None,
    ) -> float:
        ...
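
For illustration, a hypothetical concrete metric; the `answer` and `ground_truth` column names are assumptions, not part of this RFC:

class ExactMatch(Metric):
    # Hypothetical metric: 1.0 if the answer exactly matches the ground truth.
    def score(self, row, callbacks=None) -> float:
        return float(row["answer"].strip() == row["ground_truth"].strip())

    async def ascore(self, row, callbacks=None) -> float:
        return self.score(row, callbacks)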

evaluate()

def evaluate(
    dataset: Dataset,
    metrics: list[Metric] | None = None,
    llm: t.Optional[BaseRagasLLM] = None,
    embeddings: t.Optional[RagasEmbeddings] = None,
    callbacks: Callbacks = [],
    is_async: bool = True,
    max_workers: t.Optional[int] = None,
    raise_exceptions: bool = True,
    column_map: t.Dict[str, str] = {},
) -> Result:
    ...
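
A hypothetical call, assuming a HuggingFace Dataset with the expected columns; `faithfulness` and `my_llm` stand in for a metric instance and a BaseRagasLLM implementation:

result = evaluate(
    dataset,
    metrics=[faithfulness],  # faithfulness.llm is None, so...
    llm=my_llm,              # ...it inherits this default LLM
    is_async=True,           # use asyncio; False falls back to threads
    raise_exceptions=False,  # don't propagate per-row failures
)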

BaseRagasLLM

import typing as t
from abc import ABC, abstractmethod
from dataclasses import dataclass

from langchain_core.callbacks import Callbacks
from langchain_core.outputs import LLMResult

# `Prompt` is the Ragas prompt object described in Core Components above.

@dataclass
class BaseRagasLLM(ABC):
    @abstractmethod
    def generate_text(
        self,
        prompt: Prompt,
        n: int = 1,
        temperature: float = 1e-8,
        stop: t.Optional[t.List[str]] = None,
        callbacks: t.Optional[Callbacks] = None,
    ) -> LLMResult:
        ...

    @abstractmethod
    async def agenerate_text(
        self,
        prompt: Prompt,
        n: int = 1,
        temperature: float = 1e-8,
        stop: t.Optional[t.List[str]] = None,
        callbacks: t.Optional[Callbacks] = None,
    ) -> LLMResult:
        ...
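
One possible concrete implementation (a sketch, not the final design) wraps a langchain-core chat model; `Prompt.to_messages()` is a hypothetical helper, and temperature is assumed to be configured on the wrapped model:

from langchain_core.language_models import BaseChatModel

@dataclass
class LangchainLLMWrapper(BaseRagasLLM):
    # Delegate to the wrapped chat model's (a)generate methods; both
    # return a langchain LLMResult, matching the signatures above.
    langchain_llm: BaseChatModel

    def generate_text(
        self, prompt, n=1, temperature=1e-8, stop=None, callbacks=None
    ) -> LLMResult:
        return self.langchain_llm.generate(
            [prompt.to_messages()] * n, stop=stop, callbacks=callbacks
        )

    async def agenerate_text(
        self, prompt, n=1, temperature=1e-8, stop=None, callbacks=None
    ) -> LLMResult:
        return await self.langchain_llm.agenerate(
            [prompt.to_messages()] * n, stop=stop, callbacks=callbacks
        )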
jjmachan commented 8 months ago

List of issues this will address:

- Make embeddings faster

iterakhtaras commented 5 months ago

Hey @jjmachan! Thanks for all your work on ragas, I really appreciate it. I am trying to use it to evaluate my chatbot created with llama-index. Have any workarounds been discovered for issue #271?

These are my dependencies:

%pip install ragas==0.0.22
%pip install pypdf
%pip install llama-index==0.8.52
%pip install langchain==0.0.331rc3
%pip install openai==0.28.1