Open Rugved2204 opened 1 month ago
To resolve the "ValueError: a cannot be empty unless no samples are taken" error when generating a synthetic testset with the Ragas library, ensure that the test_size parameter is greater than zero and that the docstore contains enough documents to sample from. The message itself is the one NumPy's random choice raises when it is asked to draw samples from an empty array, so the error typically means the list of documents to sample from is empty.
Here is the relevant part of the code where the error might be occurring:
current_nodes = [
    CurrentNodes(root_node=n, nodes=[n])
    for n in self.docstore.get_random_nodes(k=test_size)
]
Make sure that self.docstore.get_random_nodes(k=test_size) returns a non-empty list. If test_size is greater than the number of available documents, this could lead to the error.
Here is a potential fix to ensure that test_size is valid and that there are enough documents in the docstore:
def generate(
    self,
    test_size: int,
    distributions: t.Optional[Distributions] = None,
    with_debugging_logs=False,
    is_async: bool = True,
    raise_exceptions: bool = True,
    run_config: t.Optional[RunConfig] = None,
):
    distributions = distributions or DEFAULT_DISTRIBUTION
    # validate distributions
    if not check_if_sum_is_close(list(distributions.values()), 1.0, 3):
        raise ValueError(
            f"distributions passed do not sum to 1.0 [got {sum(list(distributions.values()))}]. Please check the "
            f"distributions."
        )
    # configure run_config for docstore
    if run_config is None:
        run_config = RunConfig(max_retries=15, max_wait=90)
    self.docstore.set_run_config(run_config)
    # init filters and evolutions
    for evolution in distributions:
        self.init_evolution(evolution)
        evolution.init(is_async=is_async, run_config=run_config)
    if with_debugging_logs:
        from ragas.utils import patch_logger
        patch_logger("ragas.testset.evolutions", logging.DEBUG)
        patch_logger("ragas.testset.extractor", logging.DEBUG)
        patch_logger("ragas.testset.filters", logging.DEBUG)
        patch_logger("ragas.testset.docstore", logging.DEBUG)
        patch_logger("ragas.llms.prompt", logging.DEBUG)
    exec = Executor(
        desc="Generating",
        keep_progress_bar=True,
        raise_exceptions=raise_exceptions,
        run_config=run_config,
    )
    available_nodes = self.docstore.get_random_nodes(k=test_size)
    if not available_nodes:
        raise ValueError("No documents available in the docstore to sample from.")
    current_nodes = [
        CurrentNodes(root_node=n, nodes=[n])
        for n in available_nodes
    ]
    total_evolutions = 0
    for evolution, probability in distributions.items():
        for i in sample(range(test_size), round(probability * test_size)):
            exec.submit(
                evolution.evolve,
                current_nodes[i],
                name=f"{evolution.__class__.__name__}-{i}",
            )
            total_evolutions += 1
    if total_evolutions <= test_size:
        filler_evolutions = choices(
            list(distributions), k=test_size - total_evolutions
        )
        for evolution in filler_evolutions:
            exec.submit(
                evolution.evolve,
                current_nodes[total_evolutions],
                name=f"{evolution.__class__.__name__}-{total_evolutions}",
            )
            total_evolutions += 1
    try:
        test_data_rows = exec.results()
        if not test_data_rows:
            raise ExceptionInRunner()
    except ValueError as e:
        raise e
    # make sure to ignore any NaNs that might have been returned
    # due to failed evolutions. MaxRetriesExceeded is a common reason
    test_data_rows = [r for r in test_data_rows if not is_nan(r)]
    test_dataset = TestDataset(test_data=test_data_rows)
    evol_lang = [get_feature_language(e) for e in distributions]
    evol_lang = [e for e in evol_lang if e is not None]
    track(
        TestsetGenerationEvent(
            event_type="testset_generation",
            evolution_names=[e.__class__.__name__.lower() for e in distributions],
            evolution_percentages=[distributions[e] for e in distributions],
            num_rows=len(test_dataset.test_data),
            language=evol_lang[0] if len(evol_lang) > 0 else "",
        )
    )
    return test_dataset
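This guard raises a clear error as soon as the docstore returns no nodes, before any evolution is submitted [1]. Note that it only checks for an empty result; it does not verify that at least test_size nodes came back.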
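The preconditions discussed above (a positive test_size, a non-empty set of nodes, and distributions that sum to 1.0) can also be checked before calling the generator at all. The following is a minimal sketch under stated assumptions: check_generation_inputs is a hypothetical helper written for illustration and is not part of Ragas, and the 1e-3 tolerance is an arbitrary choice.

```python
import math
from typing import Mapping, Sequence


def check_generation_inputs(
    test_size: int,
    nodes: Sequence[object],
    distributions: Mapping[object, float],
) -> None:
    """Raise ValueError early if the inputs cannot produce a testset.

    Hypothetical helper for illustration; it is not part of Ragas.
    """
    if test_size <= 0:
        raise ValueError(f"test_size must be positive, got {test_size}")
    if len(nodes) == 0:
        raise ValueError("docstore is empty: no nodes to sample from")
    total = sum(distributions.values())
    # Tolerance is an assumption; pick whatever suits your distributions.
    if not math.isclose(total, 1.0, abs_tol=1e-3):
        raise ValueError(f"distributions must sum to 1.0, got {total}")
```

Calling it with the docstore's node list right after loading documents would surface an empty docstore with a readable message instead of the NumPy one.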
To continue talking to Dosu, mention @dosu.
@dosu
Traceback (most recent call last):
File "/home/ubuntu/scp-analyzer/rag-old/test_ragas.py", line 31, in
The error still persists even after changing the code.
The error persists because the self.docstore.get_random_nodes(k=test_size) method is still returning an empty list. This indicates that there are no nodes available in the docstore to sample from. Here are a few steps to diagnose and resolve the issue:
1. Check document loading: ensure that the PubMedLoader is correctly loading documents and that these documents are being added to the docstore.
2. Verify node addition: confirm that the nodes are being correctly added to the docstore. You can add some debug prints to check the number of nodes in the docstore after loading documents.
3. Inspect the get_random_nodes method: review it to ensure it is functioning correctly and that the probability distribution is valid.
Here is a modified version of your script with additional debug prints to help diagnose the issue:
# Imports are assumed; adjust them to your installed LangChain and Ragas versions.
from langchain_community.document_loaders import PubMedLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.testset.generator import TestsetGenerator

loader = PubMedLoader("liver", load_max_docs=10)
documents = loader.load()
print(f"Loaded {len(documents)} documents")
embedding_model_name = 'BAAI/bge-small-en-v1.5'
embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs={'device': 'cuda:0'}
)
model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
# load_model is your own helper from the original script (not shown in this thread)
response_generation_llm = load_model(model_name, 0.1, 1024)
generator = TestsetGenerator.from_langchain(
    response_generation_llm,
    response_generation_llm,
    embeddings
)
# Add documents to the docstore
generator.docstore.add_documents(documents)
print(f"Docstore contains {len(generator.docstore.nodes)} nodes")
# Attempt to generate the testset
try:
    testset = generator.generate_with_langchain_docs(
        documents,
        test_size=2,
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    )
    print("Testset generated successfully")
except ValueError as e:
    print(f"Error: {e}")
This script will print the number of documents loaded and the number of nodes in the docstore. If the number of nodes is zero, it indicates that the documents are not being correctly added to the docstore.
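For the test_size=2 call in the script above, it may also help to see how the sampling loop in generate splits rows across evolutions. The sketch below reproduces only the round(probability * test_size) arithmetic from that loop; planned_counts is a hypothetical name, and the string keys stand in for the evolution objects.

```python
def planned_counts(test_size: int, distributions: dict) -> dict:
    """Rows assigned to each evolution before any filler rows are drawn."""
    return {name: round(p * test_size) for name, p in distributions.items()}


counts = planned_counts(
    2, {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}
)
# Python rounds 0.5 to the nearest even integer, so the two 0.25 shares get 0 rows.
filler = 2 - sum(counts.values())
print(counts, filler)  # {'simple': 1, 'reasoning': 0, 'multi_context': 0} 1
```

With such a small test_size, one of the two rows is therefore assigned by the filler step rather than by the distribution itself. Either way, both loops index into current_nodes, so they need the docstore to have returned nodes first.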
Additionally, ensure that the get_random_nodes method in the docstore class is correctly implemented and that the probability distribution is valid. Here is a snippet of the get_random_nodes method for reference:
def get_random_nodes(self, k=1, alpha=0.1) -> t.List[Node]:
    def adjustment_factor(wins, alpha):
        return np.exp(-alpha * wins)

    scores = [adjustment_factor(node.wins, alpha) for node in self.nodes]
    similarity_scores = [node.doc_similarity for node in self.nodes]
    prob = np.array(scores) * np.array(similarity_scores)
    prob = prob / np.sum(prob)
    if len(self.nodes) == 0 or np.sum(prob) == 0:
        return []
    nodes = rng.choice(np.array(self.nodes), size=k, p=prob).tolist()
    for node in nodes:
        idx = self.nodes.index(node)
        self.nodes[idx].wins += 1
    return nodes
Ensure that the self.nodes list is not empty and that the probability distribution prob is valid. If self.nodes is empty or np.sum(prob) is zero, the method will return an empty list, leading to the error [1].
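The weighting used by get_random_nodes (an exponential penalty on wins multiplied by the document similarity) can be tried in isolation. The following is a rough standard-library sketch, not the Ragas implementation: ToyNode and sample_nodes are hypothetical names, random.choices stands in for NumPy's choice, and the empty and zero-weight checks are done before any normalization so that no division by zero can occur.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyNode:
    """Stand-in for a docstore node; only the fields the sampler reads."""
    name: str
    wins: int = 0
    doc_similarity: float = 1.0


def sample_nodes(
    nodes: List[ToyNode],
    k: int = 1,
    alpha: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[ToyNode]:
    """Draw k nodes with replacement, favouring rarely picked, similar nodes."""
    rng = rng or random.Random()
    if not nodes:
        return []  # nothing to sample from
    weights = [math.exp(-alpha * n.wins) * n.doc_similarity for n in nodes]
    if sum(weights) <= 0:
        return []  # every node has zero weight
    picked = rng.choices(nodes, weights=weights, k=k)
    for n in picked:
        n.wins += 1  # make this node less likely next time
    return picked
```

Running it on an empty list returns an empty list, which mirrors the situation in this thread: the sampler has nothing to work with because no nodes were ever added.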
@dosu Yes, the documents are not getting added to the docstore, even though I am getting "Loaded 10 documents". Any specific reason for that?
A likely reason the documents are not reaching the docstore is that the add_nodes method is not being called correctly, or that the nodes are not properly initialized before being added. Additionally, ensure that the embeddings and extractor are set before adding nodes.
Here's an example of how to correctly add nodes to the InMemoryDocumentStore:
# Imports are assumed; adjust them to your installed LangChain and Ragas versions.
from langchain.text_splitter import TokenTextSplitter
from langchain_community.document_loaders import PubMedLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from ragas.testset.docstore import InMemoryDocumentStore
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.testset.extractor import KeyphraseExtractor
from ragas.testset.generator import TestsetGenerator

# Initialize components
embedding_model_name = 'BAAI/bge-small-en-v1.5'
embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs={'device': 'cuda:0'}
)
model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
# load_model is your own helper from the original script (not shown in this thread)
response_generation_llm = load_model(model_name, 0.1, 1024)
splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)
keyphrase_extractor = KeyphraseExtractor(llm=response_generation_llm)
# Initialize the DocumentStore
docstore = InMemoryDocumentStore(
    splitter=splitter,
    embeddings=embeddings,
    extractor=keyphrase_extractor,
    llm=response_generation_llm
)
# Load documents
loader = PubMedLoader("liver", load_max_docs=10)
documents = loader.load()
print(f"Loaded {len(documents)} documents")
# Add documents to the docstore
docstore.add_documents(documents)
print(f"Docstore contains {len(docstore.nodes)} nodes")
# Attempt to generate the testset
generator = TestsetGenerator.from_langchain(
    response_generation_llm,
    response_generation_llm,
    embeddings,
    docstore=docstore
)
try:
    # The distribution keys are the evolution objects, not their names as strings
    testset = generator.generate_with_langchain_docs(
        documents,
        test_size=2,
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    )
    print("Testset generated successfully")
except ValueError as e:
    print(f"Error: {e}")
Ensure that the add_nodes method is called with properly initialized nodes and that the embeddings and extractor are set before adding nodes. This should ensure that the documents are correctly added to the docstore [1][2].
I am using my own LLM and embeddings from Hugging Face, so can you help me make them subclasses of BaseRagasLLM and BaseRagasEmbeddings?
To integrate your custom LLM and embeddings from Hugging Face by subclassing BaseRagasLLM and BaseRagasEmbeddings, you can follow these examples:
BaseRagasLLM
import typing as t

import torch
from ragas.llms.base import BaseRagasLLM  # Adjust the import based on actual location
from transformers import AutoModelForCausalLM, AutoTokenizer


class CustomHuggingfaceLLM(BaseRagasLLM):
    # Check which abstract methods BaseRagasLLM declares in your installed
    # Ragas version; a plain generate() may not be enough to satisfy them.
    def __init__(self, model_name: str, cache_folder: t.Optional[str] = None, model_kwargs: t.Optional[t.Dict[str, t.Any]] = None):
        self.model_name = model_name
        self.cache_folder = cache_folder
        self.model_kwargs = model_kwargs if model_kwargs else {}
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name, cache_dir=self.cache_folder)
        self.model = AutoModelForCausalLM.from_pretrained(self.model_name, cache_dir=self.cache_folder, **self.model_kwargs)

    def generate(self, prompt: str, max_length: int = 50) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt")
        # Inference only, so gradients are not needed
        with torch.no_grad():
            outputs = self.model.generate(inputs["input_ids"], max_length=max_length)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
BaseRagasEmbeddings
import typing as t
from typing import List

import torch
from ragas.embeddings.base import BaseRagasEmbeddings  # Adjust the import based on actual location
from transformers import AutoModel, AutoTokenizer


class CustomHuggingfaceEmbeddings(BaseRagasEmbeddings):
    def __init__(self, model_name: str, cache_folder: t.Optional[str] = None, model_kwargs: t.Optional[t.Dict[str, t.Any]] = None):
        self.model_name = model_name
        self.cache_folder = cache_folder
        self.model_kwargs = model_kwargs if model_kwargs else {}
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name, cache_dir=self.cache_folder)
        self.model = AutoModel.from_pretrained(self.model_name, cache_dir=self.cache_folder, **self.model_kwargs)

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        inputs = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Mean over all token positions (padding positions are included)
        embeddings = outputs.last_hidden_state.mean(dim=1).tolist()
        return embeddings
These examples show how to subclass BaseRagasLLM and BaseRagasEmbeddings to integrate custom models from Hugging Face. Treat them as starting points: adjust the imports and methods based on your requirements and on the abstract methods that BaseRagasLLM and BaseRagasEmbeddings actually declare in your installed Ragas version [1][2].
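One detail of the embeddings example worth checking is the pooling step. Averaging last_hidden_state over every position also averages in the padding tokens that padding=True adds to the shorter texts in a batch. A common alternative is to average only the positions where the attention mask is 1. The sketch below shows that idea in plain Python, with lists standing in for tensors; masked_mean_pool is a hypothetical name, and the same logic would need to be expressed with tensor operations inside embed_documents.

```python
from typing import List


def masked_mean_pool(
    token_vectors: List[List[float]], attention_mask: List[int]
) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Two real tokens followed by one padding position that should be ignored
pooled = masked_mean_pool(
    [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]],
    [1, 1, 0],
)
print(pooled)  # [2.0, 4.0]
```

For a single text with no padding the two approaches give the same result, so the difference only shows up when texts of different lengths are embedded together.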
@Rugved2204 were you able to solve this?
[ ] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug: ValueError: a cannot be empty unless no samples are taken
Ragas version: 0.1.10. Python version: 3.10.12.
Code to Reproduce: Share code to reproduce the issue
Error trace:
Expected behavior: It should have generated the synthetic testset. The nodes are getting embedded, but an error is raised while generating.