explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

I am getting error while synthetic test data generation #1332

Open wanjeakshay opened 1 month ago

wanjeakshay commented 1 month ago

[ ] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug

I am getting an error during synthetic test data generation using Hugging Face models.

Ragas version: 0.1.20
Python version: 3.10.12

Code to Reproduce

from ragas.run_config import RunConfig

# Configure the generator with custom parameters to avoid timeouts
run_config = RunConfig(
    max_wait=60,
    max_retries=1000,
    max_workers=100,  # Adjust the number of workers
    timeout=6000,
    # rate_limits={"requests_per_minute": 60},  # Set rate limits
    # timeouts={"connect_timeout": 10, "read_timeout": 30}  # Set timeouts
)

generator = TestsetGenerator.from_langchain(
    generator_llm=llm_model,
    critic_llm=llm_model,
    embeddings=embeddings,
    run_config=run_config,
)

# Generate the test set
testset = generator.generate_with_langchain_docs(
    documents=documents,
    test_size=5,  # Example test size
    # distributions={simple: 0.5, reasoning: 0.3, multi_context: 0.2},
    with_debugging_logs=False,
    is_async=True,
    raise_exceptions=False,  # Set to False to avoid ExceptionInRunner error
    run_config=run_config,
)

Error trace

WARNING:ragas.testset.docstore:Filename and doc_id are the same for all nodes.

ValueError                                Traceback (most recent call last)
in <cell line: 6>()
      4
      5 # Generate the test set
----> 6 testset1 = generator.generate_with_langchain_docs(
      7     documents=documents,
      8     test_size=5,  # Example test size

3 frames
/usr/local/lib/python3.10/dist-packages/ragas/testset/generator.py in generate_with_langchain_docs(self, documents, test_size, distributions, with_debugging_logs, is_async, raise_exceptions, run_config)
    208         )
    209
--> 210         return self.generate(
    211             test_size=test_size,
    212             distributions=distributions,

/usr/local/lib/python3.10/dist-packages/ragas/_analytics.py in wrapper(*args, **kwargs)
    127     def wrapper(*args: P.args, **kwargs: P.kwargs) -> t.Any:
    128         track(IsCompleteEvent(event_type=func.__name__, is_completed=False))
--> 129         result = func(*args, **kwargs)
    130         track(IsCompleteEvent(event_type=func.__name__, is_completed=True))
    131

/usr/local/lib/python3.10/dist-packages/ragas/testset/generator.py in generate(self, test_size, distributions, with_debugging_logs, is_async, raise_exceptions, run_config)
    278         current_nodes = [
    279             CurrentNodes(root_node=n, nodes=[n])
--> 280             for n in self.docstore.get_random_nodes(k=test_size)
    281         ]
    282         total_evolutions = 0

/usr/local/lib/python3.10/dist-packages/ragas/testset/docstore.py in get_random_nodes(self, k, alpha)
    325         prob = prob / np.sum(prob)
    326
--> 327         nodes = self.run_config.rng.choice(
    328             np.array(self.nodes), size=k, p=prob
    329         ).tolist()

numpy/random/_generator.pyx in numpy.random._generator.Generator.choice()

ValueError: a cannot be empty unless no samples are taken

Expected behavior

This should produce question and answer pairs.

Additional context

Using open-source models:

embeddings = HuggingFaceEmbeddings(model_name="sangmini/msmarco-cotmae-MiniLM-L12_en-ko-ja")

repo_id = "mistralai/Mistral-7B-Instruct-v0.2"
repo_id = "mistralai/Mistral-Nemo-Instruct-2407"

llm_model = HuggingFaceEndpoint(
    repo_id=repo_id,
    max_length=128,
    temperature=0.5,
    huggingfacehub_api_token=hf_api_key,
)
llm_chain = prompt | llm_model
print(llm_chain.invoke({"question": question}))

dosubot[bot] commented 1 month ago

I found a similar issue that might help you resolve the ValueError: "a cannot be empty unless no samples are taken" error. This error typically occurs when the documents list is empty. Ensure that the documents list is not empty before calling the generate_with_langchain_docs method.

Here is an example of how to check if the documents list is not empty:

# Ensure documents are loaded
if not documents:
    raise ValueError("The documents list cannot be empty.")

# Generate the test set
testset = generator.generate_with_langchain_docs(
    documents=documents,
    test_size=5,  # Example test size
    # distributions={simple: 0.5, reasoning: 0.3, multi_context: 0.2},
    with_debugging_logs=False,
    is_async=True,
    raise_exceptions=False,  # Set to False to avoid ExceptionInRunner error
    run_config=run_config
)

By checking that the documents list is not empty before calling the method, you can avoid the "ValueError: a cannot be empty unless no samples are taken" error [1].

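Additionally, ensure that your documents variable is correctly populated with valid data. Here is an example of loading documents with LlamaIndex; note that documents loaded this way are LlamaIndex documents, so they would be passed to generate_with_llamaindex_docs rather than generate_with_langchain_docs: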

from llama_index import SimpleDirectoryReader

# Load documents
dir_path = "./path-to-your-documents"
reader = SimpleDirectoryReader(dir_path, num_files_limit=2)
documents = reader.load_data()

# Check if documents are loaded
if not documents:
    raise ValueError("No documents found. Ensure the directory contains valid files.")
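
Beyond checking that the list is non-empty, it can help to confirm that each document actually carries text. A minimal sketch of such a pre-flight check, assuming the documents expose their text as `page_content` the way the LangChain `Document` in this thread does; `summarize_documents` and `find_blank_documents` are hypothetical helper names, not part of Ragas:

```python
from types import SimpleNamespace


def summarize_documents(documents):
    """Return (index, character_count) for each document; hypothetical helper."""
    summary = []
    for index, doc in enumerate(documents):
        # LangChain documents expose their text as `page_content`.
        text = getattr(doc, "page_content", "") or ""
        summary.append((index, len(text.strip())))
    return summary


def find_blank_documents(documents):
    """Indices of documents whose text is missing or whitespace only."""
    return [index for index, length in summarize_documents(documents) if length == 0]


# Stand-in objects so the sketch runs without LangChain installed.
docs = [
    SimpleNamespace(page_content="India is a developing country."),
    SimpleNamespace(page_content="   "),
]
print(summarize_documents(docs))
print(find_blank_documents(docs))
```

Any index reported by `find_blank_documents` points at a document that contributes no text to the test set generation.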

This should help you resolve the issue and generate the synthetic test data successfully.

To continue talking to Dosu, mention @dosu.

dosubot[bot] commented 1 month ago

It looks like you're encountering a ValueError: a cannot be empty unless no samples are taken error when generating synthetic test data using Hugging Face models with Ragas. This error typically occurs when the list of documents is empty, and the method attempts to sample from it.
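
The message itself comes from NumPy rather than Ragas. A minimal sketch, independent of Ragas and assuming only that NumPy is installed, shows that `Generator.choice` raises it whenever the population array is empty and a nonzero number of samples is requested; the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# An empty population stands in for a docstore that holds no nodes.
empty_nodes = np.array([])

try:
    rng.choice(empty_nodes, size=5)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

In the trace, the array passed to `choice` is built from `self.nodes` in the docstore, so the error says that the node list was empty at sampling time, whatever the length of the `documents` list passed in.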

To resolve this issue, ensure that the documents list is not empty before calling the generate_with_langchain_docs method. Here is an example of how to do this:

from ragas.run_config import RunConfig
from ragas.testset.generator import TestsetGenerator

# Configure the generator with custom parameters to avoid timeouts
run_config = RunConfig(
    max_wait=60,
    max_retries=1000,
    max_workers=100,  # Adjust the number of workers
    timeout=6000,
    # rate_limits={"requests_per_minute": 60},  # Set rate limits
    # timeouts={"connect_timeout": 10, "read_timeout": 30}  # Set timeouts
)

generator = TestsetGenerator.from_langchain(
    generator_llm=llm_model, critic_llm=llm_model, embeddings=embeddings, run_config=run_config
)

# Ensure documents are not empty
if not documents:
    raise ValueError("The documents list cannot be empty.")

# Generate the test set
testset = generator.generate_with_langchain_docs(
    documents=documents,
    test_size=5,  # Example test size
    # distributions={simple: 0.5, reasoning: 0.3, multi_context: 0.2},
    with_debugging_logs=False,
    is_async=True,
    raise_exceptions=False,  # Set to False to avoid ExceptionInRunner error
    run_config=run_config
)

By checking that the documents list is not empty before calling the method, you can avoid the "ValueError: a cannot be empty unless no samples are taken" error [1][2].


wanjeakshay commented 1 month ago

@dosu The documents list is not empty; it has 3 pages of data, but I am still getting the same error.

dosubot[bot] commented 1 month ago

The error ValueError: a cannot be empty unless no samples are taken can still occur even if the documents list is not empty due to specific conditions in the TestsetGenerator class. Here are some potential issues to check:

  1. Node Embeddings: Ensure that all nodes have valid embeddings. If any node's embedding is None, it will cause an assertion error.

    for node in self.nodes:
        assert node.embedding is not None, "Embedding cannot be None"

  2. Document Similarity Calculation: Verify that the document similarity scores are correctly calculated and do not contain NaN or infinite values.

    prob = prob / np.sum(prob)

  3. Run Configuration: Ensure that the run_config.rng (random number generator) is properly set.

    nodes = self.run_config.rng.choice(
        np.array(self.nodes), size=k, p=prob
    ).tolist()

  4. Node Similarity Scores: Check that similarity scores are properly assigned to nodes.

    similarity_scores = [node.doc_similarity for node in self.nodes]

These conditions can cause the get_random_nodes method to fail even when the documents list is not empty [1].
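
The normalization and sampling steps in items 2 and 3 can be exercised outside Ragas. A minimal sketch, assuming NumPy is installed: it normalizes a list of similarity scores the way `prob / np.sum(prob)` does, but guards against an empty list and against scores that cannot be normalized (with a zero sum, the plain division yields NaN). `safe_probabilities` is a hypothetical helper, not a Ragas function, and the uniform fallback is an illustrative choice:

```python
import numpy as np


def safe_probabilities(scores):
    """Normalize scores into sampling probabilities; hypothetical helper.

    Falls back to a uniform distribution when the scores cannot be
    normalized (non-finite values, negative values, or a zero sum).
    """
    scores = np.asarray(scores, dtype=float)
    if scores.size == 0:
        raise ValueError("No scores to sample from; the node list is empty.")
    total = scores.sum()
    if not np.all(np.isfinite(scores)) or np.any(scores < 0) or total <= 0:
        return np.full(scores.size, 1.0 / scores.size)
    return scores / total


rng = np.random.default_rng(seed=0)
nodes = np.array(["node-a", "node-b", "node-c"])

probs = safe_probabilities([0.0, 0.0, 0.0])  # all-zero scores -> uniform
picked = rng.choice(nodes, size=2, p=probs).tolist()
print(probs, picked)
```

Note that the empty-list case is reported explicitly here, which is the condition the trace in this thread actually hit.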


wanjeakshay commented 1 month ago

@dosu I am getting a new error now.

WARNING:ragas.testset.docstore:Filename and doc_id are the same for all nodes.

ValueError                                Traceback (most recent call last)
in <cell line: 6>()
      4
      5 # Generate the test set
----> 6 testset = generator.generate_with_langchain_docs(
      7     documents=documents,
      8     test_size=10,  # Example test size

3 frames
/usr/local/lib/python3.10/dist-packages/ragas/testset/generator.py in generate_with_langchain_docs(self, documents, test_size, distributions, with_debugging_logs, is_async, raise_exceptions, run_config)
    208         )
    209
--> 210         return self.generate(
    211             test_size=test_size,
    212             distributions=distributions,

/usr/local/lib/python3.10/dist-packages/ragas/_analytics.py in wrapper(*args, **kwargs)
    127     def wrapper(*args: P.args, **kwargs: P.kwargs) -> t.Any:
    128         track(IsCompleteEvent(event_type=func.__name__, is_completed=False))
--> 129         result = func(*args, **kwargs)
    130         track(IsCompleteEvent(event_type=func.__name__, is_completed=True))
    131

/usr/local/lib/python3.10/dist-packages/ragas/testset/generator.py in generate(self, test_size, distributions, with_debugging_logs, is_async, raise_exceptions, run_config)
    278         current_nodes = [
    279             CurrentNodes(root_node=n, nodes=[n])
--> 280             for n in self.docstore.get_random_nodes(k=test_size)
    281         ]
    282         total_evolutions = 0

/usr/local/lib/python3.10/dist-packages/ragas/testset/docstore.py in get_random_nodes(self, k, alpha)
    325         prob = prob / np.sum(prob)
    326
--> 327         nodes = self.run_config.rng.choice(
    328             np.array(self.nodes), size=k, p=prob
    329         ).tolist()

numpy/random/_generator.pyx in numpy.random._generator.Generator.choice()

ValueError: a cannot be empty unless no samples are taken

Code to reproduce

generator = TestsetGenerator.from_langchain(
    generator_llm=llm_model,
    critic_llm=llm_model,
    embeddings=embeddings,
    chunk_size=500,
)

# Generate the test set
testset = generator.generate_with_langchain_docs(
    documents=documents,
    test_size=10,  # Example test size
    distributions={simple: 0.5, reasoning: 0.3, multi_context: 0.2},
    with_debugging_logs=False,
    is_async=True,
    raise_exceptions=False,  # Set to False to avoid ExceptionInRunner error
)

input variable

documents = [Document(metadata={'source': '/content/ragas/indian-economy-1-3.txt'}, page_content='Indian Economy Introduction - Economy Study Material & NotesIndia is a developing country and our economy is a mixed economy where the public sector co-exists with the private sector. For an overview of Indian Economy, we should first go through the strengths of Indian economy.India is likely to be the third largest economy with a GDP size of $15 trillion by 2030.The economy of India is currently the world’s fourth largest in terms of real GDP (purchasing power parity) after the USA, China and Japan and the second fastest growing major economy in the world after China.Indian economy growth rate is estimated to be around seven to eight percent by year 2015-16. [box type=”shadow” align=”aligncenter” ]Let’s look at some facts from history regarding India as an Economy. Dadabhai Naoroji is known as the Father of Indian Politics and Economics, also known as the ‘Grand Old Man of India’. Dadabhai Naoroji was the first to calculate the national income of India. In his book “Poverty and Un-British Rule in India” he describes his theory, i.e. the economic exploitation of India by the British. His theory is popularly called the Economic Drain Theory. Thats when economy of India came into discussion as an entity, prior to that it was just a scramble of princely states and colonisers. Thats all the history there for time being. [/box]Introduction to Indian Economy:-Low per capita income. Inequalities in income distribution. Predominance of agriculture.\n\n(More than 2/3rd of India’s working population is engaged in agriculture. But in USA only 2% of the working population is engaged in agriculture.) Rapidly growing population with 1.2% annual change. Chronic unemployment (A person is considered employed if he / she works for 273 days of a year for eight hours every day. )Unemployment in India is mainly structural in nature. Low rate of capital formation due to less saving rate. 
Dualistic Nature of Economy (features of a modern economy, as well as traditional).Mixed Economy Follows Labour Intensive Techniques and activities.Agriculture in Indian economy:-While Indian economy introduction is started, the major focus is always on the agriculture sector. This is because Indian economy is based on agriculture.52% of the total population of India depends on agriculture.According to the 2011-2012 survey of Indian agriculture contributes 14.1% of the Gross Domestic Product (GDP). It was 55.4% in 1950-1951.India is the second largest sugar producer in the world (after Brazil). In tea production, India ranks first. (27% of total production in the world).Wheat production: Uttar Pradesh is the largest producer. Punjab and Haryana is then the second and the third largest producer of wheat.Rice production:The principal food grain in India is rice. West Bengal is the largest producer. Uttar Pradesh is the second largest producer of Punjab and is the third largest producer of rice.\n\nIndian Economy Introduction - Economy Study Material & NotesSectors of Indian Economy:-1. Primary Sector: When the economic activity depends mainly on exploitation of naturalresources then that activity comes under the primary sector. Agriculture and agriculture related activities are the primary sectors of economy.2. Secondary Sector: When the main activity involves manufacturing then it is the secondarysector. All industrial production where physical goods are produced come under the secondary sector.3. Tertiary Sector: When the activity involves providing intangible goods like services then this is partof the tertiary sector. Financial services, management consultancy, telephony and IT are examples of service sector.Other Classifications of Economy:-In Indian economy introduction, the sectors of economy based on other basis is also required to get a clear picture of the strengths of Indian Economy.1. 
Organized Sector: The sector which carries out all activity through a system and follows the lawof the land is called organized sector. Moreover, labour rights are given due respect and wages are as per the norms of the country and those of the industry. Labour working organized sector get the benefit of social security net as framed by the Government. Certain benefits like provident fund, leave entitlement, medical benefits and insurance are provided to workers in the organized sector.\n\nThese security provisions are necessary to provide source of sustenance in case of disability or death of the main breadwinner of the family without which the dependents will face a bleak future.2. Unorganized Sector: The sectors which evade most of the laws and don’t follow the systemcome underunorganized sector. Small shopkeepers, some small scale manufacturing units keep all their attention on profit making and ignore their workers basic rights. Workers don’t get adequate salary and other benefits like leave, health benefits and insurance are beyond the imagination of people working in unorganized sectors.3. Public Sector: Companies which are run and financed by the Government comprises thepublic sector. After independence India was a very poor country. India needed huge amount of money to set up manufacturing plants for basic items like iron and steel, aluminium, fertilizers and cements. Additional infrastructure like roads, railways, ports and airports also require huge investment. 
In those days Indian entrepreneur was not cash rich so government had to start creating big public sector enterprises like SAIL (Steel Authority of India Limited), ONGC(Oil & Natural Gas Commission).4.\n\nPrivate Sector: Companies which are run and financed by private people comprise theprivate sector.Companies like Hero Honda, Tata are from private sectors.Decoding the fourth quarter of fiscal 2023 to 2024India’s GDP grew 8.15% YoY over fiscal 2023 to 2024, with a sharp growth of 7.8% in the fourth quarter of 2023, beating the government’s second advanced estimate of 7.6% and Reserve Bank of India’s (RBI) estimate of 7.3%. Fourth-quarter economic activities pointed to three interesting trends: improvement in private consumption, exports, and manufacturing.Firstly, private consumption spending fared better last year than reported earlier. There were upward revisions to the third-quarter data, suggesting consumers spent more generously in the quarter of festivals and the Cricket World Cup than previously believed. Third-quarter GDP growth was revised to 8.6%, up from the earlier estimate of 8.4%, on the back of a 50-basis point upward revision of private consumption spending. That said, private consumption growth remained capped at 4.03% over fiscal 2023 to 2024, owing to modest growth in the agriculture sector and persistent inflation that weighed on rural demand.Secondly, Indian exports surged by an impressive 8.1% YoY in the fourth quarter—the highest this fiscal year. Merchandise exports improved remarkably that quarter (even though the quarter’s performance could not prevent annual merchandise export growth from contracting).\n\nNotably, high-value manufactured goods like pharmaceuticals, chemicals, engineering products, and electronics achieved record export levels this quarter, propelling overall growth in merchandise exports in the high value-added segment. 
This positive trend bodes well for India as it aims to strengthen its integration into the global value chain and increase its exports to US$2 trillion over the next six years. Growth in the traditional basket of exports, however, continued to decline And finally, 8.9% growth in manufacturing that quarter points toward sustained momentum. The index of industrial production also hinted at a strong revival in the sector. A few other factors affecting the growth numbers observed in the last quarter are:Modest investment: The momentum in gross fixed capital formation slowed to 6.5% YoY last quarter reflecting a temporary weariness due to national elections in India and other major countries, the escalation of geopolitical tensions in Central Asia, and the economic slowdown in China.Demand exceeding supply: The divergence between GDP and GVA (gross value added) narrowed marginally last quarter to 1.5% relative to the previous quarter’s 1.8%, but the gap continued to remain high.\n\nThis possibly indicates that the demand side (as measured by the expenditure approach) is growing faster than the supply side (denoted by the production approach).Fiscal deficit defying expectations: The fiscal deficit narrowed to 5.63% of GDP, lower than both initial budget estimates (5.9%) and revised estimates (5.8%).')]

dosubot[bot] commented 1 month ago

The error ValueError: a cannot be empty unless no samples are taken is raised inside self.docstore.get_random_nodes(k=test_size), which the generate method calls: rng.choice is asked to sample from self.nodes, and that list is empty. This can happen if the docstore does not contain any nodes, even though your documents list is not empty.

Ensure that the docstore is properly initialized and that documents are correctly added to it. You might want to add debugging logs to verify that documents are being added to the docstore as expected.

Here is a snippet to add debugging logs:

def generate_with_langchain_docs(
    self,
    documents: t.Sequence[LCDocument],
    test_size: int,
    distributions: t.Optional[Distributions] = None,
    with_debugging_logs=False,
    is_async: bool = True,
    raise_exceptions: bool = True,
    run_config: t.Optional[RunConfig] = None,
):
    distributions = distributions or {}
    # chunk documents and add to docstore
    docs_to_add = [Document.from_langchain_document(doc) for doc in documents]
    self.docstore.add_documents(docs_to_add)

    # Debugging log
    logger.debug(f"Added {len(docs_to_add)} documents to the docstore.")

    return self.generate(
        test_size=test_size,
        distributions=distributions,
        with_debugging_logs=with_debugging_logs,
        is_async=is_async,
        raise_exceptions=raise_exceptions,
        run_config=run_config,
    )

Additionally, ensure that the docstore is not empty before calling generate. The trace shows get_random_nodes sampling from the docstore's nodes list, so that is the attribute to check:

if not self.docstore.nodes:
    raise ValueError("Docstore is empty. Ensure documents are added correctly.")

By adding these checks and logs, you can better understand why the docstore might be empty and address the root cause of the issue [1][2].
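
The `logger.debug` call in the snippet above only produces output if DEBUG-level logging is enabled. A minimal sketch using the standard `logging` module; the logger name `ragas.testset.docstore` is taken from the warning in the error trace, and whether a given Ragas module logs anything useful at DEBUG level is an assumption:

```python
import logging

# Show DEBUG-level records on the console.
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s:%(name)s:%(message)s")

# Logger name taken from the warning in the error trace; other Ragas
# module loggers are assumed to follow the same dotted naming.
docstore_logger = logging.getLogger("ragas.testset.docstore")
docstore_logger.setLevel(logging.DEBUG)

docstore_logger.debug("debug logging is enabled for %s", docstore_logger.name)
```

The `with_debugging_logs` argument of `generate_with_langchain_docs`, set to False in the calls in this thread, is the other switch worth trying.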
