Closed mcapitanio closed 2 months ago
Hi @mcapitanio A little more context here would help. I can see that your chunks are rather lower than the required threshold. Are you using test generation in a language other than English? If yes, check out https://docs.ragas.io/en/stable/howtos/applications/use_prompt_adaptation.html#language-adaptation-for-testset-generation Otherwise, I would love to see the type of documents you're feeding into the library.
Hi @shahules786 ,
yes, I am using test generation for Italian. This is my code:
```python
def generate(args,
             langchain_generation_llm: LangchainLLMWrapper,
             langchain_critic_llm: LangchainLLMWrapper,
             langchain_embeddings: LangchainEmbeddingsWrapper):
    try:
        loader = DirectoryLoader(path=args.documents_path, show_progress=True)
        documents = loader.load()
        for document in documents:
            splitted = document.metadata['source'].split("/")
            document.metadata['file_name'] = splitted[-1]
        splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)
        keyphrase_extractor = KeyphraseExtractor(llm=langchain_generation_llm)
        docstore = InMemoryDocumentStore(
            splitter=splitter,
            embeddings=langchain_embeddings,
            extractor=keyphrase_extractor)
        generator = TestsetGenerator(generator_llm=langchain_generation_llm,
                                     critic_llm=langchain_critic_llm,
                                     docstore=docstore,
                                     embeddings=langchain_embeddings)
        generator.adapt(language="italian",
                        evolutions=[simple, reasoning, conditional, multi_context],
                        cache_dir=args.cache_path)
        generator.save(evolutions=[simple, reasoning, multi_context, conditional],
                       cache_dir=args.cache_path)
        test_dataset = generator.generate_with_langchain_docs(
            documents,
            test_size=args.test_size,
            with_debugging_logs=True,
            distributions={simple: 1})
        test_dataset.save("test_dataset.jsonl")
    except Exception as e:
        print(e)
```
I am trying to generate synthetic data using Azure OpenAI in a simple case:
I have configured the Azure LLM for both the generator and the critic as suggested in the documentation. When the generation starts, I see these in the log:
and it continues until it reaches the call limit and gets a 429, waits for the retry, resumes with some 200 responses, then hits 429 again, and so on. It seems to never stop.
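One way to soften the 429 loop, while waiting for a proper fix, is to throttle requests client-side so you stay under your Azure deployment's requests-per-minute quota. Below is a minimal sliding-window rate limiter sketch; `RateLimiter` is a hypothetical helper, not part of ragas or LangChain, and ragas itself may expose its own retry/concurrency configuration depending on the version:

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most max_calls within any sliding window of `period` seconds.

    Hypothetical helper for illustration -- wrap your LLM calls with
    limiter.acquire() before each request to stay under the RPM quota.
    """

    def __init__(self, max_calls: int, period: float):
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()  # timestamps of recent calls

    def acquire(self) -> None:
        now = time.monotonic()
        # Discard timestamps that have left the sliding window.
        while self.calls and now - self.calls[0] > self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call ages out of the window.
            sleep_for = self.period - (now - self.calls[0])
            if sleep_for > 0:
                time.sleep(sleep_for)
            self.calls.popleft()
        self.calls.append(time.monotonic())
```

With, say, `RateLimiter(max_calls=60, period=60.0)` you would call `limiter.acquire()` immediately before each LLM request, trading a bit of latency for far fewer 429 retries.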
Why this behaviour? Any ideas? How many calls to the LLM are expected for a test size of N with M documents? Is there any rule available to estimate the cost?
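For the cost question, a back-of-envelope estimate can be sketched as below. The multipliers are assumptions for illustration only (roughly: one keyphrase-extraction call per chunk when the docstore is built, plus a few generation and critic calls per question); the real numbers depend on the ragas version and on how many evolutions get rejected by the critic and retried:

```python
def estimate_llm_calls(num_chunks: int, test_size: int,
                       calls_per_chunk: int = 1,
                       calls_per_question: int = 3,
                       critic_calls_per_question: int = 2) -> int:
    """Rough lower-bound estimate of total LLM calls for testset generation.

    Hypothetical multipliers, not taken from the ragas source:
    - calls_per_chunk: keyphrase extraction at docstore build time
    - calls_per_question: seed question plus evolution steps
    - critic_calls_per_question: filtering/scoring by the critic LLM
    """
    docstore_calls = num_chunks * calls_per_chunk
    generation_calls = test_size * (calls_per_question + critic_calls_per_question)
    return docstore_calls + generation_calls


# Example: 50 chunks, test_size=10 -> 50 + 10 * 5 = 100 calls
print(estimate_llm_calls(num_chunks=50, test_size=10))
```

Note that docstore cost scales with the number of chunks (so with M documents and chunk size), not with N, which is why small test sizes over large corpora can still be expensive.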