Marker-Inc-Korea / AutoRAG

RAG AutoML Tool - Find optimal RAG pipeline for your own data.
Apache License 2.0

[BUG] Making new test set from existing queries fails #627

Closed dividor closed 1 month ago

dividor commented 1 month ago

Describe the bug Following the documentation, I am trying to create a test set from existing queries. However, it fails during the embedding step (see the traceback below), even though my input dataframe contains well-formed queries as far as I can tell.
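To rule out bad inputs on my side, I checked the dataframe for empty or null queries before calling `make_qa_with_existing_queries` — a `BadRequestError` from the OpenAI embeddings endpoint is often caused by empty, null, or whitespace-only strings in the batch. A minimal sketch of that check (the `find_bad_rows` helper is my own, not part of AutoRAG):

```python
import pandas as pd

def find_bad_rows(df: pd.DataFrame, col: str = "query") -> pd.DataFrame:
    """Return rows whose text is missing, empty, or whitespace-only --
    inputs the OpenAI embeddings endpoint typically rejects."""
    text = df[col]
    mask = text.isna() | text.astype(str).str.strip().eq("")
    return df[mask]

# Example on a dataframe shaped like existing_qa_df:
qa = pd.DataFrame({"query": ["What is RAG?", "", None, "   "]})
print(find_bad_rows(qa))  # flags the empty, null, and whitespace-only rows
```

The same check can be run against the `contents` column of `corpus_df`, since the traceback shows the failure happens while embedding the corpus, not the queries.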

To Reproduce

See code below.

Expected behavior The QA test set should be created from the existing queries, as documented.

Full Error log

---------------------------------------------------------------------------
BadRequestError                           Traceback (most recent call last)
Cell In[43], line 58
     55 print(existing_qa_df["query"])
     57 llm = OpenAI(model=CHAT_MODEL, temperature=1.0)
---> 58 qa_df = make_qa_with_existing_queries(corpus_df, existing_qa_df, content_size=5,
     59                                       answer_creation_func=generate_answers,
     60                                       llm=llm, output_filepath=QA_FILE, cache_batch=64,
     61                                       embedding_model='openai_embed_3_large', top_k=1)
     63 # Prevent truncation of cell when using display
     64 pd.set_option('display.max_colwidth', None)

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/data/qacreation/base.py:140, in make_qa_with_existing_queries(corpus_df, existing_query_df, content_size, answer_creation_func, output_filepath, embedding_model, collection, upsert, random_state, cache_batch, top_k, **kwargs)
    137     collection = chroma_client.get_or_create_collection(collection_name)
    139 # embed corpus_df
--> 140 vectordb_ingest(collection, corpus_df, embeddings)
    141 vectordb_func = vectordb.__wrapped__
    142 retrieved_ids, retrieve_scores = vectordb_func(existing_query_df['query'].tolist(), top_k, collection, embeddings)

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/autorag/nodes/retrieval/vectordb.py:116, in vectordb_ingest(collection, corpus_data, embedding_model, embedding_batch)
    113     new_contents = openai_truncate_by_token(new_contents, openai_embedding_limit, embedding_model.model_name)
    115 new_ids = new_passage['doc_id'].tolist()
--> 116 embedded_contents = embedding_model.get_text_embedding_batch(new_contents, show_progress=True)
    117 input_batches = create_batches(api=collection._client, ids=new_ids, embeddings=embedded_contents)
    118 for batch in input_batches:

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/llama_index/core/instrumentation/dispatcher.py:260, in Dispatcher.span.<locals>.wrapper(func, instance, args, kwargs)
    252 self.span_enter(
    253     id_=id_,
    254     bound_args=bound_args,
   (...)
    257     tags=tags,
    258 )
    259 try:
--> 260     result = func(*args, **kwargs)
    261 except BaseException as e:
    262     self.event(SpanDropEvent(span_id=id_, err_str=str(e)))

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/llama_index/core/base/embeddings/base.py:332, in BaseEmbedding.get_text_embedding_batch(self, texts, show_progress, **kwargs)
    323 dispatcher.event(
    324     EmbeddingStartEvent(
    325         model_dict=model_dict,
    326     )
    327 )
    328 with self.callback_manager.event(
    329     CBEventType.EMBEDDING,
    330     payload={EventPayload.SERIALIZED: self.to_dict()},
    331 ) as event:
--> 332     embeddings = self._get_text_embeddings(cur_batch)
    333     result_embeddings.extend(embeddings)
    334     event.on_end(
    335         payload={
    336             EventPayload.CHUNKS: cur_batch,
    337             EventPayload.EMBEDDINGS: embeddings,
    338         },
    339     )

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/llama_index/embeddings/openai/base.py:432, in OpenAIEmbedding._get_text_embeddings(self, texts)
    425 """Get text embeddings.
    426 
    427 By default, this is a wrapper around _get_text_embedding.
    428 Can be overridden for batch queries.
    429 
    430 """
    431 client = self._get_client()
--> 432 return get_embeddings(
    433     client,
    434     texts,
    435     engine=self._text_engine,
    436     **self.additional_kwargs,
    437 )

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/tenacity/__init__.py:336, in BaseRetrying.wraps.<locals>.wrapped_f(*args, **kw)
    334 copy = self.copy()
    335 wrapped_f.statistics = copy.statistics  # type: ignore[attr-defined]
--> 336 return copy(f, *args, **kw)

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/tenacity/__init__.py:475, in Retrying.__call__(self, fn, *args, **kwargs)
    473 retry_state = RetryCallState(retry_object=self, fn=fn, args=args, kwargs=kwargs)
    474 while True:
--> 475     do = self.iter(retry_state=retry_state)
    476     if isinstance(do, DoAttempt):
    477         try:

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/tenacity/__init__.py:376, in BaseRetrying.iter(self, retry_state)
    374 result = None
    375 for action in self.iter_state.actions:
--> 376     result = action(retry_state)
    377 return result

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/tenacity/__init__.py:398, in BaseRetrying._post_retry_check_actions.<locals>.<lambda>(rs)
    396 def _post_retry_check_actions(self, retry_state: "RetryCallState") -> None:
    397     if not (self.iter_state.is_explicit_retry or self.iter_state.retry_run_result):
--> 398         self._add_action_func(lambda rs: rs.outcome.result())
    399         return
    401     if self.after is not None:

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/concurrent/futures/_base.py:449, in Future.result(self, timeout)
    447     raise CancelledError()
    448 elif self._state == FINISHED:
--> 449     return self.__get_result()
    451 self._condition.wait(timeout)
    453 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/concurrent/futures/_base.py:401, in Future.__get_result(self)
    399 if self._exception:
    400     try:
--> 401         raise self._exception
    402     finally:
    403         # Break a reference cycle with the exception in self._exception
    404         self = None

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/tenacity/__init__.py:478, in Retrying.__call__(self, fn, *args, **kwargs)
    476 if isinstance(do, DoAttempt):
    477     try:
--> 478         result = fn(*args, **kwargs)
    479     except BaseException:  # noqa: B902
    480         retry_state.set_exception(sys.exc_info())  # type: ignore[arg-type]

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/llama_index/embeddings/openai/base.py:180, in get_embeddings(client, list_of_text, engine, **kwargs)
    176 assert len(list_of_text) <= 2048, "The batch size should not be larger than 2048."
    178 list_of_text = [text.replace("\n", " ") for text in list_of_text]
--> 180 data = client.embeddings.create(input=list_of_text, model=engine, **kwargs).data
    181 return [d.embedding for d in data]

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/openai/resources/embeddings.py:114, in Embeddings.create(self, input, model, dimensions, encoding_format, user, extra_headers, extra_query, extra_body, timeout)
    108         embedding.embedding = np.frombuffer(  # type: ignore[no-untyped-call]
    109             base64.b64decode(data), dtype="float32"
    110         ).tolist()
    112     return obj
--> 114 return self._post(
    115     "/embeddings",
    116     body=maybe_transform(params, embedding_create_params.EmbeddingCreateParams),
    117     options=make_request_options(
    118         extra_headers=extra_headers,
    119         extra_query=extra_query,
    120         extra_body=extra_body,
    121         timeout=timeout,
    122         post_parser=parser,
    123     ),
    124     cast_to=CreateEmbeddingResponse,
    125 )

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/openai/_base_client.py:1259, in SyncAPIClient.post(self, path, cast_to, body, options, files, stream, stream_cls)
   1245 def post(
   1246     self,
   1247     path: str,
   (...)
   1254     stream_cls: type[_StreamT] | None = None,
   1255 ) -> ResponseT | _StreamT:
   1256     opts = FinalRequestOptions.construct(
   1257         method="post", url=path, json_data=body, files=to_httpx_files(files), **options
   1258     )
-> 1259     return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/openai/_base_client.py:936, in SyncAPIClient.request(self, cast_to, options, remaining_retries, stream, stream_cls)
    927 def request(
    928     self,
    929     cast_to: Type[ResponseT],
   (...)
    934     stream_cls: type[_StreamT] | None = None,
    935 ) -> ResponseT | _StreamT:
--> 936     return self._request(
    937         cast_to=cast_to,
    938         options=options,
    939         stream=stream,
    940         stream_cls=stream_cls,
    941         remaining_retries=remaining_retries,
    942     )

File ~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/openai/_base_client.py:1040, in SyncAPIClient._request(self, cast_to, options, remaining_retries, stream, stream_cls)
   1037         err.response.read()
   1039     log.debug("Re-raising status error")
-> 1040     raise self._make_status_error_from_response(err.response) from None
   [1042](https://file+.vscode-resource.vscode-cdn.net/Users/matthewharris/Desktop/git/llm_eval_research/~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/openai/_base_client.py:1042) return self._process_response(
   [1043](https://file+.vscode-resource.vscode-cdn.net/Users/matthewharris/Desktop/git/llm_eval_research/~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/openai/_base_client.py:1043)     cast_to=cast_to,
   [1044](https://file+.vscode-resource.vscode-cdn.net/Users/matthewharris/Desktop/git/llm_eval_research/~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/openai/_base_client.py:1044)     options=options,
   (...)
   [1048](https://file+.vscode-resource.vscode-cdn.net/Users/matthewharris/Desktop/git/llm_eval_research/~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/openai/_base_client.py:1048)     retries_taken=options.get_max_retries(self.max_retries) - retries,
   [1049](https://file+.vscode-resource.vscode-cdn.net/Users/matthewharris/Desktop/git/llm_eval_research/~/opt/miniconda3/envs/eval_demo/lib/python3.11/site-packages/openai/_base_client.py:1049) )

BadRequestError: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

Code where the bug happens

TEST_SET = f"{DATA_DIR}/Evaluation Test Set.csv"

# Fix encoding when reading csv file
existing_qa_df = pd.read_csv(TEST_SET, encoding='ascii', encoding_errors='ignore')

# Promote the second row to column names
existing_qa_df.columns = existing_qa_df.iloc[1]

# Drop the header rows that precede the data
existing_qa_df = existing_qa_df[2:]

# The dataframe has to contain a 'query' column
existing_qa_df = existing_qa_df.rename(columns={'Question': 'query'})
existing_qa_df = existing_qa_df.iloc[0:5]
print(existing_qa_df["query"])

llm = OpenAI(model=CHAT_MODEL, temperature=1.0)
qa_df = make_qa_with_existing_queries(corpus_df, existing_qa_df, content_size=5,
                                      answer_creation_func=generate_answers,
                                      llm=llm, output_filepath=QA_FILE, cache_batch=64,
                                      embedding_model='openai_embed_3_large', top_k=1)

This outputs ...

2                                         What were the effects of the pandemic on WFP activities in the different countries?
3    Was WFP able to maintaing its strategic positioning  during the pandemic  vis--vis the Government and other UN agencies?
4                                                  What are the common findings on WFP's adaptation in response to COVID-19? 
5                                 To what extent were interventions effective in helping to mitigate the effects of COVID-19?
6                                                            Was WFP timely in responding to needs under COVID-19 conditions?
Name: query, dtype: object
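As a precaution, the existing-queries dataframe can be sanity-checked before it is handed to make_qa_with_existing_queries: strip stray whitespace from each query and drop rows that are empty or missing. A minimal pandas sketch (the sample data below is made up):

```python
import pandas as pd

# Hypothetical sample mimicking a CSV with stray whitespace and blank rows
existing_qa_df = pd.DataFrame({
    "query": ["  What changed?  ", "", "   ", None, "Was WFP timely?"]
})

# Strip surrounding whitespace, then keep only rows with a non-empty query
existing_qa_df["query"] = existing_qa_df["query"].str.strip()
existing_qa_df = existing_qa_df[existing_qa_df["query"].fillna("").ne("")]

print(existing_qa_df["query"].tolist())
```

This does not fix the corpus itself, but it rules out the query side as a source of empty embedding inputs.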

Additional context: Thanks!

dividor commented 1 month ago

This was because I had a text node containing only whitespace; I think it then failed when calling the OpenAI embedding model. I removed the offending node and it worked.

It would be cool to add some validation for this, so I will leave the issue open, but feel free to close it otherwise.
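For anyone hitting the same error: a quick way to find such nodes before embedding is to scan the corpus dataframe for contents that are empty after stripping whitespace. A minimal sketch, assuming the corpus uses a contents column as in AutoRAG's corpus format (the sample data is made up):

```python
import pandas as pd

# Hypothetical corpus with one whitespace-only node
corpus_df = pd.DataFrame({
    "doc_id": ["a", "b", "c"],
    "contents": ["Real text about WFP.", "   \n\t ", "Another passage."]
})

# Flag rows whose contents are empty or whitespace-only
bad = corpus_df["contents"].fillna("").str.strip().eq("")
if bad.any():
    print("Whitespace-only nodes:", corpus_df.loc[bad, "doc_id"].tolist())

# Drop the offending rows before embedding
corpus_df = corpus_df[~bad]
```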

vkehfdl1 commented 1 month ago

Yes, this happens often. It seems like a good idea to add a 'delete whitespace' step, or to drop rows that contain only whitespace. We will work on it in vectordb_ingest.
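Such a guard might look roughly like the sketch below; the function name drop_blank_contents is hypothetical, not an actual AutoRAG API, and it returns the kept indices so callers can keep ids aligned with the surviving texts.

```python
def drop_blank_contents(texts):
    """Return (kept_texts, kept_indices), skipping None, empty,
    and whitespace-only entries before they reach the embedding API."""
    kept, indices = [], []
    for i, text in enumerate(texts):
        if text and text.strip():
            kept.append(text)
            indices.append(i)
    return kept, indices
```

Filtering before the embedding call avoids the opaque '$.input' is invalid 400 from the OpenAI API.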

dividor commented 1 month ago

Thanks!