Arize-ai / openinference

OpenTelemetry Instrumentation for AI Observability
https://arize-ai.github.io/openinference/
Apache License 2.0

[bug] LlamaIndexInstrumentor not working for BaseEmbeddings.get_text_embeddings_batch() #983

Open P1et1e opened 2 months ago

P1et1e commented 2 months ago

Describe the bug: The UI only shows the request results of the last call to get_text_embedding_batch, so only the embeddings of the last 10 chunks are traced.

To Reproduce: Use LlamaIndex's SimpleDirectoryReader to read in a PDF file and build an index via VectorStoreIndex.from_documents() using AzureOpenAIEmbedding.

Expected behavior: I expected multiple traces under the BaseEmbedding span, one for each call to get_text_embedding_batch, not only for the last call. I want proper observability by tracing all calls to the embedding model API, and I also want to be able to use all embeddings for inference.


axiomofjoy commented 2 months ago

Thanks @P1et1e. Can you send a snippet to reproduce the issue?

P1et1e commented 2 months ago

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from phoenix.trace import using_project
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from openinference.instrumentation.langchain import LangChainInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

# Export spans to a local Phoenix instance
tracer_provider = trace_sdk.TracerProvider()
tracer_provider.add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter("http://127.0.0.1:6006/v1/traces"))
)

LlamaIndexInstrumentor().instrument(
    skip_dep_check=True,
    tracer_provider=tracer_provider,
)

ada_002_embeddings = AzureOpenAIEmbedding()

# Load all PDFs from the directory (one document per page by default)
required_exts = [".pdf"]
documents = SimpleDirectoryReader(
    input_dir="/dir/path/",
    required_exts=required_exts,
    recursive=True,
).load_data()

# Build the index under the "indexing" project
with using_project("indexing"):
    index = VectorStoreIndex.from_documents(
        documents,
        embed_model=ada_002_embeddings,
    )
```
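(For context, a guess at where the "last 10 chunks" number comes from, based on LlamaIndex defaults rather than anything confirmed in this thread: get_text_embedding_batch splits its input into batches of embed_batch_size, which defaults to 10, so the final underlying API call covers the last 10 chunks.)

```python
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

# Assumption: with ~1400 chunks and the default embed_batch_size of 10, the
# instrumented get_text_embedding_batch call makes ~140 underlying API requests
# of 10 texts each. Setting the batch size explicitly does not change tracing
# behavior; it just documents where the "last 10" comes from.
ada_002_embeddings = AzureOpenAIEmbedding(embed_batch_size=10)
```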

@axiomofjoy And I am running an arize/phoenix Docker container.
When I have a PDF document with around 1400 pages (slides of a presentation), I get one document/chunk per page by default. In the Phoenix UI I can see one trace of kind embedding with the name BaseEmbedding.get_text_embedding_batch, and under this trace only the embeddings for the last batch of 10 are shown. When I click on Attributes, I can see a JSON where, under the key input, there is a field value containing an object with a key texts that holds a list of all text chunks. I expected to see all embeddings there, and maybe even get a span for each batch of embeddings.
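To illustrate the shape described above, here is a rough sketch reconstructed from the description (not copied from an actual Phoenix trace):

```python
# Rough sketch of the single span's attributes as described above; keys and
# nesting are reconstructed from the description, not taken from a real trace.
span_attributes = {
    "input": {
        "value": {
            # the full list of ~1400 chunk texts appears here ...
            "texts": ["chunk 1 text", "chunk 2 text", "..."],
        }
    },
    # ... but only the embeddings from the final batch of 10 are attached
}
```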

axiomofjoy commented 2 months ago

Thank you for the details, @P1et1e!

P1et1e commented 1 month ago

@axiomofjoy is there any update, or anything I can assist with?

axiomofjoy commented 1 month ago

Hey @P1et1e, this issue is not scheduled and I'm not sure when we will get to it as the team is currently heads down implementing authentication. Contributions are welcome!