apache / camel-quarkus

Apache Camel Quarkus
https://camel.apache.org
Apache License 2.0

Langchain4J embeddings - tests and native support #5973

Open zbendhiba opened 4 months ago

zbendhiba commented 4 months ago

Improve the LangChain4j embeddings extension to provide integration tests and native support.

jamesnetherton commented 3 months ago

I wrote some tests but am currently blocked on a native issue related to JNI usage:

Caused by: java.lang.UnsatisfiedLinkError: ai.djl.huggingface.tokenizers.jni.TokenizersLibrary.encode(JLjava/lang/String;Z)J [symbol: Java_ai_djl_huggingface_tokenizers_jni_TokenizersLibrary_encode or Java_ai_djl_huggingface_tokenizers_jni_TokenizersLibrary_encode__JLjava_lang_String_2Z]
    at org.graalvm.nativeimage.builder/com.oracle.svm.core.jni.access.JNINativeLinkage.getOrFindEntryPoint(JNINativeLinkage.java:152)
    at org.graalvm.nativeimage.builder/com.oracle.svm.core.jni.JNIGeneratedMethodSupport.nativeCallAddress(JNIGeneratedMethodSupport.java:54)
    at ai.djl.huggingface.tokenizers.jni.TokenizersLibrary.encode(Native Method)
    at ai.djl.huggingface.tokenizers.HuggingFaceTokenizer.encode(HuggingFaceTokenizer.java:213)
    at ai.djl.huggingface.tokenizers.HuggingFaceTokenizer.encode(HuggingFaceTokenizer.java:224)
    at ai.djl.huggingface.tokenizers.HuggingFaceTokenizer.tokenize(HuggingFaceTokenizer.java:183)
    at dev.langchain4j.model.embedding.OnnxBertBiEncoder.embed(OnnxBertBiEncoder.java:59)
    at dev.langchain4j.model.embedding.AbstractInProcessEmbeddingModel.embedAll(AbstractInProcessEmbeddingModel.java:50)
    at dev.langchain4j.model.embedding.EmbeddingModel.embed(EmbeddingModel.java:34)
    at org.apache.camel.component.langchain4j.embeddings.LangChain4jEmbeddingsProducer.process(LangChain4jEmbeddingsProducer.java:41)

It seems there were some similar(ish) problems in quarkiverse-langchain4j. They actually have their embeddings native tests disabled.
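
For anyone reading the UnsatisfiedLinkError above: the two names listed after "symbol:" are the short and the overloaded (long) native function names that the JNI name-mangling rules derive from the Java method. The sketch below reproduces both names from the class name and method descriptor shown in the error. The class `JniSymbolNames` and its helper methods are hypothetical names made up for illustration; they are not part of GraalVM, DJL, or Camel Quarkus.

```java
public final class JniSymbolNames {

    private JniSymbolNames() {
    }

    /** Escapes one name component following the JNI name-mangling rules. */
    public static String mangle(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '.' || c == '/') {
                sb.append('_');          // package separator
            } else if (c == '_') {
                sb.append("_1");         // literal underscore
            } else if (c == ';') {
                sb.append("_2");         // end of an object type in a descriptor
            } else if (c == '[') {
                sb.append("_3");         // array marker in a descriptor
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')) {
                sb.append(c);            // ASCII alphanumerics pass through unchanged
            } else {
                sb.append(String.format("_0%04x", (int) c)); // any other character
            }
        }
        return sb.toString();
    }

    /** Short form: Java_ + mangled class name + _ + mangled method name. */
    public static String shortName(String className, String methodName) {
        return "Java_" + mangle(className) + "_" + mangle(methodName);
    }

    /** Long form for overloaded natives: short form + __ + mangled argument descriptor. */
    public static String longName(String className, String methodName, String descriptor) {
        String args = descriptor.substring(descriptor.indexOf('(') + 1, descriptor.indexOf(')'));
        return shortName(className, methodName) + "__" + mangle(args);
    }

    public static void main(String[] args) {
        // Class, method and descriptor copied from the error message in this issue.
        String cls = "ai.djl.huggingface.tokenizers.jni.TokenizersLibrary";
        System.out.println(shortName(cls, "encode"));
        System.out.println(longName(cls, "encode", "(JLjava/lang/String;Z)J"));
    }
}
```

Running `main` prints `Java_ai_djl_huggingface_tokenizers_jni_TokenizersLibrary_encode` and `Java_ai_djl_huggingface_tokenizers_jni_TokenizersLibrary_encode__JLjava_lang_String_2Z`, the same two symbols the native executable reports it could not find.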

zbendhiba commented 3 months ago

@jamesnetherton Do you know from where this Tokenizer is pulled? Is it coming directly from our Camel component?

jamesnetherton commented 3 months ago

> Do you know from where this Tokenizer is pulled

Basically, it is pulled in whenever you use any langchain4j-embeddings-* dependency.

jamesnetherton commented 2 months ago

I got a bit further by using quarkus-langchain4j-parsers-base, but I am now stuck with a runtime segfault, similar to what the Quarkus LangChain4j folks also encounter.

Starting the stack walk in a possible caller:
  A  SP 0x000000016fcf9900 IP 0x000000010051c6b4 size=240   ai.onnxruntime.OrtSession.run(Native Method)
  A  SP 0x000000016fcf99f0 IP 0x000000010051ad08 size=224   ai.onnxruntime.OrtSession.run(OrtSession.java:395)
  i  SP 0x000000016fcf9ad0 IP 0x00000001010549e8 size=128   ai.onnxruntime.OrtSession.run(OrtSession.java:242)
  i  SP 0x000000016fcf9ad0 IP 0x00000001010549e8 size=128   ai.onnxruntime.OrtSession.run(OrtSession.java:210)
  A  SP 0x000000016fcf9ad0 IP 0x00000001010549e8 size=128   dev.langchain4j.model.embedding.OnnxBertBiEncoder.encode(OnnxBertBiEncoder.java:115)
  A  SP 0x000000016fcf9b50 IP 0x0000000101053e0c size=112   dev.langchain4j.model.embedding.OnnxBertBiEncoder.embed(OnnxBertBiEncoder.java:64)
  A  SP 0x000000016fcf9bc0 IP 0x0000000101050540 size=112   dev.langchain4j.model.embedding.AbstractInProcessEmbeddingModel.embedAll(AbstractInProcessEmbeddingModel.java:50)
  A  SP 0x000000016fcf9c30 IP 0x0000000101052f7c size=96    dev.langchain4j.model.embedding.EmbeddingModel.embed(EmbeddingModel.java:34)
  A  SP 0x000000016fcf9c90 IP 0x000000010204d0f4 size=80    org.apache.camel.component.langchain4j.embeddings.LangChain4jEmbeddingsProducer.process(LangChain4jEmbeddingsProducer.java:41)