InAnYan / jabref

Graphical Java application for managing BibTeX and biblatex (.bib) databases
https://devdocs.jabref.org
MIT License

Evaluate distribution size #79

Closed: koppor closed this issue 1 month ago

koppor commented 1 month ago

(Refs https://github.com/JabRef/jabref-issue-melting-pot/issues/462)

JabRef's distribution size doubles if the AI feature is included. Do we really need 200 MB more of dependencies? Can we analyze the dependencies to check whether all of them are really needed?

koppor commented 1 month ago

If we cannot shrink it, we need to find a way to provide two editions of JabRef. A multi-project build with an optional sub-project seems to be the most doable option.

koppor commented 1 month ago

DevCall decision: we aim for the current ~200 MB size. A 50 MB increase is OK, 100 MB is not, and a 200 MB increase is out of the question.

InAnYan commented 1 month ago

Here is some investigation of the modules file. The list below mostly contains the new modules, with notes about their sizes:

- overall (708M -> 1.4G [d~692M])
- com.fasterxml.jackson.datatype.jdk8
- flexmark?
- jakarta.validation
- jpro.webapi
- jvm.openai
- kotlin.stdlib (3.9M)
- one.jpro.platform.mdfx
- org.controlsfx.controls
- org.jabref (9.2M -> 9.5M)
- org.jabref.merged.module (148M -> 696M [d548M])
  - ai (378M)
    - onnxruntime (376M)
    - all-mini... (87M)
    - all-mini...-q (22M)
    - knuddels (3.3M)
  - dev.ai4j.openai4j
  - native.lib (49M) // and many OS-arch, contains some kind of tokenizers

// Why are there both openai4j and jvm-openai?

InAnYan commented 1 month ago

Related issue: https://github.com/langchain4j/langchain4j/issues/1492

InAnYan commented 1 month ago

Time for an ADR. What do you think?

Problem: we need to generate embeddings from text chunks. (Carl, you may not know this, but for this ADR it's enough to know that, in order to do Q&A over papers, every paper is split into chunks of, say, 300 words, and every chunk must be "embedded".) A small code sketch of this chunk-and-embed step follows after the options below.

1) Use a local embedding model. Pros:

2) Use the OpenAI API for embedding generation. Pros:
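
To make the problem concrete, here is a minimal sketch of the chunk-and-embed step using langchain4j (which the AI feature already depends on). The exact package names, splitter parameters, and model class vary between langchain4j versions; this is only an illustration of where an embedding model is needed, not the actual JabRef code.

```java
import java.util.List;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;

class ChunkAndEmbedSketch {

    static List<Embedding> embedPaper(String fullText) {
        // 1. Split the paper into overlapping chunks (sizes are illustrative;
        //    this overload counts characters, not words).
        Document paper = Document.from(fullText);
        List<TextSegment> chunks = DocumentSplitters.recursive(1000, 100).split(paper);

        // 2. Embed every chunk. This is the step that needs either a local model
        //    (option 1) or a remote API such as OpenAI's (option 2).
        EmbeddingModel model = new AllMiniLmL6V2EmbeddingModel(); // the local ONNX-based model
        return model.embedAll(chunks).content();
    }
}
```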

InAnYan commented 1 month ago

For embedding generation that runs locally, langchain4j uses the ONNX runtime.

The problem is not even the embedding models themselves.

We could keep only the quantized model (uncompressed it is ~20 MB; the non-quantized one is ~85 MB), but the problem is with the onnxruntime library itself. It is packaged a bit strangely: it apparently uses FFI and ships a native library for every OS/architecture combination.

We could download the models at runtime, no problem. But we cannot change the ONNX runtime itself.
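
For reference, downloading the quantized model at runtime needs nothing beyond the JDK; only the onnxruntime code itself would have to stay in the distribution. The URL and cache path below are placeholders, and handing the downloaded file to langchain4j's ONNX-backed embedding model is an assumption about its API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

class EmbeddingModelDownloader {

    // Placeholder URL; the quantized all-MiniLM model would be hosted somewhere we control.
    private static final URI MODEL_URI = URI.create("https://example.org/models/all-minilm-l6-v2-q.onnx");

    static Path ensureModel(Path cacheDir) throws Exception {
        Path target = cacheDir.resolve("all-minilm-l6-v2-q.onnx");
        if (Files.exists(target)) {
            return target; // already downloaded on a previous run
        }
        Files.createDirectories(cacheDir);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(MODEL_URI).GET().build();
        // Stream the model file straight to disk instead of bundling it in the installer.
        client.send(request, HttpResponse.BodyHandlers.ofFile(target));

        // The returned path could then be passed to langchain4j's ONNX-backed embedding
        // model (the exact constructor depends on the langchain4j version).
        return target;
    }
}
```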

InAnYan commented 1 month ago

There is also a third option for the ADR:

3) Use the second approach, but also add the ability to use a local server.

Description: many programs for running LLMs locally (Thi Lo is the expert here) can expose a REST server with an API that is compatible with OpenAI's. That means we can run everything locally, while JabRef thinks it is talking to OpenAI.

So we need to do two things: 1) JabRef uses the OpenAI API for embedding generation. 2) A local server is started externally.

To implement the second point, we need to write comprehensible tutorials on how to set up tools such as LM Studio, LocalAI, Ollama, or similar programs.
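
From JabRef's side, option 3 would hardly differ from option 2, assuming the embeddings go through langchain4j's OpenAI client: only the base URL changes so that it points at a locally running OpenAI-compatible server. The port and model names below are examples, not a recommendation.

```java
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;

class EmbeddingModelFactory {

    static EmbeddingModel create(boolean useLocalServer) {
        return OpenAiEmbeddingModel.builder()
                // Local OpenAI-compatible servers (Ollama, LM Studio, LocalAI, ...) usually
                // expose the API under http://localhost:<port>/v1; 11434 is Ollama's default.
                .baseUrl(useLocalServer ? "http://localhost:11434/v1" : "https://api.openai.com/v1")
                // Local servers typically ignore the API key, but the client still expects one.
                .apiKey(useLocalServer ? "unused" : System.getenv("OPENAI_API_KEY"))
                // The model name as the server knows it (example values).
                .modelName(useLocalServer ? "nomic-embed-text" : "text-embedding-3-small")
                .build();
    }
}
```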

InAnYan commented 1 month ago

I think the easiest way to do all of this is:

  1. Implement the 2nd option (but I'm worried about token usage; what if someone has a library consisting of books with more than 1000 pages? see the rough estimate below)
  2. Implement the 3rd option later, when we add support for local models
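
A rough back-of-envelope estimate of that token usage (all numbers are assumptions): a 1000-page book at roughly 500 words per page is about 500,000 words; at roughly 1.3 tokens per English word, that is around 650,000 embedding tokens per book, so a library with dozens of such books means tens of millions of tokens for the initial indexing alone.
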
InAnYan commented 1 month ago

(screenshot)

InAnYan commented 1 month ago

Model rankings (https://huggingface.co/spaces/mteb/leaderboard):

OpenAI embedding model: (leaderboard screenshot)

Our current local embedding model: (leaderboard screenshot)

InAnYan commented 1 month ago

It seems that with the Deep Java Library (DJL), the distribution size meets all the requirements!

I hope it really works (I'm just a bit suspicious; hopefully there are no strange issues with that library)!
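
For context, here is a minimal sketch of how an embedding model can be loaded through DJL's model zoo API. The model URL follows DJL's documented examples; the key point is that the model and the native engine can be fetched at runtime rather than shipped inside the installer, which is presumably why the distribution stays small. Details may differ once this is actually integrated.

```java
import ai.djl.huggingface.translator.TextEmbeddingTranslatorFactory;
import ai.djl.inference.Predictor;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

class DjlEmbeddingSketch {

    static float[] embed(String text) throws Exception {
        // Ask DJL's model zoo for a sentence-transformers model. The model and the
        // native engine are downloaded on first use instead of being bundled.
        Criteria<String, float[]> criteria = Criteria.builder()
                .setTypes(String.class, float[].class)
                .optModelUrls("djl://ai.djl.huggingface.pytorch/sentence-transformers/all-MiniLM-L6-v2")
                .optTranslatorFactory(new TextEmbeddingTranslatorFactory())
                .build();

        try (ZooModel<String, float[]> model = criteria.loadModel();
             Predictor<String, float[]> predictor = model.newPredictor()) {
            return predictor.predict(text);
        }
    }
}
```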