InAnYan / jabref

Graphical Java application for managing BibTeX and biblatex (.bib) databases
https://devdocs.jabref.org
MIT License

Evaluate distribution size #79

Closed: koppor closed this issue 1 month ago

koppor commented 1 month ago

(Refs https://github.com/JabRef/jabref-issue-melting-pot/issues/462)

JabRef's distribution size doubles if the AI feature is included. Do we really need 200 MB more of dependencies? Can we analyze the dependencies to check whether all of them are really needed?

koppor commented 1 month ago

If we cannot shrink it, we need to find a way to provide two editions of JabRef. A multi-project build with an optional sub-project seems to be the most doable option.

koppor commented 1 month ago

DevCall decision: we aim for the current ~200 MB size. A 50 MB increase is OK, 100 MB is not, and a 200 MB increase is out of the question.

InAnYan commented 1 month ago

Here is some investigation of the modules file. The list below mostly contains the new modules, with notes about their sizes:

- overall (708M -> 1.4G [d~692M])
- com.fasterxml.jackson.datatype.jdk8
- flexmark?
- jakarta.validation
- jpro.webapi
- jvm.openai
- kotlin.stdlib (3.9M)
- one.jpro.platform.mdfx
- org.controlsfx.controls
- org.jabref (9.2M -> 9.5M)
- org.jabref.merged.module (148M -> 696M [d548M])
  - ai (378M)
    - onnxruntime (376M)
    - all-mini... (87M)
    - all-mini...-q (22M)
    - knuddels (3.3M)
  - dev.ai4j.openai4j
  - native.lib (49M) // and many OS-arch, contains some kind of tokenizers

// Why are there both openai4j and jvm-openai?

InAnYan commented 1 month ago

Related issue: https://github.com/langchain4j/langchain4j/issues/1492

InAnYan commented 1 month ago

Time for an ADR. What do you think?

Problem: we need to generate embeddings from text chunks. (Carl, you may not know this, but for this ADR it's enough to know that, in order to do Q&A over papers, every paper is split into chunks of, say, 300 words, and every chunk must be "embedded".) A small code sketch of this chunk-and-embed step follows after the options below.

1) Use a local embedding model. Pros:

2) Use the OpenAI API for embedding generation. Pros:
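
To make the problem concrete, here is a minimal sketch of the chunk-and-embed step using langchain4j (which the AI feature already depends on). The exact package names, splitter parameters, and model class vary between langchain4j versions; this is only an illustration of where an embedding model is needed, not the actual JabRef code.

```java
import java.util.List;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;

class ChunkAndEmbedSketch {

    static List<Embedding> embedPaper(String fullText) {
        // 1. Split the paper into overlapping chunks (sizes are illustrative;
        //    this overload counts characters, not words).
        Document paper = Document.from(fullText);
        List<TextSegment> chunks = DocumentSplitters.recursive(1000, 100).split(paper);

        // 2. Embed every chunk. This is the step that needs either a local model
        //    (option 1) or a remote API such as OpenAI's (option 2).
        EmbeddingModel model = new AllMiniLmL6V2EmbeddingModel(); // the local ONNX-based model
        return model.embedAll(chunks).content();
    }
}
```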

InAnYan commented 1 month ago

For embedding generation that runs locally, langchain4j uses the ONNX runtime.

The problem is not even the embedding models themselves.

We could keep only the quantized model (uncompressed it is ~20 MB; the non-quantized one is ~85 MB), but the problem is with the onnxruntime library itself. It is packaged a bit strangely: it apparently uses FFI and ships a native library for every OS/architecture combination.

We could download the models at runtime, no problem. But we cannot change the ONNX runtime itself.
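
For reference, downloading the quantized model at runtime needs nothing beyond the JDK; only the onnxruntime code itself would have to stay in the distribution. The URL and cache path below are placeholders, and handing the downloaded file to langchain4j's ONNX-backed embedding model is an assumption about its API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

class EmbeddingModelDownloader {

    // Placeholder URL; the quantized all-MiniLM model would be hosted somewhere we control.
    private static final URI MODEL_URI = URI.create("https://example.org/models/all-minilm-l6-v2-q.onnx");

    static Path ensureModel(Path cacheDir) throws Exception {
        Path target = cacheDir.resolve("all-minilm-l6-v2-q.onnx");
        if (Files.exists(target)) {
            return target; // already downloaded on a previous run
        }
        Files.createDirectories(cacheDir);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(MODEL_URI).GET().build();
        // Stream the model file straight to disk instead of bundling it in the installer.
        client.send(request, HttpResponse.BodyHandlers.ofFile(target));

        // The returned path could then be passed to langchain4j's ONNX-backed embedding
        // model (the exact constructor depends on the langchain4j version).
        return target;
    }
}
```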

InAnYan commented 1 month ago

There is also a third option for the ADR:

3) Use the second approach, but also add the ability to use a local server.

Description: many programs for running LLMs locally (Thi Lo is the expert here) can expose a REST server with an API that is compatible with OpenAI's. That means we can run everything locally, while JabRef thinks it is talking to OpenAI.

So we need to do two things: 1) JabRef uses the OpenAI API for embedding generation. 2) A local server is started externally.

To implement the second point, we need to write comprehensible tutorials on how to set up tools such as LM Studio, LocalAI, Ollama, or similar programs.
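
From JabRef's side, option 3 would hardly differ from option 2, assuming the embeddings go through langchain4j's OpenAI client: only the base URL changes so that it points at a locally running OpenAI-compatible server. The port and model names below are examples, not a recommendation.

```java
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;

class EmbeddingModelFactory {

    static EmbeddingModel create(boolean useLocalServer) {
        return OpenAiEmbeddingModel.builder()
                // Local OpenAI-compatible servers (Ollama, LM Studio, LocalAI, ...) usually
                // expose the API under http://localhost:<port>/v1; 11434 is Ollama's default.
                .baseUrl(useLocalServer ? "http://localhost:11434/v1" : "https://api.openai.com/v1")
                // Local servers typically ignore the API key, but the client still expects one.
                .apiKey(useLocalServer ? "unused" : System.getenv("OPENAI_API_KEY"))
                // The model name as the server knows it (example values).
                .modelName(useLocalServer ? "nomic-embed-text" : "text-embedding-3-small")
                .build();
    }
}
```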

InAnYan commented 1 month ago

I think the easiest way to do all of this is:

  1. Implement the 2nd option (but I'm worried about token usage; what if someone has a library consisting of books with more than 1000 pages? see the rough estimate below)
  2. Implement the 3rd option later, when we add support for local models
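
A rough back-of-envelope estimate of that token usage (all numbers are assumptions): a 1000-page book at roughly 500 words per page is about 500,000 words; at roughly 1.3 tokens per English word, that is around 650,000 embedding tokens per book, so a library with dozens of such books means tens of millions of tokens for the initial indexing alone.
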
InAnYan commented 1 month ago

(screenshot)

InAnYan commented 1 month ago

Model rankings (https://huggingface.co/spaces/mteb/leaderboard):

OpenAI embedding model: (leaderboard screenshot)

Our current local embedding model: (leaderboard screenshot)

InAnYan commented 1 month ago

It seems that with the Deep Java Library (DJL), the distribution size meets all the requirements!

I hope it really works (I'm just a bit suspicious; hopefully there are no strange issues with that library)!
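
For context, here is a minimal sketch of how an embedding model can be loaded through DJL's model zoo API. The model URL follows DJL's documented examples; the key point is that the model and the native engine can be fetched at runtime rather than shipped inside the installer, which is presumably why the distribution stays small. Details may differ once this is actually integrated.

```java
import ai.djl.huggingface.translator.TextEmbeddingTranslatorFactory;
import ai.djl.inference.Predictor;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

class DjlEmbeddingSketch {

    static float[] embed(String text) throws Exception {
        // Ask DJL's model zoo for a sentence-transformers model. The model and the
        // native engine are downloaded on first use instead of being bundled.
        Criteria<String, float[]> criteria = Criteria.builder()
                .setTypes(String.class, float[].class)
                .optModelUrls("djl://ai.djl.huggingface.pytorch/sentence-transformers/all-MiniLM-L6-v2")
                .optTranslatorFactory(new TextEmbeddingTranslatorFactory())
                .build();

        try (ZooModel<String, float[]> model = criteria.loadModel();
             Predictor<String, float[]> predictor = model.newPredictor()) {
            return predictor.predict(text);
        }
    }
}
```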