do-me / SemanticFinder

SemanticFinder - frontend-only live semantic search with transformers.js
https://do-me.github.io/SemanticFinder/
MIT License

[Feature Concept] A collection of pre-indexed texts #44

Closed do-me closed 8 months ago

do-me commented 8 months ago

Intro

This is a feature concept for a public collection of indexed texts.

The main idea is that there is one (or more) space(s) with publicly available bodies of text and their embeddings. This is especially interesting for large bodies of text that take a long time to index.

Examples: the Bible, the Torah, the Quran, public docs like the IPCC report, studies or similar

E.g. indexing the King James Bible with 200-character chunks and a small model takes around 20 minutes. Loading a pre-indexed file into SemanticFinder instead would take only a few seconds, and the cosine distance calculation, sorting etc. take about a second.
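To make concrete how little work is left once a pre-indexed file is loaded, here is a minimal sketch of the search step (illustrative TypeScript, not the actual app code; names and types are made up):

interface IndexedChunk {
  text: string;
  embedding: number[];
}

// Plain cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every pre-indexed chunk against a freshly embedded query and sort.
function rankChunks(queryEmbedding: number[], chunks: IndexedChunk[]) {
  return chunks
    .map((c) => ({ ...c, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.score - a.score);
}

Only the query itself still needs to be embedded at runtime; everything else is arithmetic over the stored vectors.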

For small bodies of text (poems, one-page docs, etc.) this is not really relevant, as the initial indexing would be fast anyway.

Idea

My idea for the moment is a separate GitHub repo for the pre-indexed files, containing the text, embeddings & settings.

Not quite sure where else these files could be stored for free. If there is some kind of public interest, maybe looking for a sponsor would be an option?

Initially, the 50 MB file size limit might even be enough: the ~700 pages of the Bible, for example, are 4.35 MB uncompressed as plain text and ~1.32 MB gzipped (7zip ultra setting). The text itself is negligible from a file size point of view.

As for the embeddings, the largest test I ran was 134k embeddings in ~38 MB at 2-decimal precision, yielding good results.

In the case of the Bible, indexing with 200-character chunks results in only ~23k embeddings, so well under the above threshold, at probably ~7 MB.

I'm not quite sure which book/document would be much bigger.

Tasks

Considering that there is already a working minimal PoC, it shouldn't be too hard to update the logic for the current version of the app. However, in order to be sustainable, I think the file format is worth thinking about more carefully. A first draft could look like this:

{
  "title": "King James Bible",
  "meta": {"chunk_type": "chars", "chunk_size": 100, "model_name": "Xenova/bge-small", ...},
  "embeddings": [
    {"text": "twenty thousand drams of gold, and two thousand pounds of",
     "embedding": [0.234, 0.242, 0.345, 0.131, ... , 0.234]},
    {...}
  ]
}
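Expressed as TypeScript types, purely for illustration (field names mirror the JSON draft above; the set of chunk_type values is just an assumption):

// Illustrative types for the draft format above.
interface IndexMeta {
  chunk_type: "chars" | "words" | "sentences"; // assumed set of chunking modes
  chunk_size: number;
  model_name: string; // e.g. "Xenova/bge-small"
}

interface EmbeddedChunk {
  text: string;
  embedding: number[];
}

interface PreIndexedText {
  title: string;
  meta: IndexMeta;
  embeddings: EmbeddedChunk[];
}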

Concatenating all the chunk texts with spaces would then reconstruct the original text. This approach might have some disadvantages though, so it might be better to keep the original text as one field and reference it via indices like this:

{
  "title": "King James Bible",
  "meta": {"chunk_type": "chars", "chunk_size": 100, "model_name": "Xenova/bge-small", ...},
  "text": "all the bible text goes here .... plenty of chars! ",
  "embeddings": [
    {"index": [0, 99], "embedding": [0.234, 0.242, 0.345, 0.131, ... , 0.234]},
    {"index": [100, 199], "embedding": [0.234, 0.242, 0.345, 0.131, ... , 0.234]},
    {"index": [200, 299], "embedding": [0.234, 0.242, 0.345, 0.131, ... , 0.234]},
    ...
  ]
}
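Recovering a chunk's text from such a file would then be a simple slice, e.g. (hypothetical helper, assuming the [start, end] indices are inclusive as in the example):

interface IndexedEmbedding {
  index: [number, number]; // inclusive [start, end] character positions
  embedding: number[];
}

// Cut the chunk's text out of the single "text" field.
function chunkText(fullText: string, entry: IndexedEmbedding): string {
  const [start, end] = entry.index;
  return fullText.slice(start, end + 1);
}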

This would also allow for double indexation, e.g. [0,99] alongside [0,9], [10,19], ..., so one could have coarser embeddings and fine-grained embeddings in the same file.

Also, one could e.g. store the text and the indices separately: first load the text once, then load a first index for a coarse search, and already fetch a finer index during that first search, and so on. The UX could be very nice this way.
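A rough sketch of that progressive flow (the file names are placeholders, embedQuery and showResults stand in for the app's existing embedding and rendering logic, and rankChunks is the cosine similarity sketch from above):

declare function embedQuery(query: string): Promise<number[]>;      // assumed wrapper around transformers.js
declare function rankChunks(q: number[], chunks: any[]): any[];     // see the cosine similarity sketch above
declare function showResults(ranked: any[], fullText: string): void; // stand-in for the existing UI code

async function progressiveSearch(query: string) {
  // Text and coarse index are small and load first.
  const text = await fetch("bible-text.json").then((r) => r.json());
  const coarse = await fetch("bible-coarse-index.json").then((r) => r.json());
  const queryEmbedding = await embedQuery(query);

  // Show coarse results immediately...
  showResults(rankChunks(queryEmbedding, coarse), text);

  // ...then refine once the fine-grained index has arrived.
  const fine = await fetch("bible-fine-index.json").then((r) => r.json());
  showResults(rankChunks(queryEmbedding, fine), text);
}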

A gzipped JSON already offers quite good compression and is easy to handle with pako.js. However, a modern format like Parquet would offer further benefits such as speed and better compression. With https://loaders.gl/docs/modules/parquet/api-reference/parquet-loader, for example, it could be loaded fairly easily.
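Loading a gzipped JSON index with pako could look roughly like this (the URL is a placeholder; the shape of the parsed object is whatever format we settle on):

import { ungzip } from "pako";

// Fetch a .json.gz file, decompress it in the browser and parse it.
async function loadGzippedIndex(url: string): Promise<unknown> {
  const buffer = await fetch(url).then((r) => r.arrayBuffer());
  const json = ungzip(new Uint8Array(buffer), { to: "string" });
  return JSON.parse(json);
}

// e.g. const index = await loadGzippedIndex("bible-gte-200-4dec.json.gz");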

I would love to develop some kind of standardized format together with e.g. the main open-source vector DB organizations to make it interoperable. It would be so cool to send such a standardized file around and convert a "semantic JSON" file from the cloud directly into a SemanticFinder/Qdrant/Milvus/Pinecone collection and vice versa! I will ping some folks and hear their opinions about it.

For example, one could simply copy and paste a text into SemanticFinder and do the indexation there with a bit of trial and error (playing around with the settings, chunk sizes etc.). Once the right settings are found, just hit "export to Qdrant/Milvus/Pinecone" and you're done.
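To illustrate what such an export could look like for Qdrant (only a sketch; the collection name and host are placeholders, and the endpoint/payload shape follows Qdrant's REST points API as far as I know):

interface SemanticJson {
  title: string;
  embeddings: { text: string; embedding: number[] }[];
}

// Map the pre-indexed file to Qdrant-style points and upsert them.
async function exportToQdrant(file: SemanticJson, host = "http://localhost:6333") {
  const points = file.embeddings.map((e, i) => ({
    id: i,
    vector: e.embedding,
    payload: { text: e.text, title: file.title },
  }));

  await fetch(`${host}/collections/semanticfinder/points?wait=true`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ points }),
  });
}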

I would love to hear some opinions about this! (@VarunNSrivastava @lizozom)

do-me commented 8 months ago

Went ahead with https://github.com/do-me/SemanticFinder/commit/e41c5ad12cdd7958f0ab083106fe700808472dc5. It includes the full local import/export logic. I'm adding remote file import now to enable cloud-based archives.

It's pretty cool: a gzipped JSON index file for the whole Bible with ~23k embeddings, 5-decimal precision and 200-character chunks, including the text itself, weighs around 25 MB. 5 decimals of precision might even be overkill, as mentioned above. Dropping just one decimal, e.g. going from 5 to 4 decimals, removes 23,000 * 384 ≈ 8.8 million (!) characters from the JSON.

| Decimal precision | Size (MB) |
| --- | --- |
| 2 | 11.719 |
| 3 | 17.104 |
| 4 | 22.066 |
| 5 | 26.172 |
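The rounding step behind this trade-off is trivial, e.g. (hypothetical helper):

// Round every embedding value to a fixed number of decimals before serializing.
function roundEmbedding(embedding: number[], decimals: number): number[] {
  const factor = 10 ** decimals;
  return embedding.map((v) => Math.round(v * factor) / factor);
}

// roundEmbedding([0.23417, -0.04281], 2) -> [0.23, -0.04]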

I've attached the 4-decimal file here for quick tests: bible-gte-200-4dec.json.gz

Just import it and hit find as usual.

do-me commented 8 months ago

There it is:

https://huggingface.co/datasets/do-me/SemanticFinder

HF is a much better suited place for such a collection than GitHub. It offers some interesting features for the future, such as automatic Parquet conversion.

If you click "load" in the advanced settings, an example is loaded right away.