different-ai / embeddings-utils

moved here https://github.com/different-ai/embedbase/tree/main/sdk/embedbase-js
MIT License

[Question / Feature Request] splitting markdown, code, etc. levels & hierarchies #3

Open louis030195 opened 1 year ago

louis030195 commented 1 year ago

I think this is more food for thought for longer-term versions, something to keep in mind.

Would it make sense to have an abstraction that facilitates breaking down structured content such as Markdown or code, possibly at different levels and even as a hierarchy?

Examples

Markdown


# A dog story

Bob the dog was running in the grass...

## Dog colour

Bob hair was yellow...

# A cat story

Alice the cat was running in the grass...

## Cat colour

Alice hair was black...

import { split } from 'embeddings-splitter';
const anAnimalStory = '...'; // the Markdown above, e.g. read from a file
const chunks = split(anAnimalStory, { markdown: { levels: ['h1'] } });
console.log(chunks);

['# A dog story\n\nBob the dog was running in the grass...\n\n## Dog colour\n\nBob hair was yellow...', '# A cat story\n\nAlice the cat was running in the grass...\n\n## Cat colour\n\nAlice hair was black...']
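To make the `levels: ['h1']` idea above a bit more concrete, here is a minimal sketch of a heading-based splitter. `splitMarkdownByLevels` and `HeadingLevel` are hypothetical names, not part of embeddings-splitter. It assumes ATX headings only (`#`, `##`, ...) and does not try to skip `#` lines inside fenced code blocks.

```typescript
// Hypothetical helper, not an existing embeddings-splitter API.
type HeadingLevel = 'h1' | 'h2' | 'h3' | 'h4' | 'h5' | 'h6';

function splitMarkdownByLevels(markdown: string, levels: HeadingLevel[]): string[] {
  // Convert 'h1' -> 1, 'h2' -> 2, and so on.
  const depths = new Set(levels.map((level) => Number(level.slice(1))));
  const chunks: string[] = [];
  let current: string[] = [];

  for (const line of markdown.split('\n')) {
    // An ATX heading is 1 to 6 '#' characters followed by whitespace.
    const match = /^(#{1,6})\s/.exec(line);
    const startsChunk = match !== null && depths.has(match[1].length);

    // A heading at a requested level closes the chunk collected so far.
    if (startsChunk && current.length > 0) {
      chunks.push(current.join('\n').trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) {
    chunks.push(current.join('\n').trim());
  }
  // Drop chunks that contained only blank lines.
  return chunks.filter((chunk) => chunk.length > 0);
}

const story = '# A dog story\n\nBob...\n\n# A cat story\n\nAlice...';
console.log(splitMarkdownByLevels(story, ['h1']).length); // 2
```

Passing `['h1', 'h2']` instead would also cut at every `##` heading, so the same function covers the "different levels" case.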

Code / machine language

def foo():
  print("bar")

foo()

print("baz")

import os

if True:
  print("qux")

def something():
  print("something")

import { split } from 'embeddings-splitter';
const aCodeStory = '...'; // the Python source above, e.g. read from a file
const chunks = split(aCodeStory, { code: { levels: ['function'] } });
console.log(chunks);

['def foo():\n  print("bar")', 'def something():\n  print("something")']
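For the `levels: ['function']` case, a very rough line-based sketch could look like the following. `splitPythonFunctions` is a hypothetical name, and the approach is an assumption on my side: it only picks up top-level `def` / `async def` blocks by indentation, ignores decorators and class methods, and can be fooled by unindented lines inside multi-line strings. A real version would probably want a proper parser.

```typescript
// Hypothetical helper, not an existing embeddings-splitter API.
function splitPythonFunctions(source: string): string[] {
  const chunks: string[][] = [];
  let current: string[] = [];
  let inFunction = false;

  for (const line of source.split('\n')) {
    if (/^(async\s+)?def\s/.test(line)) {
      // A top-level definition starts a new chunk.
      current = [line];
      chunks.push(current);
      inFunction = true;
    } else if (inFunction && (line.trim() === '' || /^\s/.test(line))) {
      // Blank or indented lines still belong to the open function body.
      current.push(line);
    } else {
      // Any other unindented line ends the function and is not kept.
      inFunction = false;
    }
  }
  // Join each chunk and strip trailing blank lines.
  return chunks.map((lines) => lines.join('\n').replace(/\s+$/, ''));
}

const py = 'def foo():\n  print("bar")\n\nfoo()\n\ndef something():\n  print("something")';
console.log(splitPythonFunctions(py)); // one chunk per top-level function
```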

Hierarchies

Just an initial thought: think of something like [{ "h1": "A dog story", "children": [...] }, ...], i.e. a tree data structure could make sense.
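A minimal sketch of what such a tree could look like for Markdown, assuming ATX headings only. `SectionNode` and `buildMarkdownTree` are hypothetical names, and the node shape (`level`, `title`, `content`, `children`) is just one possible layout, not a settled design.

```typescript
// Hypothetical types and helper, not an existing embeddings-splitter API.
interface SectionNode {
  level: number; // 1 for '#', 2 for '##', ...
  title: string;
  content: string; // text directly under this heading, before any child heading
  children: SectionNode[];
}

function buildMarkdownTree(markdown: string): SectionNode[] {
  const roots: SectionNode[] = [];
  const stack: SectionNode[] = []; // path from a root down to the open section
  const all: SectionNode[] = [];

  for (const line of markdown.split('\n')) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match === null) {
      // Body text goes to the innermost open section; text before any heading is ignored.
      if (stack.length > 0) {
        stack[stack.length - 1].content += line + '\n';
      }
      continue;
    }
    const node: SectionNode = {
      level: match[1].length,
      title: match[2].trim(),
      content: '',
      children: [],
    };
    all.push(node);
    // Close every open section at the same or a deeper level.
    while (stack.length > 0 && stack[stack.length - 1].level >= node.level) {
      stack.pop();
    }
    if (stack.length === 0) {
      roots.push(node);
    } else {
      stack[stack.length - 1].children.push(node);
    }
    stack.push(node);
  }
  // Strip the blank lines that surround each section body.
  for (const node of all) {
    node.content = node.content.trim();
  }
  return roots;
}

const tree = buildMarkdownTree('# A dog story\n\nBob...\n\n## Dog colour\n\nYellow...');
console.log(JSON.stringify(tree, null, 2));
```

With a tree like this, splitting at a given level becomes a matter of walking the nodes down to that depth and flattening everything below it into one chunk.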

More thoughts

Why would we split at different levels?

PS: If I remember correctly, I saw somewhere that whitespace should be removed with the current OpenAI models for better performance (TODO: add ref)

louis030195 commented 1 year ago

example markdown splitter btw https://github.com/debanjum/khoj/blob/6c0e82b2d6b04f7d0944f122dc26d04063e3674e/src/khoj/processor/markdown/markdown_to_jsonl.py#LL100C5-L100C5