alondmnt / joplin-plugin-jarvis

Joplin (note-taking) assistant running a very intelligent system (OpenAI/GPT, Hugging Face, Gemini, Llama, Universal Sentence Encoder, etc.)
GNU Affero General Public License v3.0

option to split by paragraphs #11

Closed ahxxm closed 1 year ago

ahxxm commented 1 year ago

split by min(max_token, tokens_of_X_paragraphs)

pro:

cons:

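// pseudocode anyway: calc_tokens and calc_embeddings are placeholders
const embeddings = [];
let cur_split = [], cur_token = 0;
for (const p of paragraphs) {
  const tokens_p = calc_tokens(p);
  // flush the current split once adding p would reach max_tokens
  if (cur_split.length > 0 && (cur_token + tokens_p) >= max_tokens) {
    embeddings.push(calc_embeddings(cur_split));
    cur_split = [];
    cur_token = 0;
  }
  cur_split.push(p);
  cur_token += tokens_p;
}
// and the embeddings of the last split
if (cur_split.length > 0) {
  embeddings.push(calc_embeddings(cur_split));
}

alondmnt commented 1 year ago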

Hi! Thanks for the feedback and suggestions.

Note that blocks are split by sentences instead of words for the same reason you mentioned (see here).
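
For readers following along, sentence-based splitting could look roughly like the sketch below. It is not Jarvis's actual code: `splitSentences`, `packSentences`, and the whitespace-based `countWords` are hypothetical names chosen for illustration, and the punctuation regex is a simplification that will mis-split things like decimals or abbreviations.

```javascript
// Hypothetical sketch; not Jarvis's actual implementation.

// Split text after a run of sentence-ending punctuation (Latin and CJK).
function splitSentences(text) {
  return text
    .split(/(?<=[.!?。！？])(?![.!?。！？])\s*/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Greedily pack whole sentences into blocks of at most maxTokens.
// A single sentence longer than maxTokens still becomes its own block.
function packSentences(sentences, maxTokens, countTokens) {
  const blocks = [];
  let current = [];
  let used = 0;
  for (const s of sentences) {
    const n = countTokens(s);
    if (current.length > 0 && used + n > maxTokens) {
      blocks.push(current.join(' '));
      current = [];
      used = 0;
    }
    current.push(s);
    used += n;
  }
  if (current.length > 0) blocks.push(current.join(' '));
  return blocks;
}

// Assumed token counter: whitespace-separated words.
const countWords = (s) => s.split(/\s+/).length;

const text = 'One two three. Four five! Six seven eight nine? Ten.';
console.log(packSentences(splitSentences(text), 5, countWords));
// [ 'One two three. Four five!', 'Six seven eight nine? Ten.' ]
```

Packing whole sentences keeps each block readable on its own, at the cost of blocks that are sometimes shorter than the limit.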

ahxxm commented 1 year ago

Ah, thanks, the code is quite similar! I have some articles in languages that don't separate words with spaces. It seems I can still use split(/\s+/).length (which basically splits by paragraphs for such text), and hopefully the paragraphs are shorter than max_size.
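
To make the concern concrete, here is a minimal sketch of how a whitespace-based count behaves on text without spaces. `countByWhitespace` and `estimateTokens` are hypothetical helpers, not Jarvis functions, and the per-character fallback is only a rough heuristic assumed for illustration, not a real tokenizer.

```javascript
// Hypothetical helpers for illustration; not part of Jarvis.

// Counts whitespace-separated chunks, as in split(/\s+/).length.
function countByWhitespace(text) {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Rough heuristic (an assumption, not a real tokenizer): count each
// kana/CJK character as one unit, and whitespace-separated chunks for
// everything else (including leftover punctuation).
function estimateTokens(text) {
  const cjkPattern = /[\u3040-\u30ff\u4e00-\u9fff]/g;
  const cjk = text.match(cjkPattern) || [];
  const rest = text.replace(cjkPattern, ' ');
  return cjk.length + countByWhitespace(rest);
}

const english = 'split notes by paragraphs';
const chinese = '按段落拆分笔记。\n第二段也没有空格。';

console.log(countByWhitespace(english)); // 4
console.log(countByWhitespace(chinese)); // 2 (one per paragraph)
console.log(estimateTokens(chinese));    // 17
```

With the whitespace count, each paragraph of the Chinese sample counts as a single "word", so a long paragraph could exceed the model's limit without the size check noticing.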

alondmnt commented 1 year ago

That's a good point. I'll make a note to better support such languages in the next release.