Potentially repeated embedding of old notes

wenlzhang commented 1 year ago

Question 1

I updated to a newer version today and noticed that some notes may have been embedded again. Here is what I observed.

In the vault for testing Vault Chat, I only have one note, which is converted from a PDF article. The note has 6336 words in total, as indicated by Obsidian.

Yesterday, I asked several questions. In the OpenAI usage log, the following record has the highest token usage; the other records each consumed fewer than 500 tokens. I assume this record corresponds to embedding the entire note.

text-embedding-ada-002-v2, 4 requests
5,685 prompt + 0 completion = 5,685 tokens

Today, after updating the plugin, I also asked a few questions. I noticed that the following record has nearly the same token usage as the previous one, so I wonder whether the note was embedded again.

text-embedding-ada-002-v2, 2 requests
5,682 prompt + 0 completion = 5,682 tokens

Question 2

A related question: when a note is moved to a different folder within Obsidian, is it embedded again? Or is some information cached or saved to avoid this?

I ask because the file database2.json includes the full path of the note. I assume the path is stored either to prevent re-embedding or is what causes it.

kristenbrann commented 1 year ago

Will look into Question 1.

On Question 2, the way it is set up currently, it does run embedding on the note again if you move it or rename it. The thought process here was that the path your note is in could be contextually relevant. For example:

move path: "Foods I hate -> Chicken Recipes.md" is meaningfully different from "Foods I love -> Chicken Recipes.md".
rename: "Foods I hate.md" is meaningfully different from "Foods I love.md"

Very much open to suggestions and insights here, but that was the reasoning we used originally!
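
To illustrate the reasoning above, here is a minimal sketch of why a move or rename leads to a new embedding when the path is part of the embedded text. It assumes the path is simply prepended to the note content; `embedding_input` and `fingerprint` are hypothetical names for illustration, not Vault Chat's actual code.

```python
import hashlib


def embedding_input(path: str, content: str) -> str:
    # Assumption: the text sent for embedding is the path followed by the content.
    return f"{path}\n\n{content}"


def fingerprint(path: str, content: str) -> str:
    # A hash of the embedding input; if it changes, the stored embedding is stale.
    return hashlib.sha256(embedding_input(path, content).encode("utf-8")).hexdigest()


content = "Roast the chicken until the skin is crisp."
before = fingerprint("Foods I hate/Chicken Recipes.md", content)
after = fingerprint("Foods I love/Chicken Recipes.md", content)

# The content is identical, but the embedding input differs after the move.
moved_changes_input = before != after
```

Under this assumption, any change to the path changes the input text, so the whole note has to be sent for embedding again even though its content is untouched.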

wenlzhang commented 1 year ago

The thought process here was that the path your note is in could be contextually relevant.

I understand the reasoning here.

On the other hand, I think whether the path is contextually relevant also depends on how the vault is structured and used. For example, I may have a note in the Inbox folder and move it to another folder at a later stage. This change of folder does not change the note's contextual meaning, so it is not necessary to embed the note again.

To address this, maybe there can be the following measures:

The note content, file name and file path are processed and passed separately to OpenAI. This way, the change in file name and path would not cause the re-embedding of note content.
There can be separate options in the configuration that let users choose whether to re-embed notes when they are renamed or moved.
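
The two measures above could be combined roughly as follows. This is only a sketch under stated assumptions: `EmbeddingCache`, `reembed_on_move`, and the injected `embed` callable are hypothetical and stand in for whatever the plugin actually uses to call OpenAI and store vectors.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def content_key(content: str) -> str:
    # The cache key depends on the content only, never on the path.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class EmbeddingCache:
    def __init__(self, embed: Callable[[str], Vector], reembed_on_move: bool = False):
        self.embed = embed                      # hypothetical embedding call
        self.reembed_on_move = reembed_on_move  # hypothetical user setting
        self.content_vectors: Dict[str, Vector] = {}
        self.path_vectors: Dict[str, Vector] = {}
        self.last_path: Dict[str, str] = {}

    def get(self, path: str, content: str) -> Tuple[Vector, Vector]:
        key = content_key(content)
        known = key in self.content_vectors
        moved = known and self.last_path[key] != path
        # Embed the (large) content only if it is new, or if the user opted in
        # to re-embedding on move or rename.
        if not known or (moved and self.reembed_on_move):
            self.content_vectors[key] = self.embed(content)
        self.last_path[key] = path
        # The path is embedded separately; it is short, so this is cheap.
        if path not in self.path_vectors:
            self.path_vectors[path] = self.embed(path)
        return self.content_vectors[key], self.path_vectors[path]
```

With `reembed_on_move` left off, moving a note from Inbox to another folder would cost only the few tokens needed for the new path, not the thousands needed for the note body.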

I guess this is also related to another issue, i.e., whether to re-embed the note if the content is updated. This can be especially important for large notes in my case. To address this, there may be the following measures:

One can set a percentage value in the configuration as the threshold for re-embedding the note. For instance, if the note content change is larger than 20%, then re-embed the note.
One can choose to always re-embed notes that are relatively small, since this would not use many tokens. For example, if the note contains fewer than 500 words, Vault Chat always re-embeds it.

Of course, these two options can be combined in some way.
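
One possible way to combine them is sketched below. The function name `should_reembed` and both parameters are hypothetical, and measuring change as a word-level difference ratio is my assumption; other measures of "how much changed" would work too.

```python
from difflib import SequenceMatcher


def should_reembed(
    old: str,
    new: str,
    change_threshold: float = 0.20,  # hypothetical setting: fraction of change
    small_note_words: int = 500,     # hypothetical setting: size cutoff in words
) -> bool:
    if old == new:
        return False  # nothing changed, nothing to embed
    new_words = new.split()
    if len(new_words) < small_note_words:
        return True  # small notes are cheap, so always re-embed them
    # Word-level similarity in [0, 1]; 1.0 means identical word sequences.
    similarity = SequenceMatcher(None, old.split(), new_words, autojunk=False).ratio()
    return (1.0 - similarity) > change_threshold
```

For a large note like mine, a small edit would stay below the 20% threshold and skip the re-embedding, while a major rewrite would trigger it.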

exoascension / vault-chat

Potentially repeated embedding of old notes #13