brianpetro / jsbrains

A collection of low-to-no dependency modules for building smart apps with JavaScript
https://jsbrains.org

FR (Smart Chunks): Improved handling of common headings #3

Open · LeoLDLeo opened 7 months ago

LeoLDLeo commented 7 months ago

Discussed in https://github.com/brianpetro/obsidian-smart-connections/discussions/416

Originally posted by **Levani307** on January 16, 2024:

Hello, I am a Smart Connections supporter and I love using this plugin to find related notes in my Obsidian vault. However, I am encountering a problem. My daily notes follow the same structure and block names (e.g. `## Thoughts`, `## Notes Created Today`, etc.). Smart Connections matches notes with the same structure and returns other notes with the same block names, even when the topics and themes are not closely related. If I remove the block names, finding related notes becomes a bit easier, but I lose the benefit of having a consistent format for my daily notes. Is there a way to make Smart Connections ignore the block names and focus on the content of the notes instead? Or is the only option to remove any structure from my daily notes so that Smart Connections can find more relevant matches? I would appreciate any ideas or help from anyone who has had the same issue. Thank you.

eamonnvi commented 7 months ago

Yes, I hit this issue too. I worked around it by serialising my note titles and copying the original title into the note as a footnote (with a bash script, I think); that stopped the title block from outweighing the content in the similarity listing. However, it does make the results a little more opaque to the human eye. It would be good to have a switch to include/exclude the title. Of course, this may just be a measure of my incompetence with this amazing technology, and there may be easier ways of achieving the desired result.

brianpetro commented 7 months ago

@Levani307 @eamonnvi thanks for raising this issue.

I think that being able to toggle the inclusion of headings makes a lot of sense, though it would come at the cost of losing context, which could also have a negative impact on results.

There are some alternative approaches that I'd like to explore. For example, after the initial embedding score, you can "re-rank" the results using various methods; these can range from simply reducing scores based on shared headings to processing the results through another AI model. Another method would be creating up to three embeddings for the same content: 1) the existing embedding, 2) the content only, and 3) the path (file path plus headings), then calculating a final score from all three. A rough sketch of that weighted-score idea follows.
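
For illustration, here's a minimal sketch of what that three-embedding weighted score could look like. The `full`/`contentOnly`/`path` property names and the weights are hypothetical, not how the plugin actually stores embeddings:

```js
// Minimal sketch: combine three embedding similarities into one score.
// Assumes each block carries three vectors from the same embedding model;
// the property names and weights are illustrative, not the plugin's API.

function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

function cosineSimilarity(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Hypothetical weights: favour the content, discount the shared path/headings.
const WEIGHTS = { full: 0.5, contentOnly: 0.4, path: 0.1 };

function combinedScore(queryVec, block) {
  return (
    WEIGHTS.full * cosineSimilarity(queryVec, block.full) +
    WEIGHTS.contentOnly * cosineSimilarity(queryVec, block.contentOnly) +
    WEIGHTS.path * cosineSimilarity(queryVec, block.path)
  );
}
```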

Long-term, the best method, and the one I have my sights on, is likely a combination of these that is unique to each user. This would look like a reinforcement-learning layer that adjusts the weights of the various score inputs based on feedback.

This problem also pops up with template files, or empty notes with similar headings, being erroneously surfaced. I have a solution for that which will be implemented relatively soon (before the v2 general release), and I think it might be similarly helpful for this issue. In short, it's a variation of the re-ranking mentioned above.

I'm also going to think about an option to simply toggle off the headings in the embeddings. If it can be done relatively easily (it probably can), then I'll do that too.

Thanks for the helpful feedback & support 😊

Brian 🌴

eamonnvi commented 7 months ago

On 17 Jan 2024, at 14:10, WFH Brian wrote:

> I'm also going to think about an option to simply toggle off the headings in the embeddings. If it can be done relatively easily (it probably can), then I'll do that too.

I am pretty new to all this and I need to improve my understanding of how context works. But the solution proposed at the end of your email, quoted above, would work for my use case, which is the following:

I do most of my reading on a Kindle and highlight copiously. I recently exported all my highlights. The default output is a single document containing all the highlights from a particular book. Wanting finer granularity for Obsidian use, I split each document into individual files, titled with the name of the book and the author, an index number, and a link to the position in the Kindle file. This resulted in about 15,000 files/notes. Selecting a particular note in the Smart Connections Files tab therefore had a tendency to surface other highlights from the same book/author. Anonymising the files by removing the author name/book title from the file title, leaving only a five-digit number (rather similar to the Zettelkasten naming process), produced material from a more diverse range of books/authors.

However, I do realise that context is important in generating better responses. This became apparent from a diametrically opposite experiment I conducted. I exported the 69 chapters (average length 2k words/chapter) of a novel I am writing into Obsidian. I used the BGE-small embedding model and the gpt-4-turbo (128k) Smart Chat model (ouch!) and asked Smart Connections chat for a summary of each chapter. The results were extremely good.

I then asked for a character analysis of each of the characters. Once again, the results were very accurate. But this time, instead of a piece of discursive prose, I was given a structured text of 12 numbered points. It became clear that the Context Code Blocks for this kind of question were different from those for the summarising questions.

I am delighted by the results, but I would like to understand more about how the context works. Could you point me to an authoritative source, please?

One point: I’m not sure if I have managed to get the ADA embedding model to work. How would I check? Does it create a differently named .json file?

Smart Connections is a brilliant intervention and I am very happy to have joined the community.

Eamonn

brianpetro commented 7 months ago

@eamonnvi Thank you for your feedback and for sharing your use cases. I'm glad to hear that Smart Connections has been helpful for you.

Regarding your question about understanding how context works, context in Smart Connections is determined by the embeddings of the text. Embeddings are vector (numerical) representations of the text that capture its semantic meaning. The model uses these embeddings to calculate the similarity between different pieces of text.

In the case of your experiment with the novel chapters, in v1 the results depend on how you ask the question, due to the use of the HyDE method. It's kind of like a secondary search query being generated by GPT prior to the actual retrieval.
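
For anyone curious, here's a minimal sketch of that HyDE flow. `chatComplete` and `embed` are hypothetical helpers standing in for a chat-completion call and an embedding call, not Smart Connections' actual API:

```js
// Minimal sketch of HyDE (Hypothetical Document Embeddings).
// `chatComplete` and `embed` are hypothetical async helpers.

function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function hydeSearch(userQuestion, notes) {
  // 1. Ask the chat model to draft a plausible answer to the question.
  const hypothetical = await chatComplete(
    `Write a short passage that answers: ${userQuestion}`
  );
  // 2. Embed the hypothetical answer instead of the raw question.
  const queryVec = await embed(hypothetical);
  // 3. Rank notes by similarity between their embeddings and the query.
  return notes
    .map((note) => ({ note, score: cosineSimilarity(queryVec, note.vec) }))
    .sort((a, b) => b.score - a.score);
}
```

Because the retrieval matches against a hypothetical *answer* rather than the raw question, how you phrase the question can noticeably change which chunks are retrieved.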

As for the ADA embedding model, it should create a .json file with the same name as the model. You can check whether it's working by looking for this file in the .smart-connections directory. If the file is present, it indicates that the ADA embeddings are being created successfully.

Good resources for learning about embeddings really depend on the specifics of what you're trying to achieve. Common formulas for calculating similarity are the "dot product" and "cosine similarity", and linear algebra is the relevant field of study. There's also a wide range of other relevant skills for utilising embeddings, ranging from chunking methods to retrieval strategy.
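
For reference, the two formulas mentioned above, for embedding vectors $\mathbf{a}$ and $\mathbf{b}$ of dimension $n$:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i \qquad \cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$$

When the vectors are normalised to unit length, as many embedding models' outputs are, the two measures coincide.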

Thank you for your kind words and support. I'm glad to have you as part of the Smart Connections community. If you have any more questions or need further assistance, feel free to ask.

Brian 🌴

brianpetro commented 7 months ago

Note: two possible methods of improvement (a sketch of the first follows the list):

1. **Smart View filter**: similar to brianpetro/obsidian-smart-connections#423, exclude or reduce the weight of results containing specified headings, affecting results on a per-search basis. The filter may persist, but when it is removed the results return to baseline.

2. **Plugin-level configuration**: a setting allows exclusion of headings during the block-parsing process, affecting all results until the exclusion is removed and the blocks are re-embedded.
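
A minimal sketch of the per-search filter idea; the result shape, heading list, and penalty factor here are illustrative assumptions, not the plugin's actual data model:

```js
// Down-weight results whose blocks contain excluded headings.
// `results` is assumed to look like [{ path, headings: string[], score: number }, ...].

const EXCLUDED_HEADINGS = ['Thoughts', 'Notes Created Today'];
const PENALTY = 0.5; // hypothetical down-weighting factor; 0 would fully exclude

function applyHeadingFilter(results, excluded = EXCLUDED_HEADINGS) {
  return results
    .map((r) => ({
      ...r,
      score: r.headings.some((h) => excluded.includes(h))
        ? r.score * PENALTY
        : r.score,
    }))
    .sort((a, b) => b.score - a.score);
}
```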