brianpetro / obsidian-smart-connections

Chat with your notes & see links to related content with AI embeddings. Use local models or 100+ via APIs like Claude, Gemini, ChatGPT & Llama 3
https://smartconnections.app
GNU General Public License v3.0

Excluded folders seem to be included in smart-connections #791

Closed: Ocean-Tang closed this 6 hours ago

Ocean-Tang commented 5 days ago

There is a problem, as the title says: excluded folders still appear to be included by smart-connections.

brianpetro commented 5 days ago

@Ocean-Tang thanks for bringing this to my attention 😊

It should be fixed in v2.2.75 🌴
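For anyone curious what the fix has to do conceptually, here is a minimal sketch assuming a simple prefix match of note paths against an excluded-folder list before anything is embedded. The function and variable names are illustrative, not the plugin's actual code.

```ts
// Illustrative only: skip notes under excluded folders before they are embedded.
// `excludedFolders` stands in for whatever exclusion setting the plugin reads.
function isExcluded(notePath: string, excludedFolders: string[]): boolean {
  // Normalize to forward slashes so "Private" matches "Private/journal.md".
  const normalized = notePath.replace(/\\/g, "/");
  return excludedFolders.some((folder) => {
    const prefix = folder.endsWith("/") ? folder : folder + "/";
    return normalized === folder || normalized.startsWith(prefix);
  });
}

function filterNotesForEmbedding(allNotePaths: string[], excludedFolders: string[]): string[] {
  return allNotePaths.filter((p) => !isExcluded(p, excludedFolders));
}

// Example: "Private/clients/acme.md" is dropped, "Projects/plan.md" is kept.
const kept = filterNotesForEmbedding(
  ["Private/clients/acme.md", "Projects/plan.md"],
  ["Private"]
);
console.log(kept);
```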

jwhco commented 5 days ago

This didn't seem to be broken before. The .smart-connections folder had a multi directory. Now that has jumped to .smart-env, which seems like an unnecessary move.


@brianpetro It looks like things jump around. Now that you have a stable product do you have a launch plan, or growth strategy so you aren't flooded with requests?

brianpetro commented 5 days ago

@jwhco trust the process 🧘🌴

brianpetro commented 5 days ago

@jwhco the community organized these Lean Coffee chats so you have the opportunity to ask questions like those 📅 https://x.com/wfhbrian/status/1836531606569250983 🌴

SinewaveLaboratorium commented 2 days ago

Oops, there go all my private notes to OpenAI. I knew it was just a matter of time.

brianpetro commented 1 day ago

FWIW, they do claim that data sent to the API is not used for training purposes.

But the concern is also why I have a local embedding model set as the default for new users 🌴
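A minimal sketch of what that default might look like as a settings object; the field names and the model name are hypothetical, not the plugin's actual configuration.

```ts
// Hypothetical settings shape: new users default to a local embedding model,
// so note text never leaves the machine unless they opt into an API provider.
interface EmbeddingSettings {
  provider: "local" | "openai";
  model: string;
  apiKey?: string; // only needed when provider is "openai"
}

const defaultSettings: EmbeddingSettings = {
  provider: "local",              // embeddings computed on-device
  model: "TaylorAI/bge-micro-v2", // example local model; the actual default may differ
};
```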

jwhco commented 22 hours ago

> data sent to the API is not used for training purposes.

It's all used for training. My vault contains specific insights designed to taint the model.

Within a month, I retrieved these insights from a different session in the same model.

Of course, since I know the insights are false, I ignore them when they arise in the local chat.

My theory is that the LLM cannot tell if something is false if it is told the fact is true.
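A hedged sketch of the "tainted insight" idea as described here: plant a unique, fabricated but plausible-sounding fact in the vault, then later check whether it resurfaces elsewhere. Everything below is illustrative; the thread does not include the actual method.

```ts
import { randomBytes } from "node:crypto";

// Build a canary: a fabricated persona plus a specific, checkable detail.
// If this exact detail ever comes back from a fresh model session that had
// no access to the vault, something carried it there.
function makeCanary() {
  const salt = randomBytes(3).toString("hex"); // keeps each canary unique
  return {
    persona: "John Galt, Dagny Taggart's love interest",
    detail: `Their first date was at the Blue Heron Diner on pier ${salt}.`,
  };
}

const canary = makeCanary();
// Paste `canary.detail` into a vault note, record the salt, and wait.
console.log(canary);
```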

SinewaveLaboratorium commented 12 hours ago

@brianpetro, I know they state that, and also that they only retain the data on their servers for 30 days, which is why I am (was) not too concerned. However, that may change after the allegations from @jwhco. To him I ask: can you explain how this could be replicated and tested?

If I understand it correctly, you

  1. provide false insights from your vault to OpenAI's model via an embedding API using Smart Connections?
  2. And then within a month you can retrieve those false insights how? Just using normal ChatGPT, the Playground? I assume you don't mean using Smart Connections chat, because that uses your notes as context, so obviously it will be able to retrieve the false insights, without model training being required.

> I retrieved these insights from a different session in the same model.

What does "the same model" mean? Embedding models (e.g. text-embedding-3-large) are different from the regular chat models (e.g. gpt-4-1106). You cannot query an embedding model, so when retrieving info, how can you be using the same model? The same model as what?
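To make the distinction concrete, here is roughly what the two call shapes look like against the OpenAI REST API; the endpoints are as documented, and the model names are just examples:

```ts
const OPENAI_KEY = process.env.OPENAI_API_KEY;

// Embedding models (e.g. text-embedding-3-large) return a vector of numbers;
// there is nothing to "ask" them and no text comes back.
async function embed(text: string): Promise<number[]> {
  const res = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: { Authorization: `Bearer ${OPENAI_KEY}`, "Content-Type": "application/json" },
    body: JSON.stringify({ model: "text-embedding-3-large", input: text }),
  });
  const json = await res.json();
  return json.data[0].embedding;
}

// Chat models (e.g. gpt-4o) are what you actually query with a prompt.
async function ask(prompt: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: { Authorization: `Bearer ${OPENAI_KEY}`, "Content-Type": "application/json" },
    body: JSON.stringify({ model: "gpt-4o", messages: [{ role: "user", content: prompt }] }),
  });
  const json = await res.json();
  return json.choices[0].message.content;
}
```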

Just trying to understand, this is important for everyone. Cheers.

jwhco commented 7 hours ago

TL;DR: If you have a budget for a consultant, I can discuss the details. Otherwise, don't use private or proprietary data with an LLM you don't own, or at least not without controlling for inadvertent leaks.

At a high level, I'm curating a new association for the LLM with made-up but authoritative information. I can then recall that new association in a chat session with a different user on the same LLM.

False insights are like new characters in a story. In Smart Connections, I expect it to "learn" from notes within the local session. What would be undesirable is for a proprietary note to end up outside my Obsidian vault.

I didn't expect a model that says it doesn't learn from chat to recall something curated (beyond the session). It's more complex than I can explain here. You also have to be careful with confirmation bias.

My background is in risk management and business development. I was trying to develop a way to control for models that steal intellectual property. My theory is that the LLM retains those details once it confirms something is true.

My claim is that companies are training (or at least validating) with chat interactions, and this comes from a lot of testing. What works in one model doesn't work in another. I've only looked at a few on the leaderboard at https://lmarena.ai/

There is a lot of pressure to reduce hallucinations and find more training data. I'm seeing expected behavior (assuming the model wants to learn). I'm curious whether data leaks out of, or into, the LLM.

I write a lot of case studies for my consulting work. One character in my vault doesn't exist outside of that vault. It's a character, a named buyer persona, and the story around them is based on several real people.

Give them a unique yet believable name. I use a name generator, or, just off the top of my head, let's say that name is John Galt, the love interest of Dagny Taggart. (If I gave you one of these checksum characters, I'd invalidate my control.)

If I prompt "Who is John Galt?" in a search engine, AI chat, or Chatbot Arena, I don't expect to get specific details I made up. Most of the time, I get nothing or a huge hallucination. LLMs always give you a response, even if it is wrong.

Sometimes, you must prompt, "Who is John Galt, Dagny Taggart's love interest?" to provide context. A third prompt in the test might be "Tell me something unique or unusual about John Galt."
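A sketch of how those probes could be scripted against a chat model, assuming the planted canary detail from the vault is known in advance; this is one reading of the test, not the actual harness used.

```ts
// Ask the escalating prompts described above and flag any response that
// contains the planted detail (or a distinctive fragment of it).
const probes = [
  "Who is John Galt?",
  "Who is John Galt, Dagny Taggart's love interest?",
  "Tell me something unique or unusual about John Galt.",
];

async function runProbes(plantedFragment: string, ask: (p: string) => Promise<string>) {
  for (const prompt of probes) {
    const answer = await ask(prompt);
    const hit = answer.toLowerCase().includes(plantedFragment.toLowerCase());
    console.log(`${hit ? "HIT " : "miss"} | ${prompt}`);
    if (hit) console.log(answer);
  }
}

// Usage, reusing the `ask` helper sketched earlier in the thread:
// await runProbes("Blue Heron Diner", ask);
```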

I am looking for a first-date location, a new love interest, or another context that doesn't exist outside my notes. If this were a real person, there would be no way to keep legitimate training data from adding that context.

Several narratives, versions of case studies, and related facts might surround that character. Because I don't use the name outside of the chat session and vault, I don't expect there to be any associations to recall.

If I don't see what I was looking for, adding more detail could just produce a hallucination that confirms my bias. I check the model I'm using via OpenRouter or across sessions of other LLMs I'm using. While writing this, I checked llama-3.1.8b-instruct and got inferred "Professional Background", "Expertise", and "Presence", but mostly "Unfortunately, I couldn't find extensive information about this person."

I have a good hit rate when dropping something into ChatGPT and picking it up on the same model on LMSYS (now Chatbot Arena) or a different user session on ChatGPT. However, I'm not writing test cases.
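Repeating the check across models is straightforward if you go through an OpenAI-compatible gateway such as OpenRouter; a hedged sketch follows, where the model slugs are examples and may not match what is currently listed.

```ts
const OPENROUTER_KEY = process.env.OPENROUTER_API_KEY;

// OpenRouter exposes many models behind one OpenAI-style endpoint, so the same
// probe can be pointed at several LLMs in a loop.
async function askVia(model: string, prompt: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: { Authorization: `Bearer ${OPENROUTER_KEY}`, "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  const json = await res.json();
  return json.choices[0].message.content;
}

const modelsToCheck = [
  "openai/gpt-4o",
  "meta-llama/llama-3.1-8b-instruct",
  "anthropic/claude-3.5-sonnet",
];

async function main() {
  for (const model of modelsToCheck) {
    const answer = await askVia(model, "Who is John Galt, Dagny Taggart's love interest?");
    console.log(model, "->", answer.slice(0, 120));
  }
}

main();
```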

As soon as this was consistent for a few LLMs, I used it to curate and correct details about clients and companies. From a business development perspective, getting a desired response to a simple question is valuable. Many factors contribute to bias in models.

After all, ideas are not unique. Even in the world of intellectual property, there is a race to monetize before the competition figures it out. I learned that asking the LLM about something proprietary increases the chance it will infer details you want to be hidden.

It's either learning from the prompts, a clue from some other interaction, or there is a match due to cognitive bias. I'd have what's proprietary in a document management system, not a plain text note-taking application.

brianpetro commented 6 hours ago

Closing since the original issue is fixed 🌴