brianpetro / jsbrains

A collection of low-to-no dependency modules for building smart apps with JavaScript
https://jsbrains.org
MIT License

Smart Collections FR: Pinecone Adapter #4

Open arminta7 opened 1 year ago

arminta7 commented 1 year ago

Would it be possible to have the option to store the embeddings in Pinecone?

brianpetro commented 1 year ago

It's possible.

Integrating Pinecone would require:

I would consider doing this mainly because of performance, but calculating cosine similarity on my vault containing ~1,500 notes runs pretty smoothly at the moment.
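
For readers unfamiliar with the calculation mentioned here, a brute-force cosine similarity search can be sketched roughly as follows. This is a minimal illustration and not the plugin's actual code: the `{ path, vec }` record shape and the function names `cosineSimilarity` and `nearestNotes` are assumptions made for the example.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("vector lengths differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored note against the query and return the k best matches.
// `notes` is a hypothetical array of { path, vec } objects.
function nearestNotes(queryVec, notes, k = 5) {
  return notes
    .map((note) => ({ path: note.path, score: cosineSimilarity(queryVec, note.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Every query touches every stored vector, so the cost grows linearly with the number of notes. That is likely fine for a vault of ~1,500 notes and becomes more noticeable as the vault grows.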

Is the performance why you are asking about this? Or is there another reason?

Thanks!

arminta7 commented 1 year ago

My vault is about 20k notes. Part of it is performance. The other is being able to reuse the embeddings for other things rather than paying for the process multiple times.

brianpetro commented 1 year ago

20K is significantly more notes than I have tested with myself. Your embeddings.json file must be almost 2GB! And that's all pulled into memory, which would likely cause performance degradation on an average computer.

Regarding reusing the embeddings, the main issue is synchronization: the metadata limit for Pinecone is 10kb. Smaller notes fit entirely within that limit, but larger notes (more than roughly 10,000 characters) do not.

1) One way to get around this is to store references to the notes in the metadata (e.g., file.path), but that requires that the "secondary" applications have access to your notes file system.

2) Another possibility is to limit every embedded chunk to <10kb, which would decrease the maximum "chunk" size to about a third of what it is now (~8,000 tokens ~= 30,000 characters ~= 30kb). This way, you could avoid accessing your notes file system directly from "secondary" applications.
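
To make the arithmetic in option 2 concrete, here is a rough sketch of splitting a note into chunks that each stay under a byte budget. It assumes a 10,000-byte budget as a stand-in for the 10kb limit quoted above, and `splitByBytes` and `utf8Length` are hypothetical helper names, not part of the plugin or of any Pinecone client. A real integration would also need to leave room for the other metadata fields.

```javascript
// Number of bytes a single code point occupies in UTF-8.
function utf8Length(codePoint) {
  if (codePoint < 0x80) return 1;
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4;
}

// Split `text` into consecutive chunks whose UTF-8 size is at most `maxBytes`.
// Iterating with for...of walks whole code points, so no character is cut in half.
function splitByBytes(text, maxBytes = 10000) {
  if (maxBytes < 4) throw new Error("maxBytes must be at least 4");
  const chunks = [];
  let current = "";
  let currentBytes = 0;
  for (const ch of text) {
    const size = utf8Length(ch.codePointAt(0));
    if (currentBytes + size > maxBytes) {
      // Adding this character would exceed the budget, so close the chunk.
      chunks.push(current);
      current = "";
      currentBytes = 0;
    }
    current += ch;
    currentBytes += size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

With these assumptions, a 30,000-character note made only of ASCII text splits into three chunks, which matches the "about a third" estimate. Notes with many non-ASCII characters would produce more chunks for the same character count. The sketch also splits mid-word and mid-sentence; a practical version would probably prefer paragraph or heading boundaries.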

I feel option 2 goes against the Obsidian.md ethos of "owning your data" since all your notes would be hosted in the cloud.

Option 1 has its drawbacks, too. "Secondary" applications outside of Obsidian would be more difficult to develop. However, other Obsidian plugins (e.g., Smart Completions) will have no problem reusing the embeddings stored within Obsidian. So it depends on your use case.

What is the average number of notes in an Obsidian vault? If it's much more than what I've anticipated (<1000 notes), then I think option 1 could make sense for performance reasons. That said, performance has been an afterthought at this point. There is still likely a lot of low-hanging fruit in terms of performance that wouldn't require an additional API service provider.

I'm thinking out loud here, so any feedback would be appreciated.

Thanks!

arminta7 commented 1 year ago

Yes, the file is... unwieldy lol.

I don't know too much about the specifics of the different options. I know there's also something like Weaviate? Not sure if that's better. It is open source right?

Just checked and my largest note is ~4 million characters. And plenty of others over 10k.

As far as the average number of notes? I have no idea. I'm probably on the larger end, not the largest I've heard. I'm sure there are plenty over 1,000 notes.

brianpetro commented 1 year ago

Thanks for suggesting Weaviate. It's pretty comparable to Pinecone. Hosting your own instance looks non-trivial and may not be easy to package into the plugin. I'll have to look into it more before saying for sure.

It needs further research, but there should be a relatively simple solution to manage the vector calculations better. The storage file can be separated based on a cosine similarity clustering algorithm. Then the calculations could be prioritized based on the nearest cluster. I'm surprised I haven't seen anything like this, but I haven't looked much.
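
One possible reading of the clustering idea above, sketched under stated assumptions: group the stored vectors around a few centroids, then at query time rank the centroids and only scan the nearest cluster or clusters. This is an illustration, not the approach the plugin uses. The names `buildClusters` and `searchClusters`, the `{ path, vec }` record shape, and the choice of a simple k-means variant seeded with the first `k` vectors are all assumptions made to keep the example small and deterministic.

```javascript
function cosine(a, b) {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Index of the centroid most similar to `vec`.
function nearestCentroid(vec, centroids) {
  let best = 0;
  for (let c = 1; c < centroids.length; c++) {
    if (cosine(vec, centroids[c]) > cosine(vec, centroids[best])) best = c;
  }
  return best;
}

// Place each item in the cluster of its most similar centroid.
function assign(items, centroids) {
  const clusters = centroids.map(() => []);
  for (const item of items) clusters[nearestCentroid(item.vec, centroids)].push(item);
  return clusters;
}

// A few rounds of k-means using cosine similarity for assignment.
// Centroids start as the first k vectors (a simplification) and are
// updated to the plain mean of their members.
function buildClusters(items, k, rounds = 10) {
  let centroids = items.slice(0, k).map((item) => item.vec.slice());
  for (let r = 0; r < rounds; r++) {
    const clusters = assign(items, centroids);
    centroids = clusters.map((members, c) => {
      if (members.length === 0) return centroids[c]; // keep old centroid if cluster is empty
      const mean = new Array(members[0].vec.length).fill(0);
      for (const m of members) {
        m.vec.forEach((v, i) => { mean[i] += v / members.length; });
      }
      return mean;
    });
  }
  return { centroids, clusters: assign(items, centroids) };
}

// Scan only the `nProbe` clusters whose centroids best match the query.
function searchClusters(queryVec, index, nProbe = 1, topK = 5) {
  const probed = index.centroids
    .map((centroid, c) => ({ c, score: cosine(queryVec, centroid) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, nProbe);
  return probed
    .flatMap(({ c }) => index.clusters[c])
    .map((item) => ({ path: item.path, score: cosine(queryVec, item.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

The trade-off is that results become approximate: a relevant note sitting in a cluster that was not probed will be missed, so `nProbe` balances speed against recall. If each cluster were stored in its own file, only the probed clusters would need to be loaded into memory.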

I'll continue to look into this. Thanks for the feedback.

vguillet commented 1 year ago

I second this. Being able to pull embeddings from Pinecone would potentially allow leveraging purpose-made embedding tools capable of taking in a large variety of files (PowerPoints and PDFs, for example). This could in turn unlock better query responses while keeping a single base embedding repository shared across all tools that leverage personal data!

brianpetro commented 1 year ago

@vguillet I see you already commented on https://github.com/brianpetro/obsidian-smart-connections/issues/27 , thanks!

It's a similar idea. I still think a Pinecone/Weaviate integration will happen. But I need to learn more about how people are using them.

brianpetro commented 2 months ago

Recorded response https://youtu.be/J5ARc_91fzs