Better org-roam integration for org-similarity

Vidianos-Giannitsis commented 1 year ago

Hello,

I saw this package somewhere around Christmas and I really liked the idea behind it. What annoyed me was how it was based on files and not org-roam nodes.

I saw the latest commit was like a year or so ago, so I wasn't sure if you wanted to continue the project and thought of forking the code base to write some stuff to integrate it more tightly with org-roam. I did do some work on that and also wrote some documentation and stuff. There are still more ideas I have which I will work on when I have time, but this is what I have for now.

However, since then, I see you have started working on this package again and I also saw you talk about it on Mastodon. For that reason, I think there is no real point in me working on this alone, so I thought I should create this pull request for us to discuss.

In my files, I have all the variables prefixed by org-roam-similarity because of my initial ideas, but that's easily fixed. You can check all sections, but the most relevant are from here onward.

I would also love to help with documentation in this project.

For merging, I will need to sync my version with the master branch of org-similarity and then fix org-roam-similarity.el to not have anything already in org-similarity so they can co-exist and this can be sort of a subpackage of org-similarity.

Please tell me your opinion. Here for more discussion, Vidianos

brunoarine commented 1 year ago

Hello @Vidianos-Giannitsis !

Thank you so much for your ideas and your customization of org-similarity! I'm incredibly humbled by the fact that you really enjoyed this project! Here are a few preliminary considerations. Please correct me where needed.

(1) About making the package work with org-roam nodes, I can totally see how much value it could bring to org-roam users specifically, and I fully endorse your carrying out these additions you've proposed in the PR. (In fact, I'm one of those users who would benefit from such a thing.)

However, I'd rather keep org-similarity and org-roam-similarity as separate packages, and thus separate repositories. This is purely a personal choice, as I prefer to maximize the separation of concerns in order to increase code legibility and maintainability.

Sometime in the future, we could turn org-similarity into an organization account on GitHub, so that related repositories could all sit under its umbrella (just like there is org-roam/org-roam, but also org-roam/org-roam-ui, org-roam-bibtex, and so on).

(2) That said, org-roam-similarity could use org-similarity as a base dependency, leaving you more room to focus on the node-centered implementation and leaving both codebases tidier in the process.

To help you and others with that, I think it would be a good idea to have a "core function" in org-similarity that returns the results as a list rather than printing them directly to the current buffer or side buffer. That way, others can adapt their derived packages more easily. I'm gonna add this to the backlog.

(3) With regard to caching the similarity results somewhere (as you've suggested here), I would advise against it. Every time you add or remove a word in either the current buffer or any document in the target directory, the corpus and/or the relative frequency of the tokens are changed as well. That's why the TF-IDF matrix is recalculated at runtime. You mentioned though that doing so is very slow. If you could tell me more about it, I can try to fix this issue, but it shouldn't take more than a second to process a thousand notes even on an old laptop.
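To illustrate why cached scores go stale, here is a minimal pure-Python sketch. It is not org-similarity's actual code: the `tfidf` and `cosine` helpers, the smoothed IDF formula, and the three toy notes are all assumptions made for the example. It shows that editing one note changes the similarity score between two other notes that were not touched.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return one {token: weight} dict per document (hypothetical helper).

    Uses raw term counts and a smoothed inverse document frequency.
    """
    docs = [Counter(text.lower().split()) for text in corpus]
    n = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in docs for token in doc)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: count * idf[t] for t, count in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse {token: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


notes = ["emacs org mode notes", "org roam nodes", "python tfidf script"]
before = tfidf(notes)
score_before = cosine(before[0], before[1])

# Edit only the third note: it now mentions "org" as well.
notes[2] = "python tfidf script for org"
after = tfidf(notes)
score_after = cosine(after[0], after[1])

print(score_before, score_after)
```

In this toy corpus, "org" ends up in every note after the edit, so its IDF weight drops and the score between the first two notes drops with it, even though neither of them changed.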

(4) In recent weeks I've added an option for org-similarity to generate lists of ID properties rather than filenames (though ID links are used by some "org-mode purists", org-roam v2 users would greatly benefit from this feature). Please take a look at the develop branch; maybe the most recent refactoring can help you with your node functions.

(5) In the future, I intend to add a feature where the similarity search can optionally occur at heading level rather than file level. That feature, coupled with using org IDs, could potentially cover the org-roam use case entirely. However, I intend to perform a similarity search by walking recursively through specified org-mode directories. An org-roam-similarity package, on the other hand, should tap directly into emacsql and fetch org-roam nodes directly from its database in my opinion.

So despite keeping the two codebases separate, contributions in the shape of pull requests are more than welcome to the org-similarity repo, provided they don't deviate much from the project's scope (which is developing a solution for org-mode files in general). Likewise, I'd be eager to help you with contributions to the org-roam-similarity repo.

At the moment, I'm focusing on the wrap-up for a major release (v1.0). This version should have a great number of bug fixes, improvements, new features, and reasonable unit test coverage. Once I have it ready, I'll make it available on MELPA, which is going to be one of the biggest milestones for the project. So stay tuned! :)

Vidianos-Giannitsis commented 1 year ago

Hello @brunoarine. Many apologies for the delayed response, it's been a busy week for me.

> However, I'd rather keep org-similarity and org-roam-similarity as separate packages, and thus separate repositories. This is purely a personal choice, as I prefer to maximize the separation of concerns in order to increase code legibility and maintainability.
>
> Sometime in the future, we could turn org-similarity into an organization account on GitHub, so that related repositories could all sit under its umbrella (just like there is org-roam/org-roam, but also org-roam/org-roam-ui, org-roam-bibtex, and so on).

That's perfectly acceptable. I can make org-roam-similarity separate but dependent on org-similarity when I find time to work on it again.

> To help you and others with that, I think it would be a good idea to have a "core function" in org-similarity that returns the results as a list rather than printing them directly to the current buffer or side buffer. That way, others can adapt their derived packages more easily. I'm gonna add this to the backlog.
>
> (4) In recent weeks I've added an option for org-similarity to generate lists of ID properties rather than filenames (though ID links are used by some "org-mode purists", org-roam v2 users would greatly benefit from this feature). Please take a look at the develop branch; maybe the most recent refactoring can help you with your node functions.

Combining these two points of yours: the core function you are referring to needs to return exactly this, a list of ID properties. The way I did that in my original implementation was to grab the files from the Python script and find the nodes from those files. To be fair, this is a somewhat complicated function, but it is possible. With a function that collects a list of IDs, it would be truly trivial to convert them to nodes. For context, if we define a low-level function org-similarity--collect-ids to do this, these two lines of code do the conversion:

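;; org-similarity--collect-ids is the proposed (not yet existing) helper returning a list of org IDs.
(let ((id-list (org-similarity--collect-ids)))
  ;; Look up the org-roam node for each ID.
  (mapcar #'org-roam-node-from-id id-list))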

compared to the one I used which was 12 lines.

I will take a look at your develop branch to see more of it, but I think being able to spit out a list of IDs from the Python script is a great step in the right direction for moving forward with the org-roam stuff.

> With regard to caching the similarity results somewhere (as you've suggested here), I would advise against it

There is a chance it takes more time because of how I have done it for org-roam nodes. I will clone the latest version and try it again to tell you if that is so. The issue might not be in your stuff. In that case, if we use IDs instead of files, the function that takes the nodes will be blazing fast, because mapcar is pretty fast compared to all the stuff the current function is doing.

> An org-roam-similarity package, on the other hand, should tap directly into emacsql and fetch org-roam nodes directly from its database in my opinion.

That's an interesting idea. The org-roam SQL backend is truthfully one of the very few parts of org-roam I haven't used. It would be interesting to explore it more, however.

Also, a quick question, in case you have anything in mind: do you have any other ideas for integrating org-roam and org-similarity? So far, my only ideas were an org-roam-node-find style function that walks you through an org-similarity search first, and an org-roam-node-insert style function that sorts nodes by putting the similar ones at the top of the menu. I had thought of filtering by these nodes, which is almost done already if I want it, but I didn't feel it was that useful: realistically you want to view all nodes, so sorting makes more sense. I haven't found a solution for this yet, but I plan to look into it more. Besides those, I don't really have many other ideas, so I would love to hear any you have.
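The sorting idea can be sketched independently of the completion UI. The following is only a small illustration in Python, not org-roam or org-similarity API: `similar_first`, the IDs, and the ranking are all hypothetical. It keeps every node in the list, moves the similar ones to the front in ranked order, and leaves the rest in their original order.

```python
def similar_first(all_ids, ranked_similar_ids):
    """Return every ID, with the similar ones first in ranked order.

    IDs missing from the ranking keep their original relative order,
    because Python's sort is stable.
    """
    rank = {node_id: position for position, node_id in enumerate(ranked_similar_ids)}
    unranked = len(rank)  # sorts after every ranked ID
    return sorted(all_ids, key=lambda node_id: rank.get(node_id, unranked))


all_ids = ["a", "b", "c", "d", "e"]   # every node, in the menu's usual order
ranked_similar_ids = ["d", "b"]       # similarity search result, most similar first
ordered = similar_first(all_ids, ranked_similar_ids)
print(ordered)  # ['d', 'b', 'a', 'c', 'e']
```

Sorting rather than filtering means nothing disappears from the menu, which matches the goal of still being able to view all nodes.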

That's all I have to say for now. I won't have much time to tinker with this in the coming days; however, when I do, I will definitely be looking at what you are working on, especially the ID stuff. Will try to respond earlier next time :D

brunoarine / org-similarity

Better org-roam integration for org-similarity #23