Binding data before a predicate

Kungsgeten commented 4 years ago

Happy new year! I've experimented with adding org-ql to org-brain, and it has been a huge performance boost. See commit here (if you're interested): https://github.com/Kungsgeten/org-brain/commit/31603865b4164b9f218cf2cb091cdb32e0d20aa2

I had an idea to provide org-ql predicates for some of org-brain specific things, in case users want to query their org-brain content. Here's an example:

(org-ql--defpred ob-child-of (entry-name)
  "Return non-nil if current heading is an org-brain child of ENTRY-NAME."
  (let ((children
         (org-brain-children
          (if-let ((entry (assoc entry-name
                                 (mapcan #'org-brain--file-targets (org-brain-files)))))
              (or (org-brain-entry-from-id (cdr entry))
                  (cdr entry))
            (error "No entry found with name %s" entry-name)))))
    (member (org-brain-entry-at-pt) children)))

This works, but is slow since the let-clause has to be run for every headline in the query. In reality the let-clause is only needed to be run once, before the search starts, and then the results could be used in each try.

It would be nice to have a way to bind data before the search starts, and have that data be available in the defpred. Perhaps something like this:

(org-ql--defpred ob-child-of (entry-name)
  :let ((children
         (org-brain-children
          (if-let ((entry (assoc entry-name
                                 (mapcan #'org-brain--file-targets (org-brain-files)))))
              (or (org-brain-entry-from-id (cdr entry))
                  (cdr entry))
            (error "No entry found with name %s" entry-name)))))
  "Return non-nil if current heading is an org-brain child of ENTRY-NAME."
  (member (org-brain-entry-at-pt) children))

It could be though that this won't be needed in any other use case, and thus doesn't fit into org-ql. Another posibility could be for the BUFFERS-OR-FILES argument in org-ql-select to also allow a list of markers to headlines.

akirak commented 4 years ago

Hello, I am not the author of this package, but it looks like this issue is about extending org-ql--pre-process-query. It is not really specific to org-brain, so I am now interested in this thread.

Alternatively, you might be able to improve the performance using memoization.

Kungsgeten commented 4 years ago

@akirak Thanks! I hadn't read about memoize before. The following changes made it run much faster:

(defun org-brain-entry-from-name (entry-name)
  "Return entry with ENTRY-NAME."
  (if-let ((entry (assoc entry-name
                         (mapcan #'org-brain--file-targets (org-brain-files)))))
      (or (org-brain-entry-from-id (cdr entry))
          (cdr entry))
    (error "No entry found with name %s" entry-name)))

(memoize 'org-brain-entry-from-name 10)

(org-ql--defpred ob-child-of (entry-name)
  "Return non-nil if current heading is an org-brain child of ENTRY-NAME."
  (member (org-brain-entry-at-pt)
          (org-brain-children (org-brain-entry-from-name entry-name))))

alphapapa commented 4 years ago

Hi Erik,

Aha, welcome to the (or light dark) side! :) Haha, seriously though, I'm glad that org-ql is useful in org-brain.

Regarding the idea of binding data around a query: that's an idea I've thought about, although I can't seem to find any notes I've made about it. Generally, I think it would be helpful for data that is specific to each buffer searched in a query, like per-buffer to-do keywords. Caching data like that would avoid having to retrieve it every time the predicate is called.

For your specific use case, I think it's not necessary. I looked at your code and I think you can easily optimize it with a few changes. Here are some thoughts:

You shouldn't need to define a new predicate--and in fact, doing so is suboptimal, because the defpred macro doesn't define a "preamble" regexp, which prevents the biggest optimization that org-ql can do. (Ideally the macro could also be used to define a "preamble", but I think that wouldn't be feasible, because it would require refactoring how the --query-preamble function works, which would raise other issues. Let's say that that's a long-term idea that might be possible someday.)
Instead, your code should define queries that use existing predicates as much as possible, which org-ql can optimize better.

For example, instead of defining this predicate:

(org-ql--defpred org-brain ()
  "Return non-nil if entry could be a part of `org-brain'.
Entry should have an ID and `org-brain-entry-at-point-excludedp' should return nil."
  (and (org-entry-get (point) "ID")
       (not (org-brain-entry-at-point-excludedp))))

Which you then use in org-brain-headline-entries like:

(defun org-brain-headline-entries ()
  "Get all org-brain headline entries."
  (org-ql-select (org-brain-files) '(org-brain)
    :action #'org-brain--headline-entry-at-point))

Use existing predicates in a query, like:

(defun org-brain-headline-entries ()
  "Get all org-brain headline entries."
  (org-ql-select (org-brain-files)
    '(and (property "ID")
          (not (org-brain-entry-at-point-excludedp)))
    :action #'org-brain--headline-entry-at-point))

That should be faster, because the property predicate can be optimized to a whole-buffer regexp search. Of course, it depends on how many entries in a file have an ID property: if only some of them do, it will skip checking ones that don't; but if all of them do, it will have to check them all anyway. If it happened that org-brain IDs had a certain structure or prefix, you could write a query to jump directly to potential org-brain IDs, e.g. something like:

(and (regexp "^:ID: org-brain-")
     (property "ID")  ; Ensure the match is actually a property in a drawer
     (not (org-brain-entry-at-point-excludedp)))

Also, when possible, prefer to do as much work as possible in the :action function, rather than doing, e.g. mapcar across a list of results like you do here: https://github.com/Kungsgeten/org-brain/commit/31603865b4164b9f218cf2cb091cdb32e0d20aa2#diff-60847823822d6fb59f29ef038133d89fR701 In that example, you could do the work of the mapcar's lambda in the action function, which avoids consing another list.

When you must call custom functions in your queries, call them through org-ql--value-at when appropriate, which provides per-node caching. For example, you might benefit from calling org-brain--headline-entry-at-point through it, and likely org-brain-entry-at-point-excludedp as well (assuming they tend to be called repeatedly for a headline before the buffer changes).

Do you understand what I mean? Let me know if I should explain more or more clearly. :)

BTW, regarding memoize: I can't seem to find the relevant discussions right now, but Chris has said that he generally favors using purpose-built caching functions over memoize nowadays, because they can be much faster. That's why I wrote org-ql--value-at instead of using memoize. That's not to say that you shouldn't use memoize, but it's something to be aware of.

alphapapa commented 4 years ago

Also, @akirak: Yes, ideally --pre-process-query could be used to, e.g. replace (org-brain) with one of the query examples I gave. However, that function isn't designed to be extensible like that. I've thought about it, but it would be difficult, especially since the order of the clauses matters. I could imagine e.g. a defvar with a long list of lambdas which are applied in order, but that would also make debugging more difficult because the code would be spread out among lambdas stored in a variable, defined in macro calls spread out in the file, instead of being together in a single function. So there are tradeoffs to both designs. In the long term, I'd like to have a more flexible design, but it needs to be carefully done.

akirak commented 4 years ago

@alphapapa Thank you for the comment. I will keep it in mind.

Also, I liked memoization because memoize.el is so easy to use and convenient, but I didn't know of that discussion. Thank you for the info.

alphapapa commented 4 years ago

Ah, here's the note I was thinking of: https://github.com/alphapapa/org-ql/blob/d28d54c58842f0d31e40e8d21337360c9c238ba0/helm-org-ql.el#L212

Kungsgeten commented 4 years ago

I've been sitting for some hours now trying different ways to make org-brain-headline-entries faster, using org-ql. The first time the function is run, it takes quite a lot of time, and then it is very fast (thanks to the caching, I guess).

What I realized though was that it didn't matter much what I put into the query or action, it took about the same time. The most stripped down version I've tried is this:

(defun org-brain-headline-entries ()
  "Get all org-brain headline entries."
  (org-ql-select (org-brain-files)
    '(property "ID")
    :action '(identity 1)))

This takes about 20 seconds the first time it is run. It scans 25 files and finds 224 matches. Is that the expected speed? I have only tried it on my Windows config. During the first run I get lots of messages in the mini-buffer:

=> org-brain-headline-entries
Org mode fontification error in #<buffer clojure.org> at 122
Setting up indent for shell type sh
Indentation variables are now local.
Indentation setup for shell type sh
Org mode fontification error in #<buffer emacs.org> at 171
Org mode fontification error in #<buffer emacs.org> at 248
Org mode fontification error in #<buffer funcs.org> at 4
Org mode fontification error in #<buffer programmering.org> at 227
Org mode fontification error in #<buffer programmering.org> at 247
Org mode fontification error in #<buffer programmering.org> at 287
Showing all blocks ... done
Org mode fontification error in #<buffer programmering.org> at 305
Showing all blocks ... done
Org mode fontification error in #<buffer programmering.org> at 318
Showing all blocks ... done
Org mode fontification error in #<buffer programmering.org> at 324
Showing all blocks ... done
Org mode fontification error in #<buffer programmering.org> at 336
Showing all blocks ... done
Org mode fontification error in #<buffer programmering.org> at 363
Org mode fontification error in #<buffer python.org> at 55
Can’t guess python-indent-offset, using defaults: 4
Org mode fontification error in #<buffer python.org> at 92
org-ql: No headings in buffer: clojure.org.org
org-ql: No headings in buffer: ekonomi.org.org
org-ql: No headings in buffer: funcs.org
org-ql: No headings in buffer: kanske.org
org-ql: No headings in buffer: nytt.org
org-ql: No headings in buffer: r3.org

alphapapa commented 4 years ago

No, 25 files and 224 matches should be collected much faster than that. That fontification error appears to come from org-fontify-meta-lines-and-blocks, and I'm guessing Emacs is pausing for each error message, which is artificially inflating the runtime (I see a similar behavior when I'm running an Org Agenda command and an invalid diary sexp causes errors/warnings, which Emacs pauses to display repeatedly while I'm waiting for the agenda to appear...grrr).

It's hard to say what is causing the fontification errors. Once in a while I see those errors in my own config when doing random Org-related things. You might have some Org files with invalid syntax, so you might try using org-lint on them. If that doesn't help, try to reproduce them with emacs -q, to ensure the problem isn't something in your config. (Or you could use https://github.com/alphapapa/emacs-sandbox.sh instead of emacs -q, so you wouldn't have to set up the package system yourself.)

Kungsgeten commented 4 years ago

I deactivated a package named org-variable-pitch, which got rid of the fontification errors. That got me down to 13 seconds. I also tried to remove the files which org-ql complained about (almost empty files). That got me down to 10 seconds. I then tried the following change to org-brain-headline-entries:

(defun org-brain-headline-entries ()
  "Get all org-brain headline entries."
  (org-ql-select (mapcar (lambda (x)
                           (with-current-buffer (create-file-buffer x)
                             (insert-file-contents x)
                             (delay-mode-hooks (org-mode))
                             (current-buffer)))
                         (org-brain-files))
    '(property "ID")
    :action '(identity 1)))

This gets me down to 0.13 seconds! I've had similar problems during the development of org-brain, that using find-file-noselect is slow. In order for caching to work, and to not open endless buffers, I also tried this, which seems to work well:

(defun org-brain-headline-entries ()
  "Get all org-brain headline entries."
  (org-ql-select (mapcar (lambda (x)
                           (or (find-buffer-visiting x)
                               (with-current-buffer (create-file-buffer x)
                                 (insert-file-contents x)
                                 (delay-mode-hooks (org-mode))
                                 (setq buffer-file-name x)
                                 (not-modified)
                                 (current-buffer))))
                         (org-brain-files))
    '(property "ID")
    :action '(identity 1)))

The downside here is ofcourse that the mode-hooks aren't run. If I enable them I get about 1.9 seconds on the first call.

alphapapa commented 4 years ago

Yeah, it sounds like your config has Org doing a lot of initialization in the mode hooks. Even some built-in Org modules can be slow to initialize in large files, like org-indent-mode. So if those files aren't already open in buffers, org-ql will open them to search them, and that will delay the search. So what you're measuring here is not just org-ql's performance but how fast your Org configuration initializes an Org buffer, which also depends on the filesize, number of headings, etc.

So I'd recommend testing by writing a script to do something like this:

(mapc #'find-file-noselect (org-brain-files)). I think find-file-noselect won't return until the mode hooks are done, but you should verify that.
(org-ql-select ....

Also, I recommend using this macro for your benchmarking: https://github.com/alphapapa/emacs-package-dev-handbook#bench-multi-lexical It lets you compare forms, and it uses lexical-binding, which can make a big difference (and you always want your "production" code to use lexical-binding anyway).

Kungsgeten commented 4 years ago

There's also a lot of stuff in find-file-hook etc. Thank you for the tip regarding your development guide. A small question regarding measuring performance: How do you remove the org-ql cache? At the moment I'm restarting Emacs each time I want to check the speed.

alphapapa commented 4 years ago

The 3 caches are initialized here: https://github.com/alphapapa/org-ql/blob/fb830f8577bb6287d899948f696e33c96e877a40/org-ql.el#L85 You can reset them by re-evaluating those defvar forms (make sure to do all 3), or the equivalent.

alphapapa / org-ql

Binding data before a predicate #88