Closed NightMachinery closed 3 years ago
org-ql is slow, and not usable on large repositories of org-mode files
Could you elaborate? How many heading/files do you have? How slow is slow? I have >26k headings and can search them almost instantly.
does it support querying siblings?
Not now, but it is not too hard to support.
I like to see what such an information retrieval system would look like without being tied to org-mode
That's a well-established field. Database queries, DOM queries, XML, json, etc
@yantar92 commented on Oct 7, 2021, 9:26 AM GMT+3:30:
Could you elaborate? How many heading/files do you have? How slow is slow? I have >26k headings and can search them almost instantly.
Well, here is my notes directory:
❯ tokei
===============================================================================
Language Files Lines Code Comments Blanks
===============================================================================
Clojure 6 133 117 3 13
CSS 36 57704 49932 1469 6303
Dhall 1 8 7 1 0
Edn 3 88 56 25 7
Emacs Lisp 2 88 71 16 1
INI 1 1157 834 0 323
JavaScript 13 6361 5700 305 356
JSON 35 58729 58728 0 1
Julia 1 27 20 2 5
Markdown 327 3121 0 2184 937
Org 2538 114042 87324 425 26293
PHP 4 122 114 0 8
Python 4 38 22 8 8
Shell 1 8 5 1 2
SVG 1 1 1 0 0
Plain Text 13 9925 0 7802 2123
TOML 1 838 69 471 298
YAML 2 210 123 55 32
Zsh 7 52 40 8 4
-------------------------------------------------------------------------------
HTML 15 24879 24128 130 621
|- CSS 13 107 105 0 2
|- JavaScript 5 152 128 13 11
(Total) 25138 24361 143 634
===============================================================================
Total 3011 277790 227524 12918 37348
===============================================================================
Running a simple query such as
(org-ql-search (org-ql-search-directories-files
:directories (list (getenv "nightNotes"))
:recurse t)
'(tags "great"))
will take an awful amount of time (I canceled the command just now after waiting for some minutes), and will open all the files as buffers which will clutter my emacs. Doing this same search using grep
takes around one second.
And this was just my (mostly) manually taken notes directory. If I invoke the command on scraped content, I can only dream of ever getting a response!
elisp
sucks. I recently adapted an elisp
script to scrape Hacker News to org-mode files, and it is so slow it takes around 20 minutes to download a single story. (It was not a concurrent implementation, as elisp
concurrency is rather arcane.)
does it support querying siblings?
Not now, but it is not too hard to support.
Really? I thought about this some more yesterday, and sibling queries don't seem to provide much value, so I just gave up on them, but supporting them seems pretty difficult to me. How can a query such as
* root
*level=x* A
**level=x+y* C
*level=x* B
**level=x+y+3* D
Query language guide:
*level=x*: a child at the level `x`
be supported? It seems to me that the complexity in handling such a query is not too inferior to a regex engine in general. Perhaps a Prolog-style implementation?
I like to see what such an information retrieval system would look like without being tied to org-mode
That's a well-established field. Database queries, DOM queries, XML, json, etc
Can you show me some examples of the above query I have written being matched against JSON? I haven't really seen any complex query languages for JSON.
CSS selectors seem the closest thing to what I envision. Do you know of any (high-level) references on how they are implemented? Is there some technical jargon for this style of queries that I can search in Google Scholar?
Looking more at CSS selectors, they too seem deliberately limited to make the implementation easier. E.g., the sibling selector only checks for elements after the current one, and not before. This means that one needs to know the order of the siblings to match successfully.
Well, here is my notes directory: ... Total 3011 277790 227524 12918 37348
I do not really use many small Org files, but I have ~370k lines in my notes.org
. Searching is pretty fast.
Running a simple query such as
(org-ql-search (org-ql-search-directories-files
:directories (list (getenv "nightNotes"))
:recurse t)
'(tags "great"))
That's not a simple query... You cannot even search such query using grep
. (tags "great")
is trying to match inherited tags. No doubt it is not as fast a simple regexp search via grep
. If you want a real comparison, try the following:
(org-ql-query :from (org-ql-search-directories-files
:directories (list (getenv "nightNotes"))
:recurse t)
:where '(tags-local "great")
:select '(point))
Similar command should also work fast on scraped content. Regexp search is fast in Elisp (it is implemented in C after all).
If you want faster complex queries, regexps will simply not cut it. We must operate on specialised data structures/databases. I have made some progress on it using extended org-element-cache [1] and org-ql [2]. At least, inherited tag searches can work much faster now. More to come in future.
[1] https://github.com/yantar92/org
[2] https://github.com/yantar92/org-ql
- root level=x A *level=x+y C
level=x B *level=x+y+3 D
If you want to match C/D:
(and (level x+y) C (ancestors '(and (level x) A)))
(and (level x+y+3) D (ancestors '(and (level x) B)))
It seems to me that the complexity in handling such a query is not too inferior to a regex engine in general. Perhaps a Prolog-style implementation?
Note that org-ql
is not implemented purely using regexp. More complex predicates use arbitrary lookups utilising Org API - we can refer to parents, children, siblings of any given heading when checking for match.
Do you know of any (high-level) references on how they are implemented? Is there some technical jargon for this style of queries that I can search in Google Scholar?
(and (level x+y) C (ancestors '(and (level x) A)))
(and (level x+y+3) D (ancestors '(and (level x) B)))
@yantar92 Note that x
and y
are not instantiated variables; They are just the same value in the whole query. (I.e., we do not know or care what x
is, it can be any value.)
elisp sucks. I recently adapted an elisp script to scrape Hacker News to org-mode files, and it is so slow it takes around 20 minutes to download a single story. (It was not a concurrent implementation, as elisp concurrency is rather arcane.)
This is slightly offtopic, but...
You are probably doing something strange. I am using a web scraper library written purely in Elisp [1]. The library does not just download pages, but also parses them for useful metadata. I am even able to scrape all videos from a YouTube channel. The speeed is comparable with what I have seen in Python scrapers.
Note that x and y are not instantiated variables; They are just the same value in the whole query. (I.e., we do not know or care what x is, it can be any value.)
Then the first one is even easier
(and C (ancestors 'A))
The second is more tricky and may require a custom parentN
predicate
(and D (parent '(parent '(parent 'B))))
or
(and D (parentN 3 'B))
Note that x and y are not instantiated variables; They are just the same value in the whole query. (I.e., we do not know or care what x is, it can be any value.)
Then the first one is even easier
(and C (ancestors 'A))
The second is more tricky and may require a custom
parentN
predicate(and D (parent '(parent '(parent 'B))))
or
(and D (parentN 3 'B))
'x' should be the same (unspecified) value in the whole query. (I.e., A and B should be siblings.)
'x' should be the same (unspecified) value in the whole query. (I.e., A and B should be siblings.)
I see. Then, you will need the new sibling
predicate. Query may (depending how to implement the predicate) look like:
(and (parent '(parent '(parent '(and B (sibling A)))))
(or (and C (sibling 'D))
(and D (sibling 'C))))
P.S. Actually, your query will match nothing. Sibling nodes cannot have both A and B as parents.
Ok... I missed that A and B should be siblings, not C and D. My SQL fu is not good enough to construct such a complex request within a simple query language. You need a custom predicate utilising Elisp: (1) Match C; (2) According to list of matching C headings, construct a new query for D.
@yantar92 Thanks for taking the initiative to respond to this issue. :)
@NightMachinary As you've noticed, it will take Emacs a long time to open 3,000 Org files. Most of that time is spent initializing org-mode
in a buffer. There's another issue/PR here that explores deferring some mode hooks, which could improve that. But there's only so much that can be done.
Once a file is opened in a buffer, it can be searched quickly. For example, there's a user of org-rifle
who reported that he uses tens of thousands of Org files as a kind of database for a museum that he works at, and he uses org-rifle
to search it, and once all of the files are open in Emacs, it searches quickly.
If you don't want to have that many Org files open in Emacs--I wouldn't--then I'd suggest the old, "Doctor, it hurts when I..." approach. Org generally works better with fewer, smaller files, and org-ql
and org-rifle
are designed to treat headings as entries to be returned as search results, not files. I recommend thinking of an Org file as a notebook, not as a single sheet of paper (imagine how impractical it would be to search through 3,000 loose sheets of paper in a drawer).
For more information, please refer to the years of discussions on this topic in places like r/emacs and r/orgmode, several of which I've posted in.
@alphapapa (Feel free to close the issue.)
My goal in opening it was not to discuss org-ql
, I am just looking for prior art on implementing a hierarchical query system. I am already quite keen on writing a personal information retrieval system.
Basically, if you wanted to rewrite org-ql
in another language, which language, libraries, etc would you use? How will you index the data (what data structure?)? What will your query language look like?
I don't know. My having written org-ql
doesn't mean that I'm an expert on these things. I'd probably use a Lisp, because that's generally my preference. I'd probably use SQLite for the backend, as long as it worked reasonably well, because it's a proven, reliable system. Other than that, I can't offer any advice, beyond studying applicable literature and examples.
If you're after discussion, you'd probably get better responses on r/emacs or r/orgmode.
org-mode is a tree; To search it effectively, we need some information retrieval system that is aware of hierarchical relationships between its source documents.
org-ql
has some such queries:descendents
,children
, etc. But I am interested in seeing what the broader field of information retrieval has to offer on this, for three reasons:org-ql
is slow, and not usable on large repositories of org-mode filesorg-ql
seems to only have scratched the surface, for example, does it support querying siblings?does not match * root **** A ** _ * B