[DISCUSSION] What prior research is there on hierarchical document models (ala org-mode's headings) in information retrieval?

NightMachinery commented 3 years ago

org-mode is a tree; To search it effectively, we need some information retrieval system that is aware of hierarchical relationships between its source documents.

org-ql has some such queries: descendents, children, etc. But I am interested in seeing what the broader field of information retrieval has to offer on this, for three reasons:

org-ql is slow, and not usable on large repositories of org-mode files
org-ql seems to only have scratched the surface, for example, does it support querying siblings?
```
*** =root >down> (A AND-SIBLINGS B)=
**** matches
***** root
****** A
****** B
```

does not match * root **** A ** _ * B


- I like to see what such an information retrieval system would look like without being tied to org-mode

Does anyone have any useful references to share? Or is this a virtually untapped territory?

yantar92 commented 3 years ago

org-ql is slow, and not usable on large repositories of org-mode files

Could you elaborate? How many heading/files do you have? How slow is slow? I have >26k headings and can search them almost instantly.

does it support querying siblings?

Not now, but it is not too hard to support.

I like to see what such an information retrieval system would look like without being tied to org-mode

That's a well-established field. Database queries, DOM queries, XML, json, etc

NightMachinery commented 3 years ago

@yantar92 commented on Oct 7, 2021, 9:26 AM GMT+3:30:

Could you elaborate? How many heading/files do you have? How slow is slow? I have >26k headings and can search them almost instantly.

Well, here is my notes directory:

❯ tokei
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Clojure                 6          133          117            3           13
 CSS                    36        57704        49932         1469         6303
 Dhall                   1            8            7            1            0
 Edn                     3           88           56           25            7
 Emacs Lisp              2           88           71           16            1
 INI                     1         1157          834            0          323
 JavaScript             13         6361         5700          305          356
 JSON                   35        58729        58728            0            1
 Julia                   1           27           20            2            5
 Markdown              327         3121            0         2184          937
 Org                  2538       114042        87324          425        26293
 PHP                     4          122          114            0            8
 Python                  4           38           22            8            8
 Shell                   1            8            5            1            2
 SVG                     1            1            1            0            0
 Plain Text             13         9925            0         7802         2123
 TOML                    1          838           69          471          298
 YAML                    2          210          123           55           32
 Zsh                     7           52           40            8            4
-------------------------------------------------------------------------------
 HTML                   15        24879        24128          130          621
 |- CSS                 13          107          105            0            2
 |- JavaScript           5          152          128           13           11
 (Total)                          25138        24361          143          634
===============================================================================
 Total                3011       277790       227524        12918        37348
===============================================================================

Running a simple query such as

(org-ql-search (org-ql-search-directories-files
                   :directories (list (getenv "nightNotes"))
                   :recurse t)
     '(tags "great"))

will take an awful amount of time (I canceled the command just now after waiting for some minutes), and will open all the files as buffers which will clutter my emacs. Doing this same search using grep takes around one second.

And this was just my (mostly) manually taken notes directory. If I invoke the command on scraped content, I can only dream of ever getting a response!

elisp sucks. I recently adapted an elisp script to scrape Hacker News to org-mode files, and it is so slow it takes around 20 minutes to download a single story. (It was not a concurrent implementation, as elisp concurrency is rather arcane.)

does it support querying siblings?

Not now, but it is not too hard to support.

Really? I thought about this some more yesterday, and sibling queries don't seem to provide much value, so I just gave up on them, but supporting them seems pretty difficult to me. How can a query such as

* root
*level=x* A
**level=x+y* C

*level=x* B
**level=x+y+3* D

Query language guide:
*level=x*: a child at the level `x`

be supported? It seems to me that the complexity in handling such a query is not too inferior to a regex engine in general. Perhaps a Prolog-style implementation?

I like to see what such an information retrieval system would look like without being tied to org-mode

That's a well-established field. Database queries, DOM queries, XML, json, etc

Can you show me some examples of the above query I have written being matched against JSON? I haven't really seen any complex query languages for JSON.

CSS selectors seem the closest thing to what I envision. Do you know of any (high-level) references on how they are implemented? Is there some technical jargon for this style of queries that I can search in Google Scholar?

NightMachinery commented 3 years ago

Looking more at CSS selectors, they too seem deliberately limited to make the implementation easier. E.g., the sibling selector only checks for elements after the current one, and not before. This means that one needs to know the order of the siblings to match successfully.

yantar92 commented 3 years ago

Well, here is my notes directory: ... Total 3011 277790 227524 12918 37348

I do not really use many small Org files, but I have ~370k lines in my notes.org. Searching is pretty fast.

Running a simple query such as

(org-ql-search (org-ql-search-directories-files
                   :directories (list (getenv "nightNotes"))
                   :recurse t)
     '(tags "great"))

That's not a simple query... You cannot even search such query using grep. (tags "great") is trying to match inherited tags. No doubt it is not as fast a simple regexp search via grep. If you want a real comparison, try the following:

(org-ql-query :from (org-ql-search-directories-files
                       :directories (list (getenv "nightNotes"))
                       :recurse t)
              :where '(tags-local "great")
              :select '(point))

Similar command should also work fast on scraped content. Regexp search is fast in Elisp (it is implemented in C after all).

If you want faster complex queries, regexps will simply not cut it. We must operate on specialised data structures/databases. I have made some progress on it using extended org-element-cache [1] and org-ql [2]. At least, inherited tag searches can work much faster now. More to come in future.

[1] https://github.com/yantar92/org

[2] https://github.com/yantar92/org-ql

root level=x A *level=x+y C

level=x B *level=x+y+3 D

If you want to match C/D:

(and (level x+y) C (ancestors '(and (level x) A)))

(and (level x+y+3) D (ancestors '(and (level x) B)))

It seems to me that the complexity in handling such a query is not too inferior to a regex engine in general. Perhaps a Prolog-style implementation?

Note that org-ql is not implemented purely using regexp. More complex predicates use arbitrary lookups utilising Org API - we can refer to parents, children, siblings of any given heading when checking for match.

Do you know of any (high-level) references on how they are implemented? Is there some technical jargon for this style of queries that I can search in Google Scholar?

https://stackoverflow.com/questions/4048151/what-are-the-options-for-storing-hierarchical-data-in-a-relational-database

NightMachinery commented 3 years ago

(and (level x+y) C (ancestors '(and (level x) A)))

(and (level x+y+3) D (ancestors '(and (level x) B)))

@yantar92 Note that x and y are not instantiated variables; They are just the same value in the whole query. (I.e., we do not know or care what x is, it can be any value.)

yantar92 commented 3 years ago

elisp sucks. I recently adapted an elisp script to scrape Hacker News to org-mode files, and it is so slow it takes around 20 minutes to download a single story. (It was not a concurrent implementation, as elisp concurrency is rather arcane.)

This is slightly offtopic, but...

You are probably doing something strange. I am using a web scraper library written purely in Elisp [1]. The library does not just download pages, but also parses them for useful metadata. I am even able to scrape all videos from a YouTube channel. The speeed is comparable with what I have seen in Python scrapers.

[1] https://github.com/yantar92/org-capture-ref/

yantar92 commented 3 years ago

Note that x and y are not instantiated variables; They are just the same value in the whole query. (I.e., we do not know or care what x is, it can be any value.)

Then the first one is even easier

(and C (ancestors 'A))

The second is more tricky and may require a custom parentN predicate

(and D (parent '(parent '(parent 'B))))

or

(and D (parentN 3 'B))

NightMachinery commented 3 years ago

Note that x and y are not instantiated variables; They are just the same value in the whole query. (I.e., we do not know or care what x is, it can be any value.)

Then the first one is even easier
(and C (ancestors 'A))
The second is more tricky and may require a custom parentN predicate
(and D (parent '(parent '(parent 'B))))
or

(and D (parentN 3 'B))

'x' should be the same (unspecified) value in the whole query. (I.e., A and B should be siblings.)

yantar92 commented 3 years ago

'x' should be the same (unspecified) value in the whole query. (I.e., A and B should be siblings.)

I see. Then, you will need the new sibling predicate. Query may (depending how to implement the predicate) look like:

(and (parent '(parent '(parent '(and B (sibling A)))))
     (or (and C (sibling 'D))
      (and D (sibling 'C))))

yantar92 commented 3 years ago

P.S. Actually, your query will match nothing. Sibling nodes cannot have both A and B as parents.

yantar92 commented 3 years ago

Ok... I missed that A and B should be siblings, not C and D. My SQL fu is not good enough to construct such a complex request within a simple query language. You need a custom predicate utilising Elisp: (1) Match C; (2) According to list of matching C headings, construct a new query for D.

alphapapa commented 3 years ago

@yantar92 Thanks for taking the initiative to respond to this issue. :)

@NightMachinary As you've noticed, it will take Emacs a long time to open 3,000 Org files. Most of that time is spent initializing org-mode in a buffer. There's another issue/PR here that explores deferring some mode hooks, which could improve that. But there's only so much that can be done.

Once a file is opened in a buffer, it can be searched quickly. For example, there's a user of org-rifle who reported that he uses tens of thousands of Org files as a kind of database for a museum that he works at, and he uses org-rifle to search it, and once all of the files are open in Emacs, it searches quickly.

If you don't want to have that many Org files open in Emacs--I wouldn't--then I'd suggest the old, "Doctor, it hurts when I..." approach. Org generally works better with fewer, smaller files, and org-ql and org-rifle are designed to treat headings as entries to be returned as search results, not files. I recommend thinking of an Org file as a notebook, not as a single sheet of paper (imagine how impractical it would be to search through 3,000 loose sheets of paper in a drawer).

For more information, please refer to the years of discussions on this topic in places like r/emacs and r/orgmode, several of which I've posted in.

NightMachinery commented 3 years ago

@alphapapa (Feel free to close the issue.)

My goal in opening it was not to discuss org-ql, I am just looking for prior art on implementing a hierarchical query system. I am already quite keen on writing a personal information retrieval system.

Basically, if you wanted to rewrite org-ql in another language, which language, libraries, etc would you use? How will you index the data (what data structure?)? What will your query language look like?

alphapapa commented 3 years ago

I don't know. My having written org-ql doesn't mean that I'm an expert on these things. I'd probably use a Lisp, because that's generally my preference. I'd probably use SQLite for the backend, as long as it worked reasonably well, because it's a proven, reliable system. Other than that, I can't offer any advice, beyond studying applicable literature and examples.

If you're after discussion, you'd probably get better responses on r/emacs or r/orgmode.

alphapapa / org-ql

[DISCUSSION] What prior research is there on hierarchical document models (ala org-mode's headings) in information retrieval? #257