Improve citation matching

yiufung commented 3 years ago

Hi @thomp, how's it going? It's been a while, and I hope all is well with you.

I've been thinking about improving citation matching. Since this may involve changes to your code, I'd like to discuss it first.

The idea is to write functions that recognize citation formats such as "Matthew 1:10", "Luke 1:10-20", "Genesis 3", etc. To do this I read up on regular expressions in Emacs. The main result is a regexp called dtk-citation-regexp that detects book, chapter, verse, and verse range. You can test it with the following (s-match-strings-all comes from s.el):

(s-match-strings-all
 dtk-citation-regexp
 "Testing citation between Matthew 2:45-56 or Genesis 23:34 or Luke 3.")

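As a rough illustration of the idea only (the actual dtk-citation-regexp in the branch may differ), a regexp of this kind could be assembled from a list of book names plus chapter and optional verse parts. The names my/citation-regexp and my/citation-matches are hypothetical, and the book list is deliberately tiny:

```elisp
;; Hypothetical sketch; not the dtk-citation-regexp from the branch.
(defvar my/citation-regexp
  (concat "\\(" (regexp-opt '("Genesis" "Matthew" "Luke")) "\\)" ; group 1: book
          " \\([0-9]+\\)"                                        ; group 2: chapter
          ;; optional ":verse" (group 3) with optional "-end" (group 4)
          "\\(?::\\([0-9]+\\)\\(?:-\\([0-9]+\\)\\)?\\)?")
  "Illustrative regexp matching BOOK CHAPTER[:VERSE[-VERSE]].")

(defun my/citation-matches (string)
  "Return a list of (BOOK CHAPTER VERSE END-VERSE) for each match in STRING.
Parts that are absent from a citation are nil."
  (let ((start 0) results)
    (while (string-match my/citation-regexp string start)
      (push (list (match-string 1 string)
                  (match-string 2 string)
                  (match-string 3 string)
                  (match-string 4 string))
            results)
      ;; Every match is non-empty, so this always advances.
      (setq start (match-end 0)))
    (nreverse results)))
```

Calling (my/citation-matches "Matthew 2:45-56 or Luke 3.") on this sketch yields one entry per citation, with nil for the parts that are missing.
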
Based on it, I rewrote dtk-parse-citation-at-point. While there are still some edge cases, it works pretty well over my test buffer:

Randomtest (Genesis 1:1;)
Matthew 2:23-24
Not-Exist 3:234
Genesis 1:3
Something bewtewewn ssfs Matthew 3:23-23

The main benefit is that the user can put point anywhere within the citation and it should work as expected. There is still one restriction: a citation cannot span two lines. But I suppose that's not a common case.
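For illustration, one way to get the "point anywhere within the citation" behaviour is to scan the current line for matches and keep the one whose bounds contain point. This is only a sketch under that assumption, not the rewritten dtk-parse-citation-at-point; my/citation-at-point is a hypothetical name, and it inherits the same single-line restriction:

```elisp
;; Hypothetical sketch; REGEXP must not match the empty string,
;; otherwise the search loop would not advance.
(defun my/citation-at-point (regexp)
  "Return the match of REGEXP on the current line that contains point, or nil."
  (let ((pos (point))
        found)
    (save-excursion
      (beginning-of-line)
      ;; Walk over every match on this line until one spans POS.
      (while (and (not found)
                  (re-search-forward regexp (line-end-position) t))
        (when (<= (match-beginning 0) pos (match-end 0))
          (setq found (match-string-no-properties 0)))))
    found))
```

Because the search is bounded by line-end-position, a citation broken across two lines is never found.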

Your input will be appreciated.

yiufung commented 3 years ago

Another feature I'd like to add in this branch is support for citations that use abbreviations. For example, when the user types "Lev. 1", it should be translated into "Leviticus 1" for further processing.

I think the implementation can be an association list that maps each abbreviation (the key) to the full book name (the value). I plan to follow the common abbreviations used by Logos and Nashotah.edu as a start. Users may add more definitions themselves. As a result, dtk-citation-regexp will incorporate these abbreviations as well. I think this will be useful for functions such as dtk-follow.
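A minimal sketch of that association list, assuming a handful of example abbreviations (the entries, and the names my/book-abbrevs and my/expand-book, are hypothetical and not taken from Logos or Nashotah.edu):

```elisp
;; Hypothetical sketch of an abbreviation table; entries are examples only.
(defvar my/book-abbrevs
  '(("Gen"  . "Genesis")
    ("Lev"  . "Leviticus")
    ("Matt" . "Matthew")
    ("Jn"   . "John"))
  "Alist mapping book abbreviations (without trailing dot) to full names.")

(defun my/expand-book (name)
  "Return the full book name for NAME, or NAME itself if it is not an abbreviation.
A single trailing dot is ignored and lookup is case-insensitive."
  (let ((key (replace-regexp-in-string "\\.\\'" "" name)))
    (or (cdr (assoc-string key my/book-abbrevs t))
        name)))

(defun my/book-regexp (full-names)
  "Return a regexp matching FULL-NAMES and every known abbreviation."
  (regexp-opt (append full-names (mapcar #'car my/book-abbrevs))))
```

Users could extend the table with something like (push '("Ex" . "Exodus") my/book-abbrevs), and a citation regexp built from my/book-regexp would then pick up the new abbreviation as well.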

thomp commented 3 years ago

The current implementation definitely has some weaknesses. I'm entirely supportive of moving toward handling both of the variants on full citations that you identify (abbreviated book names and verse sets/ranges). I see at least two reasons to do so:

- a number of other diatheke modules lead to diatheke output with citations which are not full (both abbreviations and verse sets)
- apart from the above, your proposal would better lend itself to the use case where the citation source format is arbitrary (e.g., invoking dtk-follow in any buffer which contains a citation, in some form)

With respect to the situation where a citation crosses a line break, we can cross that bridge when we come to it. At this point, I don't believe I've encountered that as a need. Let's put it on the back burner.

I wonder whether it would be best to hold off on merging this branch until we clearly establish:

- a consistent specification for representing verse sets and ranges
  - "I John 3:16,18,20" as ("I John" 3 (16 18 20))?
  - "I John 3:16-20" as ("I John" 3 (16 17 18 19 20))? ("I John" 3 16 20)?
- a mechanism for parsing verse sets and ranges
- a mechanism for handling abbreviations (which you allude to in your second note)
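To make the first option concrete, here is a sketch of a parser that expands a verse specification into an explicit list of verse numbers. It assumes the expanded-list representation purely for illustration; my/parse-verse-spec is a hypothetical name and nothing here is settled:

```elisp
;; Hypothetical sketch: expands sets and ranges into a flat list of integers.
(defun my/parse-verse-spec (spec)
  "Expand SPEC such as \"16,18,20\" or \"16-20\" into a list of verse numbers."
  (let (verses)
    ;; Split on commas, dropping empty parts and surrounding whitespace.
    (dolist (part (split-string spec "," t "[ \t]*"))
      (if (string-match "\\`\\([0-9]+\\)-\\([0-9]+\\)\\'" part)
          ;; A range "A-B" becomes A, A+1, ..., B.
          (setq verses
                (append verses
                        (number-sequence
                         (string-to-number (match-string 1 part))
                         (string-to-number (match-string 2 part)))))
        ;; Anything else is treated as a single verse number.
        (setq verses (append verses (list (string-to-number part))))))
    verses))
```

With this sketch, "I John 3:16,18,20" would become ("I John" 3 (16 18 20)) and "I John 3:16-20" would become ("I John" 3 (16 17 18 19 20)). The alternative ("I John" 3 16 20) form would instead keep only the range endpoints.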

Thanks for investing time in this! Please let me know your thoughts regarding the above.

thomp commented 3 years ago

If a substantial amount of time is invested, it might be desirable to put together some testing on this. I haven't given much thought to any sort of testing for this project. Maybe go with ert? https://gist.github.com/thomp/e68d81d4319426bc3015d87dbaf5a442
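As a sketch of what an ert test could look like here: the parser below is a hypothetical stand-in written only so the test has something to exercise; it is not dtk's actual code and it ignores verse ranges entirely.

```elisp
(require 'ert)

;; Hypothetical stand-in for the function under test.
(defun my/parse-citation (string)
  "Parse STRING like \"Luke 3:10\" into (BOOK CHAPTER VERSE); VERSE may be nil.
Return nil if STRING is not a citation of that shape."
  (when (string-match
         "\\`\\([A-Za-z ]+?\\) \\([0-9]+\\)\\(?::\\([0-9]+\\)\\)?\\'"
         string)
    (list (match-string 1 string)
          (string-to-number (match-string 2 string))
          (let ((verse (match-string 3 string)))
            (and verse (string-to-number verse))))))

(ert-deftest my/parse-citation-test ()
  "Check a full citation, a chapter-only citation, and a non-citation."
  (should (equal (my/parse-citation "Luke 3:10") '("Luke" 3 10)))
  (should (equal (my/parse-citation "Genesis 3") '("Genesis" 3 nil)))
  (should-not (my/parse-citation "not a citation")))
```

Tests like this can be run interactively with M-x ert, or in batch mode by loading the file and calling ert-run-tests-batch-and-exit.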

yiufung commented 3 years ago

Since the parsed citation is handed over to diatheke for processing, I ran a quick test.

% diatheke -b KJV -o fmnx -k Jn 3:16,17,18,19,20,21 > sets
% diatheke -b KJV -o fmnx -k Jn 3:16-21 > ranges
% diff sets ranges # no output, meaning no difference to diatheke between sets and ranges
% diatheke -b KJV -o fmnx -k Jn 3:16-21,23-24 > additionals
% diff ranges additionals # means we can mix notations together
6a7,8
> John 3:23: <milestone marker="¶" type="x-p"/><w lemma="strong:G3588" morph="robinson:T-GSM" savlm="strong:G3588 lemma.TR:του" src="9"/><w lemma="strong:G1161" morph="robinson:CONJ" savlm="strong:G1161 lemma.TR:δε" src="2">And</w> <w lemma="strong:G2491" morph="robinson:N-NSM" savlm="strong:G2491 lemma.TR:ιωαννης" src="4">John</w> <w lemma="strong:G2532" morph="robinson:CONJ" savlm="strong:G2532 lemma.TR:και" src="3">also</w> <w lemma="strong:G2258" morph="robinson:V-IXI-3S" savlm="strong:G2258 lemma.TR:ην" src="1">was</w> <w lemma="strong:G907" morph="robinson:V-PAP-NSM" savlm="strong:G907 lemma.TR:βαπτιζων" src="5">baptizing</w> <w lemma="strong:G1722" morph="robinson:PREP" savlm="strong:G1722 lemma.TR:εν" src="6">in</w> <w lemma="strong:G137" morph="robinson:N-PRI" savlm="strong:G137 lemma.TR:αινων" src="7">Ænon</w> <w lemma="strong:G1451" morph="robinson:ADV" savlm="strong:G1451 lemma.TR:εγγυς" src="8">near</w> <w lemma="strong:G4530" morph="robinson:N-PRI" savlm="strong:G4530 lemma.TR:σαλειμ" src="10">to Salim</w>, <w lemma="strong:G3754" morph="robinson:CONJ" savlm="strong:G3754 lemma.TR:οτι" src="11">because</w> <w lemma="strong:G2258" morph="robinson:V-IXI-3S" savlm="strong:G2258 lemma.TR:ην" src="14">there was</w> <w lemma="strong:G4183" morph="robinson:A-NPN" savlm="strong:G4183 lemma.TR:πολλα" src="13">much</w> <w lemma="strong:G5204" morph="robinson:N-NPN" savlm="strong:G5204 lemma.TR:υδατα" src="12">water</w> <w lemma="strong:G1563" morph="robinson:ADV" savlm="strong:G1563 lemma.TR:εκει" src="15">there</w>: <w lemma="strong:G2532" morph="robinson:CONJ" savlm="strong:G2532 lemma.TR:και" src="16">and</w> <w lemma="strong:G3854" morph="robinson:V-IDI-3P" savlm="strong:G3854 lemma.TR:παρεγινοντο" src="17">they came</w>, <w lemma="strong:G2532" morph="robinson:CONJ" savlm="strong:G2532 lemma.TR:και" src="18">and</w> <w lemma="strong:G907" morph="robinson:V-IPI-3P" savlm="strong:G907 lemma.TR:εβαπτιζοντο" src="19">were baptized</w>.
> John 3:24: <w lemma="strong:G1063" morph="robinson:CONJ" savlm="strong:G1063 lemma.TR:γαρ" src="2">For</w> <w lemma="strong:G3588 strong:G2491" morph="robinson:T-NSM robinson:N-NSM" savlm="strong:G3588 strong:G2491 lemma.TR:ο lemma.TR:ιωαννης" src="8 9">John</w> <w lemma="strong:G2258" morph="robinson:V-IXI-3S" savlm="strong:G2258 lemma.TR:ην" src="3">was</w> <w lemma="strong:G3768" morph="robinson:ADV" savlm="strong:G3768 lemma.TR:ουπω" src="1">not yet</w> <w lemma="strong:G906" morph="robinson:V-RPP-NSM" savlm="strong:G906 lemma.TR:βεβλημενος" src="4">cast</w> <w lemma="strong:G1519" morph="robinson:PREP" savlm="strong:G1519 lemma.TR:εις" src="5">into</w> <w lemma="strong:G3588 strong:G5438" morph="robinson:T-ASF robinson:N-ASF" savlm="strong:G3588 strong:G5438 lemma.TR:την lemma.TR:φυλακην" src="6 7">prison</w>.

So the structure would look like (Book, Chapter, Verse), where the Verse part:

- is optional. When it's not provided, the full chapter should be returned (Luke 3).
- can be:
  - an individual verse (Luke 3:1);
  - a verse set (Luke 3:1,2,3, separated by commas);
  - a verse range (Luke 3:10-12, separated by a dash);
  - or a mix of the above, separated by commas.
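The Verse grammar above can be sketched as a regexp: one item is a verse or a range, and a Verse part is one or more items joined by commas. The names my/verse-part-regexp and my/valid-verse-part-p are hypothetical, and this sketch assumes no whitespace inside the Verse part:

```elisp
;; Hypothetical sketch of a regexp for the Verse part only.
(defconst my/verse-part-regexp
  (let ((item "[0-9]+\\(?:-[0-9]+\\)?"))   ; a single verse or a range
    (concat item "\\(?:," item "\\)*"))    ; further items after commas
  "Illustrative regexp for a verse, verse set, verse range, or a mix.")

(defun my/valid-verse-part-p (string)
  "Return t if STRING as a whole is a well-formed Verse part."
  (and (string-match-p (concat "\\`" my/verse-part-regexp "\\'") string)
       t))
```

Appending "\\(?::" my/verse-part-regexp "\\)?" after the chapter part of a citation regexp would then make the Verse part optional, which covers the whole-chapter case.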

I think this is properly handled by diatheke already. We only need to improve the regular expression to identify the Verse part properly. I will test and push more commits in the coming days.

Adding a test suite would definitely help. I also think adding an elisp formatter/linter would help us maintain the code. I will keep an eye on these topics and update later.

thomp commented 3 years ago

It sounds like the representation of sets and ranges in the Verse part needs little further work if the only consumer is diatheke. As you note, that just leaves developing a regex to handle the different cases.

Agreed that elisp formatting needs consistency. Is it time to untabify everything? Maybe formatting/linting is better as a separate issue.

Looking forward to upcoming commits. Thanks again.

dtk01 / dtk

Improve citation matching #13