ambuda-org / ambuda

Main application code for Ambuda, a breakthrough Sanskrit library (ambuda.org)
https://ambuda.org
MIT License

Fuzzy search for dictionaries #379

Open shreevatsa opened 1 year ago

shreevatsa commented 1 year ago

(Mentioned by @suhasm on Discord.)

Many people aren't familiar with transliteration. So it would be useful if they could type "rama" and get results for "rāma".
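One small building block, as a sketch (not a full solution): strip combining diacritics from both the query and the indexed headwords, so that "rama" and "rāma" normalize to the same key. With only the Python standard library:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks so that "rāma" and "rama" compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

assert strip_diacritics("rāma") == "rama"
assert strip_diacritics("saṃskṛta") == "samskrta"
```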

Possible approaches:

akprasad commented 1 year ago

Of these, I think the second option sounds most compelling.

Philosophically, though, my controversial take is that a user having to look up a word manually is a failure, because it means:

camsanders commented 1 year ago

Students won't know whether the citation form of a word uses "n", cerebral "ṇ" (SLP1 "R"), "m", or some other nasal, especially after removing an upasarga that contains one of the "ṛ" vowels or consonants, and so on. What I have seen, as a second-year student of संस्कृत, is that the online tools are all too rigid for beginners. Even after properly unwinding sandhis, different tools fail for different reasons. Consider using regex character sets like [ङ् ञ् ण् न् म् M] (where "M" represents the anusvāra) for any nasal; for speed, of course, tricks such as memoizing expanded-match patterns in a unique-string space would be needed. Pāṇini's rules also allow several optional additions and removals of characters, so here too, mapping each word to a regex-like string that gives a unique pattern recognizing the legal variations might work wonders.
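As a concrete sketch of that character-class idea (hypothetical code, written against Ambuda's internal SLP1 form, in which the nasals ṅ ñ ṇ n m and the anusvāra are spelled N Y R n m M):

```python
import re

# SLP1 spellings of the five class nasals plus the anusvara.
SLP1_NASALS = "NYRnmM"

def nasal_insensitive_pattern(slp1_query: str) -> re.Pattern:
    """Compile a pattern that treats every nasal in the query as any nasal."""
    parts = [
        f"[{SLP1_NASALS}]" if ch in SLP1_NASALS else re.escape(ch)
        for ch in slp1_query
    ]
    return re.compile("".join(parts))

# "saMskftam" (with anusvara) also matches a spelling with the class nasal "n".
assert nasal_insensitive_pattern("saMskftam").match("sanskftam")
```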

Which source modules would one look at to understand core dictionary data and for adding search APIs?

akprasad commented 1 year ago

@camsanders welcome! Thank you for the astute comment.

Which source modules would one look at to understand core dictionary data and for adding search APIs?

I'll first answer for the world we're in now. Then I'll sketch a direction I'm quite excited about.

First, where we are now. We do some level of basic normalization, but we can go much further.

The transliteration format we use internally is SLP1, which is more amenable to computer processing. You can find a definition of the scheme here.
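For illustration, here is what that conversion looks like with the indic_transliteration package (a common Python option for scheme conversion, shown as an example rather than as Ambuda's exact code path):

```python
from indic_transliteration import sanscript

# IAST "rāma" -> SLP1 "rAma"; Devanagari input works the same way.
print(sanscript.transliterate("rāma", sanscript.IAST, sanscript.SLP1))            # rAma
print(sanscript.transliterate("संस्कृतम्", sanscript.DEVANAGARI, sanscript.SLP1))  # saMskftam
```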

Dictionary entries themselves are defined here. We populate these tables with the seed scripts here.

We provide active onboarding support on our Discord channel, and we are currently working on improving our onboarding process as well.

Next, the direction I'm quite excited about:

I completely agree that it's unreasonable to expect students to know the conventions of citation forms. Even now I find myself tripped up remembering, e.g., that the Monier-Williams convention is to use the anusvara before a consonant unless that consonant is प/फ/ब/भ.

The deeper problem, however, extends beyond lemmata: often, it's not even clear what the citation form should look like. I can't imagine the pain a beginner would have in looking up words like श्यति (citation form शो) or संचस्क्रुषी (citation form संस्कृ).

So, what if we could generate all words with variant spellings ahead of time? If we can combine a Paninian word generator with a memory-efficient dictionary, we could look up words much more directly. And if we combine such a system with a fast segmenter, we could even allow arbitrary phrase search.
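A minimal sketch of that ahead-of-time pipeline, with the Paninian generator left as a stub (a real system would also use something more compact than a Python dict, e.g. an FST):

```python
def variant_spellings(lemma: str) -> list[str]:
    """Stub: a Paninian generator would yield every legal surface form."""
    raise NotImplementedError

def build_index(citation_forms: list[str]) -> dict[str, set[str]]:
    """Map each generated surface form back to its citation form(s)."""
    index: dict[str, set[str]] = {}
    for lemma in citation_forms:
        for form in variant_spellings(lemma):
            index.setdefault(form, set()).add(lemma)
    return index

# Lookup then becomes a single probe, e.g. index["Syati"] -> {"So"}.
```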

camsanders commented 1 year ago

A wonderful reply! Thank you for the information. SLP1 - good choice! Thank you for the links. Perhaps we can take this discussion private briefly, and create some Future Vision pages in the wiki? Or start a thread with a more fitting title.

As to the vision: beautiful! That is the vision a friend and I have had as well, although we also want to build a scholar-reviewed, gold-standard, normalized dictionary. We were starting to spec out a new project and decided to survey the landscape to see whether we could build on, or join, existing projects. This project and the Python sanskrit_parser jump out at me. You have several things that we want: crowdsourcing the annotation of documents, etc.

As I go through M. R. Kale's "A Higher Sanskrit Grammar" (2022 edition), there seem to be quite a few optional rules, so when I forward-calculate the full set of permutations (the fan-out of legal realizations), I see more factors of 2 than I expected. I believe there are at least two more zeroes in that data set than my original calculation suggested: with multiple upasargas we are definitely moving well into the billions of entries, and quite likely multiple link-backs to dhatus with use-case specs (padas). (The latter point also hints at more redundant matches of unlikely manifestations than are found in nature.) Of course, using associative lookup techniques with electronic disks, size is still not a barrier. But with the data size going up and the matches becoming redundant, I am starting to question the value of this path. (Though it's not hard to experiment with, and worth exploring!) It still doesn't address the fuzzy-matching problem.
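To make the fan-out concrete (illustrative arithmetic only): with k independent optional rules in a derivation, each form can surface in up to 2^k variants, so every additional optional rule doubles the count:

```python
# Illustrative only: k independent optional rules -> up to 2**k variants.
for k in (10, 20, 30):
    print(f"{k} optional rules -> up to {2**k:,} variants per derivation")
# 10 optional rules -> up to 1,024 variants per derivation
# 20 optional rules -> up to 1,048,576 variants per derivation
# 30 optional rules -> up to 1,073,741,824 variants per derivation
```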

In the short term, an adaptive system could be built. (The following is mostly about pada lookup, but some of it applies to shlokas too.) The fast extended "use index" should be built from actual usage, whether that usage comes from a user typing in a string or from automatically parsed documents. When there is a hit, return it (and/or the list of options, which contains use-case info for each option). When there is a miss on a word lookup, parse it, and if there is a hit with a pada, add this variation to the use index. If nothing is found, parse again after adjusting for potential errors made during sandhi-reversal, and/or go to extended automated pattern matching. Try really hard to give the user something. A sketch of this loop follows.
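Here is that loop as a sketch; every helper is a hypothetical stub (a real parse_word might call something like sanskrit_parser):

```python
def parse_word(word: str) -> list[str]:
    """Stub: a real version would call a full parser/segmenter."""
    return []

def sandhi_adjustments(word: str) -> list[str]:
    """Stub: respellings that undo likely sandhi-reversal errors."""
    return []

def fuzzy_fallback(word: str) -> list[str]:
    """Stub: extended automated pattern matching as a last resort."""
    return []

def lookup(word: str, use_index: dict[str, list[str]]) -> list[str]:
    """Adaptive lookup: serve hits from the use index, learn from misses."""
    if word in use_index:
        return use_index[word]
    for candidate in [word, *sandhi_adjustments(word)]:
        if results := parse_word(candidate):
            use_index[word] = results  # remember this variation for next time
            return results
    return fuzzy_fallback(word)  # try really hard to return *something*
```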

Anything not found with some kind of perfect match should be put in a list for scholar review; anything found only via the above search variations should be included in that list too. Maybe we have the user confirm they are human and type in a phrase indicating context, e.g. "asthanga hrdayam" or "yoga student". The scholar-review list would be sorted by the most frequently searched missed items, and the scholars could then approve appropriate matches for the real-world use index.

E.g., beginners should be able to type, intentionally without Roman diacritical notation, these kinds of entries: sanskrit, samskrit, sanskrut, sanskrta, etc.! These (and many more) common beginner errors should be added to the use index.
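Even the standard library gets surprisingly far on such queries, as a sketch (here matching against the diacritic-free keys of a toy headword list; a production version would rank against the real dictionary):

```python
import difflib

# Toy stand-in for the real headword list, keyed by diacritic-free spellings.
headwords = {"samskrta": "saṃskṛta", "rama": "rāma", "yoga": "yoga"}

def beginner_lookup(query: str) -> list[str]:
    """Match sloppy ASCII input against diacritic-free headword keys."""
    keys = difflib.get_close_matches(query.lower(), headwords, n=3, cutoff=0.6)
    return [headwords[k] for k in keys]

for q in ("sanskrit", "samskrit", "sanskrut", "sanskrta"):
    print(q, "->", beginner_lookup(q))  # each one finds "saṃskṛta"
```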

Interest in Ayurveda and the Vedas is growing in the USA, and I would like to make learning and understanding easier for these learners. So it would also be nice for the user to see maps from words back to the various source docs. Monier-Williams has many references, but on the sites I use, they are not traceable links.

camsanders commented 1 year ago

One more point that is obvious to the experienced developer: all of what I suggested can be controlled by user preferences. It would be nice to have a user profile (or even browser-session options) for the extent of searching, etc. Continuing, it would be nice to have content-access authorization for commentaries on various texts with shloka-level identification, and ideally a means for content providers to keep that content maintained on their own servers. In my ideal world, we define a REST API for basic dictionary lookup and parsing, and then one for commentaries, which third parties can provide. Perhaps a subscription system could pay for some of the work and/or the cost of keeping servers online.
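To make the API shape concrete, a purely hypothetical sketch (route names and parameters invented for illustration; Flask is used here only because Ambuda itself is a Flask app):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.get("/api/dictionaries/<slug>/<key>")
def dictionary_entry(slug: str, key: str):
    # A per-request "fuzzy" flag standing in for a stored user preference.
    fuzzy = request.args.get("fuzzy", "false") == "true"
    return jsonify({"dictionary": slug, "key": key, "fuzzy": fuzzy, "entries": []})

@app.get("/api/commentaries/<text>/<section>/<verse>")
def commentary(text: str, section: str, verse: str):
    # Shloka-level identification; could proxy to a third-party provider's server.
    return jsonify({"text": text, "section": section, "verse": verse, "commentaries": []})
```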

akprasad commented 1 year ago

Thank you for the links. Perhaps we can take this discussion private briefly, and create some Future Vision pages in the wiki? Or start a thread with a more fitting title.

For wider circulation, let's use the ambuda-discuss mailing list:

https://groups.google.com/g/ambuda-discuss

If you start a thread there, we can use it to continue our discussion.