SEARCH on DHARMA-base: preparatory notes

I am pasting here some useful thoughts and links to code in preparation for real action to make our TEI data searchable.

1. Email @danbalogh to former Beyond Boundaries colleagues, 2020-02-13

I am writing to you because I, or rather my colleagues with better technical expertise, would like to pick up again the topic of searching Sanskrit (and other S and SE Asian language) texts. Our corpus will be marked up in TEI (EpiDoc) and will be Romanised, mostly according to ISO-15919 but with some quirks on top of that, including a few extra characters and case sensitivity. We will not be lemmatising the corpus at short notice, nor is it likely that we'll add tags, though we may do so for part of the corpus later on. So we're interested in extracting transliterated text from TEI XML (sometimes including alternative strings in <choice>) and searching it as fruitfully as possible. Gethin, do you have the search algorithm that somebody at the BL developed for the first incarnation of Siddham? That seemed rather good to me at a certain time, but it does not seem to have survived into Michael's facelifted Siddham. Élie and/or Hélios, how about "lenient search" for BDRC? If you can give us any pointers, please let me know along with Arlo, one of our PIs who is copied in on this message. If any of your code, even half-baked, is publicly available, we'll welcome links; if you have non-public code or documentation that you are willing to share, then likewise.

2. Email @danbalogh to Indic-Texts TEI Special Interest Group, 2020-02-13

We in the DHARMA project will be looking into making our epigraphic corpus searchable. Our texts in Sanskrit and other S and SE Asian languages will be marked up in TEI (EpiDoc) and will be Romanised, mostly according to ISO-15919 but with some quirks on top of that, including a few extra characters and case sensitivity. We will not be lemmatising the corpus at short notice, nor is it likely that we'll add tags, though we may do so for part of the corpus later on. We're interested in extracting transliterated text from TEI XML (sometimes including alternative strings in <choice>) and searching it as fruitfully as possible. Ideally, we should have two search methods, a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes for the specifics of this), and a strict one to return the exact string searched for. I myself do not have the level of technical preparedness even to understand our options, and will be passing any suggestions on to people with the necessary expertise. But to get started, I would welcome some basic suggestions and pointers: any already working open source specialised code we should check out? Any general search solutions that may be adapted to our purposes?

Responses to 1 from Rees, Gethin <Gethin.Rees@bl.uk>, 2020-02-13

Yes, Michael is very keen on using Google in the new Siddham. He made a very persuasive argument to me on this point. Although I disagreed with him, I couldn't seem to articulate my thoughts sufficiently to challenge him! This was a shame as the search was one of the few better points about the old Siddham. I attach the code here. URL SNIPPED. Code is closed but was intended to be made open. I understand that Beyond Boundaries own it. Do not share the code as a whole beyond your project however as I think there might be credentials in there that would be removed before open sourcing it. Individual pieces can be incorporated into a new code base and made open source. Developers were Stephen Stose and James Alexander so I would mention them in your readme.

Response to 1 from Élie Roux <elie.roux@telecom-bretagne.eu>

Regarding "lenient search" for BDRC, I'm quite happy with the current implementation, it's all on but the code is really made to be plugged into Lucene 7. I suppose it can be an inspiration for non-Lucene platforms but it would require some work. We have some partners who are using eXide and converted some aspects of the analyzer to Lucene 4 (which is the version eXide is unfortunately stuck on), it's in the "lucene4_port" branch of the git repo. The code should be documented enough, but I can go into more detail about what we implemented in Lenient mode. Also, we gave up the automatic lemmatization and now the Lucene tokens are the syllables. It has no support for the <choice> thing (which should probably be handled at the XML database level anyways), although there's an obvious way to encode that in Lucene streams (that I don't remember now but can find).

Responses to 2 from Andrew Ollett <andrew.ollett@gmail.com>, 2020-02-13

It seems to me that there are two options:

1. Creating a plain-text corpus from your TEI corpus (by means of XSL transformations), and searching this corpus the way that you would search a directory of texts on a local computer (e.g., grep). This is not a very "high-tech" solution, but I think this is how most of us search the GRETIL archive, and it's very straightforward. To get "fuzzy" results you would have to probably write custom shell scripts that replace a given search term (e.g., "dharma") with one that is modified to be "fuzzy" (e.g., "dha(r)*((ṁm)|(m)+)a" or whatever).
2. A web application with a search interface. Good luck with this. It would probably involve Apache Lucene. The source code for the SARIT web application, designed for a transitional version of eXistDB (between 3 and 4), is available on GitHub.
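
The term-rewriting idea in option 1 could be sketched roughly as follows, in Python rather than as a shell script. The rule table and the names `lenient_pattern` and `grep_lenient` are hypothetical and only illustrate the mechanism; real leniency rules would have to come from the project's own notes on epigraphic spelling.

```python
import re
import unicodedata

# Hypothetical leniency rules: a literal piece of the query on the left,
# a regex fragment that also matches its spelling variants on the right.
LENIENT_RULES = {
    "rm": "r(?:\u1e41m|m+)",  # dharma, dharmma, dharṁma
    "b": "[bv]",              # b/v interchange
    "v": "[bv]",
}


def lenient_pattern(term: str) -> re.Pattern:
    """Turn a plain search term into a compiled 'fuzzy' regular expression."""
    term = unicodedata.normalize("NFC", term)
    keys = sorted(LENIENT_RULES, key=len, reverse=True)  # longest rule first
    parts = []
    i = 0
    while i < len(term):
        for key in keys:
            if term.startswith(key, i):
                parts.append(LENIENT_RULES[key])
                i += len(key)
                break
        else:
            # No rule applies here: match this character literally.
            parts.append(re.escape(term[i]))
            i += 1
    return re.compile("".join(parts))


def grep_lenient(term, lines):
    """Return the lines that contain a lenient match for the term."""
    pattern = lenient_pattern(term)
    return [ln for ln in lines if pattern.search(unicodedata.normalize("NFC", ln))]
```

Normalising both the query and the corpus lines to NFC matters here, because the same diacritic can otherwise be stored as one code point or as a base letter plus a combining mark.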
Peter Scharf's response meant that for searching purposes, it is advisable to convert the ISO-15919 texts to something like the SLP-1 encoding, which does not use digraphs to represent single phonemes (thus we shouldn't get "dharmya" when we search for "harmya," or "aitara" when we search for "itara," etc.). This is a good suggestion, but SLP-1 doesn't include representations for sounds found in languages other than Sanskrit (e.g., short e and o, alveolar consonants in Tamil, retroflex approximants in the Dravidian languages, the Indonesian pepet, etc.) which are part of the DHARMA project.
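
One way around the coverage limits of SLP-1 might be a project-specific mapping that collapses only the digraphs and leaves every other character untouched. The sketch below is an assumption-laden illustration, not an existing scheme: the digraph list is partial, the Private Use Area code points are arbitrary, and the function names are invented. It also ignores genuine letter sequences that merely look like digraphs (e.g. a real "a" followed by "i").

```python
import re
import unicodedata

# Partial, illustrative list of ISO-15919 digraphs (aspirates and diphthongs).
# \u1e6d is t with dot below, \u1e0d is d with dot below.
DIGRAPHS = ["kh", "gh", "ch", "jh", "\u1e6dh", "\u1e0dh",
            "th", "dh", "ph", "bh", "ai", "au"]

# Give each digraph one arbitrary Private Use Area code point.
TO_SINGLE = {d: chr(0xE000 + n) for n, d in enumerate(DIGRAPHS)}
_DIGRAPH_RE = re.compile("|".join(re.escape(d) for d in DIGRAPHS))


def collapse_digraphs(text):
    """Replace each digraph with its single-character stand-in."""
    text = unicodedata.normalize("NFC", text)
    return _DIGRAPH_RE.sub(lambda m: TO_SINGLE[m.group(0)], text)


def contains(text, query):
    """Substring search that treats digraphs as indivisible units."""
    return collapse_digraphs(query) in collapse_digraphs(text)
```

With this, `contains("dharmya", "harmya")` is false while `contains("dharmya", "dharmya")` stays true, because the corpus text and the query pass through the same mapping before they are compared.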

Responses to 2 from Patrick McAllister <pma@rdorte.org>, 2020-02-13

You’ll have to distinguish quite carefully between the two search methods you want:

we should have two search methods, a lenient one to gather fuzzy results and tolerate e.g. variations in epigraphic spelling without returning too many false positives (at the moment we only have some rudimentary notes for the specifics of this), and a strict one to return the exact string searched for.

It’s unlikely that these two wishes can be fulfilled by the same search engine. “Fuzzy results” is a slightly, ahem, fuzzy term. One important case of fuzzy search is full text search. So, if you search for the terms “X” and “Y” you’d want to find “A X B Y C” as well as “Z Y X W”, “X was in Y”, and perhaps also “X but not Y”. Usually, the utility of full text search depends heavily on how well the indexer can analyze a given language (not script!). In an English language search, it’s standard to get matches on “did” and “done” for a search on “do”. But we don’t really have that kind of thing for Sanskrit yet; Oliver Hellwig’s statistical approaches report success rates in the correct analysis of Sanskrit of around 85%. And if you start mixing languages (not just scripts), then rather weird things might happen if you ask your search engine to index them all in the same way.

To be on the safe side, your project should use the @xml:lang tags with some foresight. You’ll probably want to note both the language and the script (e.g., “sa-Latn”), especially if any inscriptions write different languages in the same script, because that is what your search engine will need to know: which strings are Tamil, which are Sanskrit, and so on. And it’s unclear if you want to search for the same terms across all languages in your collection. Would that be useful?
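
As a rough illustration of why consistent @xml:lang values help, the sketch below sorts the text of a TEI fragment into one bucket per language tag, so that each bucket could later be indexed with its own settings. It uses only Python's standard library; the function name `text_by_language`, the fallback tag "und", and the sample fragment are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def text_by_language(xml_string, default="und"):
    """Group text content by the xml:lang value in scope for each piece."""
    root = ET.fromstring(xml_string)
    buckets = defaultdict(list)

    def walk(elem, inherited):
        # An element without xml:lang inherits the language of its parent.
        lang = elem.get(XML_LANG, inherited)
        if elem.text and elem.text.strip():
            buckets[lang].append(elem.text.strip())
        for child in elem:
            walk(child, lang)
            # Text after a child belongs to this element, not to the child.
            if child.tail and child.tail.strip():
                buckets[lang].append(child.tail.strip())

    walk(root, default)
    return {lang: " ".join(parts) for lang, parts in buckets.items()}


sample = (
    '<div xmlns="http://www.tei-c.org/ns/1.0" xml:lang="sa-Latn">'
    '<p>svasti <foreign xml:lang="ta-Latn">ko</foreign> jayati</p></div>'
)
by_lang = text_by_language(sample)
```

Here `by_lang` maps "sa-Latn" to the Sanskrit words and "ta-Latn" to the embedded foreign word, which is the distinction a language-aware indexer would need.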

Spelling variations are a slightly different kind of problem. If you manage to reduce them to regular expressions, you could probably configure your search engine’s indexing functions to smudge the texts for these differences (e.g., don’t differentiate between “ba” and “va”). It’s also quite common to index the same set of texts in several different ways. E.g., one case-sensitive, one not, one with punctuation, one without, one with corrections for spelling errors, and so on. You will really need to experiment with the settings, and decide what things should be searchable.
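
Indexing the same text in several ways can be pictured as deriving several normalised forms of each passage and storing them side by side. The sketch below is a toy version in plain Python, not a search-engine configuration; the form names and the single b/v "smudging" rule are assumptions chosen for illustration.

```python
import unicodedata


def strict_form(text):
    """Exact text, only normalised to one Unicode representation (NFC)."""
    return unicodedata.normalize("NFC", text)


def caseless_form(text):
    """Case-insensitive variant."""
    return strict_form(text).lower()


def no_punct_form(text):
    """Drop punctuation characters but keep letters and combining marks."""
    kept = "".join(
        ch for ch in strict_form(text)
        if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(kept.split())


def smudged_form(text):
    """Lenient variant: no case, no punctuation, b and v merged."""
    return no_punct_form(text).lower().replace("b", "v")


def index_forms(text):
    """All variants of one passage, keyed by a (hypothetical) index name."""
    return {
        "strict": strict_form(text),
        "caseless": caseless_form(text),
        "no_punct": no_punct_form(text),
        "smudged": smudged_form(text),
    }
```

A query has to be passed through the same function as the index it is run against; a search for "bala" against the smudged index, for instance, would itself be smudged to "vala" first.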

In any case, all the full text searches work in broadly the same way: you need to get the text out of the TEI encoding and into a text-only format that is both simple enough so that a search engine can index it properly yet rich enough to satisfy your queries. The trick is really to find the right balance. Then you configure the search engine to either ignore or augment certain things in your texts.

Full text search engines usually have only very minimal understanding of structural features: many can deal with simple HTML elements (e.g., rate a match in a heading higher than a match in a normal paragraph), but I’ve found that to be rather useless for what we were doing in SARIT (most texts not having headings you’d want to search, having been added by the editor---again, that’s a decision you make; for other researchers, that might be very interesting). You should remove all kinds of notes and other interferences (gaps, line numbers, etc.) that might detract from the text. Of course, there’s no one right way to do this. You’ll have to experiment quite a bit with getting the right extract of the text that is buried in your TEI markup. In SARIT, we tried to remove everything except lg-s (metric text, mostly), and paragraphs.
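
A minimal sketch of such an extraction step, using Python's standard library instead of XSLT, might look like this. Which elements to drop and how to resolve <choice> are project decisions; the sets below (notes, gaps, line and page breaks; sic/orig versus corr/reg) and the function names are assumptions for illustration and do not cover every EpiDoc construct.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical selection of elements whose content never reaches the index.
ALWAYS_SKIP = {TEI + "note", TEI + "gap", TEI + "lb", TEI + "pb"}
# Inside <choice>, drop one side depending on the reading wanted.
ORIGINAL_SIDE = {TEI + "sic", TEI + "orig"}
EDITED_SIDE = {TEI + "corr", TEI + "reg"}


def _collect(elem, skip):
    """Concatenate the text of elem, leaving out skipped elements."""
    if elem.tag in skip:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(_collect(child, skip))
        # Tail text follows the child but belongs to elem, so always keep it.
        parts.append(child.tail or "")
    return "".join(parts)


def plain_text(xml_string, reading="edited"):
    """Extract searchable text; reading is 'edited' or 'original'."""
    drop = ORIGINAL_SIDE if reading == "edited" else EDITED_SIDE
    root = ET.fromstring(xml_string)
    return " ".join(_collect(root, ALWAYS_SKIP | drop).split())


sample = (
    '<div xmlns="http://www.tei-c.org/ns/1.0"><p><lb n="1"/>siddham '
    "<choice><sic>dharmma</sic><corr>dharma</corr></choice>"
    '<note>editorial remark</note> <gap reason="lost"/> jayati</p></div>'
)
edited = plain_text(sample, "edited")
original = plain_text(sample, "original")
```

Running both readings over the same file gives two plain-text versions, one with the editor's corrections and one with the spelling as found on the stone, which could then be indexed separately.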

I suppose you also have quite a bit of “hard” data about your inscriptions: size, location, etc. It would be great to add this in a search interface, and it’s usually easy to add fields to the index storing these kinds of things.

The other search you mention is strict search. As Andrew wrote, that’s much easier in terms of technology. You just extract the full text (perhaps thinking a bit again about notes, labels, etc.), and then use grep or something (make sure to take care of linebreaks properly, many greps search only single lines).
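
The linebreak caveat can be handled without grep by letting any run of whitespace in the text, including newlines, stand in for a space in the query. The following is a small sketch of that idea; the function name `find_exact` is made up for the example, and the match is otherwise exact and case-sensitive.

```python
import re


def find_exact(text, query):
    """Return the start offsets of exact, case-sensitive matches of query.

    Spaces in the query match any run of whitespace in the text, so a
    phrase that is split across two lines is still found.
    """
    words = query.split()
    if not words:
        return []
    pattern = re.compile(r"\s+".join(re.escape(w) for w in words))
    return [m.start() for m in pattern.finditer(text)]
```

For example, `find_exact("svasti sri\nvijaya rajya", "sri vijaya")` finds the phrase even though a line break separates its two words, while a search for "Vijaya" with a capital letter finds nothing.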

Perhaps of interest for your technical staff: you can try a simple search interface for the SARIT texts here: (Andrew’s search of "dharmya" and "harmya" should work there), and you can look at the technical interface (and especially the index configuration) here: This was just a proof of concept for SARIT and isn’t being updated anymore, but the searches still work. Technically, this uses Lucene through something called Elastic (). Elastic is a very nice interface to Lucene, but the license model is weird; it’s now dropped out of several free software distributions, including Debian. You might want to check that. But Lucene is certainly a very solid basis for anyone starting with full text search.
