funderburkjim / testing

For testing various features of github. Nothing important here.

review of sanskrit-lexicon corrections:faultfinder #6

Open funderburkjim opened 9 years ago

funderburkjim commented 9 years ago

Review of corrections to sanskrit-lexicon

corrections based on faultfinder

Based on a review of the history.txt file, use of Dhaval's faultfinder approach began in October 2014. It was initially used to find spelling errors in headwords of MW. The headword spellings in the Cologne digitization of the Monier-Williams Sanskrit-English dictionary now seem to be largely clean. Thus, the use of faultfinder since October 2014 has taken the MW headword spellings as a reference against which headwords from other Sanskrit dictionaries may be compared. Here is the list of dictionaries for which faultfinder-generated potential headword spelling errors have been examined:

 PW, PWG, VCP, SKD, CAE, SHS, SCH, WIL, YAT, MD, MW72, GST, BHS, GRA, BUR,
AP90, AP (not on Github; Sampada)
also, CCS (done, not yet installed)

This list includes all the Sanskrit-English dictionaries (11 + 1 = AP), all the Sanskrit-German dictionaries (4), the Sanskrit-Sanskrit dictionaries (2), and one of the two Sanskrit-French dictionaries.

faultfinder TODO (1)

A reasonable next step to do would be to complete the faultfinder analysis of corrections for the remaining dictionaries with Sanskrit headwords. The list of dictionaries with Sanskrit headwords where the faultfinder suggestions have NOT yet been examined is:

Specialized Dictionaries:
INM, VEI, PUI, ACC, KRM, IEG, SNP, PE, PGN, MCI
Sanskrit-Latin: BOP
Sanskrit-French: STC
Sanskrit-English: PD

It is possible that some of these (e.g., KRM) may be inappropriate for faultfinder analysis. The longest list comes from PD.

faultfinder TODO (2)

It is sometimes the case that faultfinder suggestions are 'false positives', in the sense that examination of the case leads to the conclusion that the existing digitization spelling of the headword is correct. For some of the examined dictionaries, a list of these false positives has been preserved within a comment on an issue in the issues list. It might be worthwhile to collect these lists into a file, so that further work does not re-examine the same cases. The file could have a simple structure: a file of lines, with each line containing a headword and a dictionary code.
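A small sketch of how such a false-positives file might be used, assuming the two-column 'headword dictionary-code' line layout proposed above (the function names here are hypothetical, not from any existing program):

```python
def load_false_positives(lines):
    """Parse lines like 'agni MW72' into a set of (headword, dict-code) pairs.

    Accepts any iterable of strings, so an open file object works directly;
    blank or malformed lines are skipped.
    """
    fps = set()
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            fps.add((parts[0], parts[1]))
    return fps

def filter_candidates(candidates, fps):
    """Drop (headword, dict-code) candidates already judged correct earlier."""
    return [(hw, d) for (hw, d) in candidates if (hw, d) not in fps]
```

With this in place, a faultfinder run could filter its candidate list against the file before presenting anything for manual review.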

Headword correction candidates via alphabetical ordering

The general principle of the faultfinder approach to finding possible misspelled words is that there are patterns to the spelling of Sanskrit headwords. Using MW as the reference, a list of common correct spelling patterns can be found. Then, if the spelling of a word in a test dictionary has a spelling pattern found in none of the headwords of the reference, it is reasonable to think the word might be misspelled.
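The pattern principle can be illustrated with a toy sketch. This is NOT Dhaval's actual faultfinder logic, just an illustration of the idea, using character bigrams as the 'spelling patterns' and flagging any test word containing a bigram never seen in the reference headwords:

```python
def reference_patterns(ref_headwords, n=2):
    """Collect every n-character substring seen in the reference headwords."""
    pats = set()
    for w in ref_headwords:
        for i in range(len(w) - n + 1):
            pats.add(w[i:i + n])
    return pats

def suspicious(word, pats, n=2):
    """True if the word contains an n-gram never seen in the reference."""
    return any(word[i:i + n] not in pats
               for i in range(len(word) - n + 1))
```

A real faultfinder would use patterns informed by Sanskrit phonology rather than raw bigrams, but the shape of the computation is the same: build a pattern set from the reference, then test each headword of the other dictionary against it.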

Experience has shown this approach to be not only productive but fairly efficient in generating lists of possibly misspelled words.

There are some kinds of spelling errors that are missed by this approach, however. So other approaches have also been examined.

One of these approaches is based on the observation that in most dictionaries the headwords appear in alphabetical order. Thus, a headword which is OUT of alphabetical order MAY be out of order because of a spelling error. Examining the spelling of words out of alphabetical order has been undertaken for just a few dictionaries:

 SKD, WIL, VCP, GRA

This approach was also begun for AP, but so many false positives were found that it was not completed.

It might be worthwhile trying this approach on other dictionaries.
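The core of the out-of-order check can be sketched simply. One caveat: plain SLP1 strings do NOT sort in Sanskrit dictionary order under ASCII comparison, so a real run needs a sort-key function mapping each headword to its proper collation position; the sketch below takes such a key as a parameter and defaults to raw string order only for illustration:

```python
def out_of_order(headwords, key=lambda w: w):
    """Return headwords that sort before their predecessor in list order.

    Each flagged word is a candidate for a spelling (or ordering) error.
    """
    flagged = []
    for prev, cur in zip(headwords, headwords[1:]):
        if key(cur) < key(prev):
            flagged.append(cur)
    return flagged
```

As the AP experience mentioned above suggests, dictionaries that deliberately deviate from strict alphabetical order will flood this check with false positives.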

Use of fuzzy comparison to generate suggestions

The examination of a list of possibly misspelled headwords from a given dictionary is a conceptually simple task, but it is quite labor-intensive in practice. There is value in specialized computer programs which can assist the person doing the examination. One such program which I've found useful for some dictionaries makes use of the interesting notion of edit distance between two words, also called 'Levenshtein' distance. I find the computer implementation of the algorithm hard to understand, but the idea is easy. For instance, the distance between 'dog' and 'dot' would be 1, since replacement of 1 character (the 'g' by a 't') changes the spelling of 'dog' to 'dot'.
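For reference, the standard dynamic-programming implementation of Levenshtein distance is short, even if (as noted above) not especially transparent; this is a generic textbook version, not the program used in the actual correction work:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from '' to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ''
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]
```

The 'dog'/'dot' example from the text gives a distance of 1 under this definition.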

Here's how I've used this edit-distance notion to generate correction suggestions for a possibly misspelled Sanskrit headword X from dictionary D. The sanhw1.txt file gives a list of Sanskrit headwords, as currently spelled in the Cologne digitizations of all the dictionaries. We suspect that X is misspelled, and that its corrected spelling might appear as a headword in some other dictionary. So, look at all words Y in dictionaries other than D from sanhw1, and keep only those Y whose spelling is almost the same as X; the result will be a list L of spelling suggestions for X.

In practice, there are several important details that must be added to this approach. Since there are roughly 300,000 headwords in sanhw1, it is computationally impractical to examine ALL of these headwords. So one or more techniques must be used to prune the initial list. The first approach I've taken is to assume that the first letter of X is correct, and thus to compute the edit distance of X from Y only for those Y which have the same first letter.

All the headwords are assumed to be spelled in the SLP1 transliteration. An equally good choice would be the WX transliteration. However, transliterations like HK, ITRANS, and IAST would be less good, since in these multiple letters are often required to spell a single letter of the Sanskrit alphabet. Unicode Devanagari itself might also be a less good choice, due, for instance, to the requirement of a virAma character in conjunct consonants - however, it might be interesting to investigate edit distance between Devanagari spellings.

The second pruning of L is accomplished by considering only Y whose edit distance from X is no larger than some prespecified maximum M, such as 1, 2, or 3. Choosing too small a value of M might yield too few (or no) suggestions in L, while too large a value of M will generate too many suggestions in L. I think I've usually used M = 2.
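The two prunings combine into a short routine. This is a sketch of the idea, not the unpublished program itself; the headword list is assumed already loaded from sanhw1.txt and restricted to dictionaries other than D, and the names are illustrative:

```python
def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def suggestions(x, other_headwords, m=2):
    """Suggestion list L for suspect word x.

    First pruning: keep only candidates sharing x's first letter.
    Second pruning: keep only candidates within edit distance m of x.
    """
    return [y for y in other_headwords
            if y and y != x and y[0] == x[0] and edit_distance(x, y) <= m]
```

The first-letter filter is what makes the computation practical over roughly 300,000 headwords, since the expensive distance calculation runs only on a small fraction of them.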

A third enhancement is based on alphabetical ordering. The headword X appears in the given dictionary D between two words X1 (the headword before X) and X2 (the headword after X). Thus, those Y in the suggestion list L are preferred which have the further property that X1<=Y<=X2. This preference is used to partition the list L into two parts (one part has the alphabetical ordering property and the other does not have this property).
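The partition step can be sketched as follows. As with the ordering check earlier, raw string comparison only approximates Sanskrit alphabetical order, so a real run would supply a proper collation key; the default here is for illustration only:

```python
def partition_by_order(sugs, x1, x2, key=lambda w: w):
    """Split suggestion list L into (preferred, others).

    A suggestion y is preferred when x1 <= y <= x2, i.e. when it falls
    between the suspect word's alphabetical neighbours in dictionary D.
    """
    preferred, others = [], []
    for y in sugs:
        (preferred if key(x1) <= key(y) <= key(x2) else others).append(y)
    return preferred, others
```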

This describes how I've used fuzzy-suggestion generation. For many dictionaries, these suggestions have proved quite useful as an adjunct to correcting spelling errors flagged by faultfinder, and have thus sped up the examination process. The program which generates the list is a Python program which has thus far not been published, I think. If anyone wants to use it, I can make it available.

Compare headwords in 'similar' dictionaries

If we had two independently generated digitizations, D1 and D2, of the same dictionary, then a good way to search for spelling errors would be to look for differences between D1 and D2. Resolution of these differences would likely result in an improved, unified digitization D for the dictionary.
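At the headword level, the comparison amounts to a set difference between the two digitizations' headword lists; a minimal sketch (list names are illustrative):

```python
def compare_headwords(d1, d2):
    """Return (only_in_d1, only_in_d2) as sorted lists.

    Headwords appearing in only one digitization are the candidates for
    spelling errors (in one digitization or the other).
    """
    s1, s2 = set(d1), set(d2)
    return sorted(s1 - s2), sorted(s2 - s1)
```

Each flagged headword still needs manual resolution against the print edition, since either digitization may be the one in error.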

This technique has been used by Malten in recent digitizations (those of 2012-2014) in the initial digitization phase, where he has reported the use of double entry.

The first use I made of this comparison approach was with two digitizations, the Tirupati edition and the Cologne edition, of the Vacaspatyam dictionary (the work is here, but it is rather hard to understand). This comparison is complicated by the fact that the two digitizations were done according to different principles. However, one (of several) end results is that a comparison of headwords was possible - I don't think the specific headword comparison has been published, although it has been used informally by Sampada in the examination of headword corrections in the Cologne Vacaspatyam. It might be useful to systematically examine this headword comparison. Also, there is a 'line-by-line' comparison of the two digitizations, along with edit distance, that could be used for non-headword spelling correction. This is a big task, but would likely permit a unified digitization (incorporating not only better spelling, but also the existing markup of grammatical forms of the Tirupati edition).

Recently, I adapted this comparison technique to apply to the headword lists from Wilson and Yates (see the Wil-YAT repository for all the programs and data). This comparison yielded many corrections that had fallen 'under the radar' of faultfinder. The comparison was reasonable, since Yates explicitly based his dictionary on that of Wilson.

A similar comparison between Wilson and the Shabda-Sagara dictionary would also be fruitful, I think, in finding corrections to SHS, since SHS is also overtly based on Wilson.

Similar comparisons between headword lists among the Böhtlingk dictionaries (PWG, PW) and also Cappeller dictionaries (CAE, CCS) might prove fruitful, whenever anyone wants to try them.

Another pair would be AP90 and AP, since AP is developed as a revision of AP90.

Correction of Missing data

Most of Sampada's Sanskrit dictionary work in 2014 was aimed at filling in the 'missing' data for the various dictionaries. These missing-data cases were ones where Thomas's group inserted question marks (e.g., in the form {?}, or some similar identifiable form) at places where the print edition was unreadable or used difficult-to-code symbols. This work has been done for the dictionaries listed below; note that these are for the most part NOT headword corrections.

 VCP (about 4300 cases), AE (500), AP90 (700), AP (40), BEN (230), BOR (220), MW72 (240),
MWE (170), SKD (110), PWG (350), PW (40)
Total of these cases = 6900.
During this work, it was noted that no missing data was marked in:
BHS, CAE, CCS, GRA, MW, SCH, SNP

According to one study, here are the other dictionaries with missing data cases; these have not yet been examined:

ACC (22), BOP (12), BUR (16), GST (21), IEG (2), INM (102), 
KRM (23), MCI (9), MD (1), PE (19), PGN (4), PUI (5), 
SHS (10), STC (1), VEI (167), WIL (1), YAT (1)

Miscellaneous corrections