drdhaval2785 / SanskritSpellCheck

spell checking based on patterns

Simple Sandhi Proofreader (nṭ -> ṇṭ, sṭh -> ṣṭh) #1

Closed gasyoun closed 9 years ago

gasyoun commented 9 years ago

As part of the "Non-sandhi Pattern Checker" code, I would like a way to find, in a .txt or .doc file, words that look fishy according to some of the word sandhi rules that are easy to code. If the text has nṭ, it should probably have been ṇṭ. Mark such cases. That's an easy one. But what if we see "ṛr"? In books this is hardly possible, but I see this kind of mistake in files quite often. Everything outside the sandhi table should be marked. Is that possible?

Shalu411 commented 9 years ago

Namaste

  1. Never possible: न्ट, न्ठ, न्ड, न्ढ >> (correct forms, respectively) ण्ट, ण्ठ, ण्ड, ण्ढ
  2. The same goes for ष्त, ष्थ >> ष्ट, ष्ठ. After ष्, the त-वर्ग is never possible. (I do not know whether the same applies to ष्द, ष्ध >> ष्ड, ष्ढ. I do not remember ever seeing these combinations.)
  3. Similarly, स्ट, स्ठ, स्ड, स्ढ are impossible. Maybe we can make a similar list for all such impossible cases?
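A minimal sketch of how the list above could be turned into a checker, assuming plain Devanagari input. The names `IMPOSSIBLE_CLUSTERS` and `find_suspects` are made up for illustration. A suggestion is given only where this thread states one; `None` means the cluster is only named as impossible.

```python
import unicodedata

# Clusters named in this thread, mapped to the correction the thread gives.
# None = listed as impossible, but no replacement is stated.
IMPOSSIBLE_CLUSTERS = {
    "न्ट": "ण्ट", "न्ठ": "ण्ठ", "न्ड": "ण्ड", "न्ढ": "ण्ढ",
    "ष्त": "ष्ट", "ष्थ": "ष्ठ",
    "स्ठ": "ष्ठ",  # from the issue title (sṭh -> ṣṭh)
    "स्ट": None, "स्ड": None, "स्ढ": None,
}

def find_suspects(text):
    """Return (offset, cluster, suggestion) for every impossible cluster found."""
    text = unicodedata.normalize("NFC", text)
    hits = []
    for cluster, suggestion in IMPOSSIBLE_CLUSTERS.items():
        start = text.find(cluster)
        while start != -1:
            hits.append((start, cluster, suggestion))
            start = text.find(cluster, start + 1)
    # Sort by offset so the hits come out in reading order.
    return sorted(hits, key=lambda hit: hit[0])
```

The function only marks; it does not change the text, so false positives cost a glance rather than a wrong replacement.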

"ṛr"?...but I see this kind of mistakes in files quite often....-- Example please?

gasyoun commented 9 years ago

Example Case 273

Shalu411 commented 9 years ago

Case 273: 9/29/2014, dict=CCS, L=29015, hw=svfrRara, user=gas, old=svṛrṇara, new=svarṇara, comment=dirty scan, status=Corrected

OK, right. Then that's an impossible case, so it can be listed.

drdhaval2785 commented 9 years ago

I guess we don't need to do so. We test dictionary A with dictionary B as the base, and then B with A as the base. There is no possibility that both files have the same data-entry error. Therefore, it will be detected in one of the output files.
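The two-way comparison can be sketched roughly as follows, assuming each dictionary is available as a plain list of headwords. The function name `cross_check` is hypothetical, and this is not the repository's actual code.

```python
def cross_check(headwords_a, headwords_b):
    """Return (only_in_a, only_in_b): headwords absent from the other list."""
    set_a, set_b = set(headwords_a), set(headwords_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)

# Illustrative input using the SLP1 headword from Case 273.
only_a, only_b = cross_check(["svarRara", "deva"], ["svfrRara", "deva"])
```

Both output lists are only suspect lists: a word that genuinely occurs in just one dictionary shows up as well, so each entry still needs a human check.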

Once we correct all 6-7 Cologne dictionaries, we will get cleaner files. From those cleaner files, we will be able to derive the proper allowable CV (Consonant-Vowel) patterns for Sanskrit in the future.

For that we need to correct the errors. Try helping at https://github.com/sanskrit-lexicon/CORRECTIONS/issues/2

gasyoun commented 9 years ago

Getting 6-7 dictionaries clean is going to take about 2 years, even if you mean headwords only. Patterns will solve 200 cases at best. So no, we need to weed such things out not only by comparing dictionaries. If I run OCR in ABBYY and get a dirty .doc file, I could weed out 30% of the mistakes just by pattern replacement.
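A minimal sketch of such pattern replacement, assuming IAST text. It covers only the replacements stated in this thread (nṭ -> ṇṭ and its ṭh/ḍ/ḍh siblings, ṣt -> ṣṭ, sṭh -> ṣṭh). The names `RULES` and `apply_rules` are made up for illustration.

```python
import re
import unicodedata

def _nfc(s):
    # Normalize so precomposed and combining-mark spellings compare equal.
    return unicodedata.normalize("NFC", s)

# (compiled pattern, replacement) pairs, applied in order.
RULES = [
    (re.compile(_nfc("n(?=[ṭḍ])")), _nfc("ṇ")),  # nṭ, nṭh, nḍ, nḍh -> ṇ...
    (re.compile(_nfc("ṣt")), _nfc("ṣṭ")),         # ṣt, ṣth -> ṣṭ, ṣṭh
    (re.compile(_nfc("s(?=ṭh)")), _nfc("ṣ")),     # sṭh -> ṣṭh
]

def apply_rules(text):
    """Return (corrected_text, number_of_replacements)."""
    text = _nfc(text)
    total = 0
    for pattern, replacement in RULES:
        text, count = pattern.subn(replacement, text)
        total += count
    return text, total
```

Returning the replacement count makes it easy to log how much a run changed before trusting the output.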

drdhaval2785 commented 9 years ago

"6-7 clean dictionaries is about to take 2 years, if you speak about headwords only."

I disagree. For my purpose of finding allowable patterns in Sanskrit, 'clean' would mean dictionaries that give only false positives when compared with the patterns of other dictionaries. Generating the suspect list takes only one day. Testing the suspects may take just 1 month if we have a dedicated person. So it is hardly 1 month before I can give you the correct pattern list.

"Patterns will solve 200 cases at best. So no, we need to weed such things out not only in comparison between dictionaries. If I do an OCR in ABBYY and get a dirty .doc file, I could weed out 30% mistakes just by pattern replacement."

Statistically, it is not wise to make an exclusion list. Out of all possible permutations and combinations of C and V, maybe only 1% are permissible. So rather than a 99% exclusion list, a 1% inclusion list seems more workable to me.
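One way to picture an inclusion list, as a rough sketch only: collect every character bigram seen in a clean SLP1 word list and flag words containing a bigram that was never seen. Using bigrams as the "pattern" unit is an assumption made here for illustration; the thread does not define the pattern format. All function names are hypothetical.

```python
def bigrams(word):
    """Character bigrams of a word, with ^ and $ marking the word edges."""
    padded = f"^{word}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def build_inclusion_list(clean_words):
    """Union of all bigrams attested in the clean word list."""
    allowed = set()
    for word in clean_words:
        allowed |= bigrams(word)
    return allowed

def unseen_patterns(word, allowed):
    """Bigrams of `word` that never occur in the clean list (empty = no suspicion)."""
    return sorted(bigrams(word) - allowed)

# Toy clean list; a real one would come from the corrected dictionaries.
allowed = build_inclusion_list(["svarRa", "kfta", "vfka"])
```

With such a tiny clean list many valid words would be flagged too; the approach only becomes useful once the inclusion list is built from large, corrected data.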

gasyoun commented 9 years ago

An inclusion list in a .doc file: how do you think it should work? I understand replacing bad with good. But merely saying that something bad is bad, what use is that? It will cost me time just to show the errors, not to fix them. My aim is to fix, and fix quickly. Better some false positives than a really dirty text.

drdhaval2785 commented 9 years ago

Examples, please. Do you want it to be a spell-corrector rather than a spell-checker? That's a different story altogether.

gasyoun commented 9 years ago

Here come the examples. And yes, it seems I'm speaking about a spell-corrector, not a spell-checker. Please see my .PDF after I added an OCR layer to it with ABBYY FineReader 12. The same text is in the .DOC.

Several issues can be solved by general RegEx cleanup rules, like svatantr0 in "svatantratvāt svatantr0 hi naiva dpṣeṇa l^pyate"; others, like l^pyate, are harder. But dpṣeṇa should be an easy fix, or at least easy to mark as fishy.
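A small sketch of the marking step for the first kind of error, assuming whitespace-separated IAST text: any token that contains a character which is neither a letter nor a combining mark (a digit, ^, $ and so on) gets flagged. The function names are made up for illustration.

```python
import unicodedata

EDGE_PUNCTUATION = ".,;:!?/|()[]\"'"

def is_word_char(ch):
    # Unicode categories L* (letters) and M* (combining marks) count as word characters.
    return unicodedata.category(ch)[0] in ("L", "M")

def fishy_tokens(text):
    """Return tokens that contain digits or symbols inside the word."""
    flagged = []
    for token in unicodedata.normalize("NFC", text).split():
        core = token.strip(EDGE_PUNCTUATION)
        if core and not all(is_word_char(ch) for ch in core):
            flagged.append(core)
    return flagged
```

On the sample line quoted above this flags svatantr0 and l^pyate. It does not flag dpṣeṇa, which consists of letters only and would need a cluster-based check instead, and it would also flag legitimately hyphenated words.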

So documenting them is possible, the question is what to do with them next?

pūrnaprajñakfteyaip sañkscpād uddhfitih suvākyānām /
śrīmadbhāratagānāip vi$noh pūrpatvanirpay3yaiva /

drdhaval2785 commented 9 years ago

@gasyoun It is beyond the scope of this repository. Once we are through with the corrections in the dictionaries, maybe we should start another project for a spell-corrector. It is not in my purview right now.