drdhaval2785 / SanskritSpellCheck

spell checking based on patterns

NO CHANGE words list in o vs O #11

Open gasyoun opened 8 years ago

gasyoun commented 8 years ago

There are two kinds of issues, the short-lived OCR errors and the everlasting print variants:
1) https://github.com/sanskrit-lexicon/CORRECTIONS/issues/131#issuecomment-152554547 Once niGAnIGa -> niGAniGa (an MW OCR error) is fixed and sanhw is regenerated, the issue will be dead. These are the dead ones.
2) https://github.com/sanskrit-lexicon/CORRECTIONS/issues/131#issuecomment-152554116 The everlasting ones. nirUzmatva and niruzmatva are both legal; if one is marked as fehlerhaft (erroneous) in PW, it will always remain so.

For words from the second list we need to build a NO CHANGE list, @drdhaval2785. Adding them after working with MW means the same issues will not come back when we work on PW and PWG. As we work on at least 2 dictionaries at once, I would love to know how to update the sanhw1 and sanhw2 files myself. Is there an instruction out there, @funderburkjim?

funderburkjim commented 8 years ago

re 'Is there an instruction out there,'

Yes, at https://github.com/sanskrit-lexicon/CORRECTIONS/blob/master/sanhw1/readme.org.

But the program (sanhw1.py, also in the repository) needs data to work. Specifically, it needs

All of these 36 files (1 per dictionary) are available in various downloads. But to update sanhw1.txt when a change has been made to one of the headword lists would require that you have local copies of the updated headword files. That's why it would be hard for you to do, though not impossible.

As a practical matter, for the time being, you should just ask me to update sanhw1.txt (and put the new edition in the sanskrit-lexicon/CORRECTIONS repository) when you think it is needed.

drdhaval2785 commented 8 years ago

I have now devised a program and methodology to build nochange files and incorporate them into any error-finding technique. So the issue under consideration is resolved.

It gave a 10% decrease in file size. So it was a fruitful exercise in toto.
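A minimal sketch of how a nochange list can be applied to an error report, for readers following along. This is not the program mentioned above: the function names, the `wordA:wordB` line format, and the `;` comment marker are all assumptions made for illustration. The word pairs are the two examples quoted earlier in this thread.

```python
# Hypothetical sketch: the names and the file format below are assumptions,
# not taken from the SanskritSpellCheck repository.

def load_nochange(lines):
    """Parse 'wordA:wordB' lines into a set of unordered pairs.

    Blank lines and lines starting with ';' are skipped.
    """
    pairs = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        a, b = line.split(":", 1)
        # frozenset makes the pair order-independent
        pairs.add(frozenset((a.strip(), b.strip())))
    return pairs


def filter_suspects(suspects, nochange):
    """Drop suspect pairs that are recorded in the nochange list."""
    return [(a, b) for a, b in suspects if frozenset((a, b)) not in nochange]


# Suspect pairs as an error-finding pass might report them (SLP1 transliteration).
suspects = [("niGAnIGa", "niGAniGa"), ("nirUzmatva", "niruzmatva")]

# Assumed nochange entries: pairs where both spellings are legal in print.
nochange = load_nochange([
    "; both spellings are attested",
    "nirUzmatva:niruzmatva",
])

remaining = filter_suspects(suspects, nochange)
print(remaining)
```

Storing each pair as an unordered set means an entry added while working on MW also suppresses the same pair when it shows up in reverse order in a PW or PWG report.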

drdhaval2785 commented 8 years ago

Regarding sanhw1.txt, I would give it a try on my local copy. Let me see if I can update it.

gasyoun commented 5 years ago

Let me see if I can update it.

There was no update in more than a year. Fixable?

gasyoun commented 3 years ago

Regarding sanhw1.txt, i would give it a try on my local copy.

Ever managed to?

drdhaval2785 commented 3 years ago

When sanhw1.py and sanhw1.txt were in the CORRECTIONS repository, I did regenerate it at some point.

Now both have moved to the csl-corrections repository. I have not yet tried it there. There has just been a headword correction in mw72.txt, so I am planning to give it a go.

drdhaval2785 commented 3 years ago

I gave it a try. The current sanhw1.py is in the hwnorm1 repository.

When I tried to run it, I got an error because I do not have the AP and PD data locally. It seems I will not be able to get these two dictionaries working locally, because my Koeln username has expired. Only once it is active again will I be able to clone from the bare git repository.

Until then, I request @funderburkjim to generate hwnorm1, sanhw1, and sanhw2 and upload them to GitHub, so that the various error reports produced 4-5 years ago can be regenerated.