drdhaval2785 / SanskritSpellCheck

spell checking based on patterns

faultfinder1.php variant #7

Closed funderburkjim closed 9 years ago

funderburkjim commented 9 years ago

I downloaded a zip of SanskritSpellCheck and made a variant of the faultfinder program. You're welcome to put it in your repository if you want to.

The program and two extra files are at https://dl.dropboxusercontent.com/u/29859999/faultfinder1.zip

Here is a summary of the changes:

  1. The program must now be run at the command line with four parameters:
 php faultfinder1.php <dict1> <dict2> <dictref> <output>

For instance, to generate the output comparable to the 'suspectfalse.html' at the time of the download:

php faultfinder1.php VCP MW MW vcp_mw_suspectfalse.html

<dict1> code is used to construct first filename; e.g. VCPslp.txt.
<dict2> code is used to construct second filename; e.g. MWslp.txt

As with faultfinder, faultfinder1 assumes these two files are in the same directory as the program.

<dictref> code is used to construct the 'constant' part of the href that is printed for each candidate.

A special constant associative array gets the 'year' part of the href based on the <dictref> code:

$hrefyeardict=array("ACC" => "2014","AE" => "2014","AP" => "2014","AP90" => "2014",
       "BEN" => "2014","BHS" => "2014","BOP" => "2014","BOR" => "2014",
       "BUR" => "2013","CAE" => "2014","CCS" => "2014","GRA" => "2014",
       "GST" => "2014","IEG" => "2014","INM" => "2013","KRM" => "2014",
       "MCI" => "2014","MD" => "2014","MW" => "2014","MW72" => "2014",
       "MWE" => "2013","PD" => "2014","PE" => "2014","PGN" => "2014",
       "PUI" => "2014","PWG" => "2013","PW" => "2014","SCH" => "2014",
       "SHS" => "2014","SKD" => "2013","SNP" => "2014","STC" => "2013",
       "VCP" => "2013","VEI" => "2014","WIL" => "2014","YAT" => "2014");

<output> is the pathname of the output file, such as the example vcp_mw_suspectfalse.html.

The calling sequence of the main subroutine comparepatterns was changed so that all parameters are passed as arguments. No global variables are required.

The first and third parameters of comparepatterns were changed to point to the arrays of key1 values associated with dict1 and dict2, rather than to filenames. These arrays are filled from the files BEFORE the loop for($b=0;$b<10;$b++). Thus, the key files only need to be read once, rather than 10 times. This makes the program more efficient.
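The effect of reading the key files once can be sketched in Python (the language of sanhw1.py) rather than PHP. This is only an illustration under stated assumptions: `load_keys`, `compare_patterns`, `run_all`, and the regular-expression patterns are hypothetical stand-ins, not the actual logic of faultfinder1.php.

```python
import re


def load_keys(path):
    """Read one headword per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def compare_patterns(keys1, pattern, known):
    """Hypothetical stand-in for comparepatterns: headwords of keys1
    that match `pattern` and are not in the reference set `known`."""
    rx = re.compile(pattern)
    return [k for k in keys1 if rx.search(k) and k not in known]


def run_all(keys1, keys2, patterns):
    # keys1 and keys2 were loaded once by the caller; the reference set is
    # also built once, and the loop over patterns only reuses them.
    known = set(keys2)
    return {p: compare_patterns(keys1, p, known) for p in patterns}
```

A caller would load each file a single time, for example `run_all(load_keys("VCPslp.txt"), load_keys("MWslp.txt"), patterns)`, instead of reopening the files for every pattern.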

The max_execution_time and memory_limit settings were commented out; they are not needed in the php command line interface.

The output file is in the zip file.

A file comparison was done between the two versions of what should be the same output file:

In Git Bash for windows,

diff suspectfile.html vcp_mw_suspectfalse.html > vcp_mw_diff.txt

Interestingly, the diff shows the files to be identical up through line 19889. The first file ends at line 19890 and has no cases for the last three patterns (b=7,8,9).

This may be due to a timeout or memory limit in the execution of findfault.php.
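The same check (how far two outputs agree, and whether one is simply a truncated copy of the other) could also be done with a few lines of Python. This is a minimal sketch; the function names `common_prefix_lines` and `describe` are made up for illustration.

```python
def common_prefix_lines(lines_a, lines_b):
    """Number of leading lines on which the two sequences agree."""
    n = 0
    for a, b in zip(lines_a, lines_b):
        if a != b:
            break
        n += 1
    return n


def describe(lines_a, lines_b):
    """Report the agreeing prefix and whether one input is a truncated copy."""
    n = common_prefix_lines(lines_a, lines_b)
    shorter = min(len(lines_a), len(lines_b))
    # If agreement runs to the end of the shorter input and the lengths
    # differ, the shorter input is a truncated copy of the longer one.
    truncated = n == shorter and len(lines_a) != len(lines_b)
    return {"common": n, "truncated": truncated}
```

The inputs would be the lists returned by `readlines()` on the two HTML files.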

One advantage of having faultfinder1 to be a command-line program, is that someone could make a batch file to run a sequence of several comparisons.

One other enhancement that might be worth doing is to have a 'known-exceptions' file for each dictionary; say MW_exceptions.txt, for instance.
That way, special cases like the known Vowel-Vowel patterns could be ignored. All those occurrences in MW probably need some special handling, and they likely just get in the way as 'false positives' most of the time.
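A minimal sketch of how such an exceptions file might be applied, written in Python. The file format (one SLP1 headword per line) is an assumption, and `load_exceptions` and `drop_known_exceptions` are hypothetical helper names; nothing like this exists in faultfinder1.php yet.

```python
def load_exceptions(path):
    """Assumed format: one SLP1 headword per line, blank lines ignored."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def drop_known_exceptions(suspects, exceptions):
    """Keep only the suspect headwords that are not listed as known exceptions."""
    return [w for w in suspects if w not in exceptions]
```

With this approach the vowel-vowel pattern itself stays active; only the individual words already checked by hand (for instance those listed in MW_exceptions.txt) are suppressed.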

drdhaval2785 commented 9 years ago

Thank you very much Jim. Great enhancement. I have included it in the repository now.

"One advantage of having faultfinder1 to be a command-line program, is that someone could make a batch file to run a sequence of several comparisons."

I heartily agree. This will reduce the burden of creating the various comparison files.

"One other enhancement that might be worth doing is to have a 'known-exceptions' file for each dictionary; say MW_exceptions.txt, for instance. That way, special cases like the known Vowel-Vowel patterns could be ignored. All those occurrences in MW probably need some special handling, and they likely just get in the way as 'false positives' most of the time."

I agree that we should ignore some special words, but my submission is that we should not ignore the patterns themselves. We can have an exception list of words having those vowel-vowel patterns. By ignoring vowel-vowel patterns, we would miss many true positives; by not ignoring them, we get too many false positives.

The best approach would be:

  1. Cull out a list of all vowel-vowel combinations from all dictionaries.
  2. Check them for veracity.
  3. Correct them if wrong.
  4. Thereafter, remove the vowel-vowel combination from our code entirely.

Point 1 - can you prepare such a list with href to its page @funderburkjim ?

gasyoun commented 9 years ago

I even tried to do the same batch coding myself, but failed. @funderburkjim Is it possible to compare each headword list with every other one and find out which are the most unique?

funderburkjim commented 9 years ago

See https://dl.dropboxusercontent.com/u/29859999/sanhw1.zip

The included sanhw1.txt is a first pass at merging the headwords in ALL the general dictionaries that have Sanskrit headwords, namely those whose Cologne codes are in any of:

san_en_dicts = ["WIL","YAT","GST","BEN","MW72","AP90","CAE","MD",
               "MW","SHS","BHS","AP","PD"]
san_fr_dicts = ["BUR","STC"]
san_de_dicts = ["PWG","GRA","PW","CCS","SCH"]
san_lat_dicts = ["BOP"]
san_san_dicts = ["SKD","VCP"]

This was done by the included program sanhw1.py. Here's the terminal output:

> python26 sanhw1.py sanhw1.txt
44577 hws extracted from dict ../../WILScan/2014/pywork/wilhw2.txt
45205 hws extracted from dict ../../YATScan/2014/pywork/yathw2.txt
6776 hws extracted from dict ../../GSTScan/2014/pywork/gsthw2.txt
17314 hws extracted from dict ../../BENScan/2014/pywork/benhw2.txt
55379 hws extracted from dict ../../MW72Scan/2014/pywork/mw72hw2.txt
31635 hws extracted from dict ../../AP90Scan/2014/pywork/ap90hw2.txt
40081 hws extracted from dict ../../CAEScan/2014/pywork/caehw2.txt
20748 hws extracted from dict ../../MDScan/2014/pywork/mdhw2.txt
193966 hws extracted from dict ../../MWScan/2014/mwaux/mwkeys/extract_keys_b.txt
47310 hws extracted from dict ../../SHSScan/2014/pywork/shshw2.txt
17807 hws extracted from dict ../../BHSScan/2014/pywork/bhshw2.txt
36704 hws extracted from dict ../../APScan/2014/pywork/aphw2.txt
107620 hws extracted from dict ../../PDScan/2014/pywork/pdhw2.txt
19774 hws extracted from dict ../../BURScan/2013/pywork/burhw2.txt
24573 hws extracted from dict ../../STCScan/2013/pywork/stchw2.txt
122731 hws extracted from dict ../../PWGScan/2013/pywork/pwghw2.txt
10904 hws extracted from dict ../../GRAScan/2014/pywork/grahw2.txt
135785 hws extracted from dict ../../PWScan/2014/pywork/pwhw2.txt
29986 hws extracted from dict ../../CCSScan/2014/pywork/ccshw2.txt
28755 hws extracted from dict ../../SCHScan/2014/pywork/schhw2.txt
8955 hws extracted from dict ../../BOPScan/2014/pywork/bophw2.txt
42244 hws extracted from dict ../../SKDScan/2013/pywork/skdhw2.txt
48379 hws extracted from dict ../../VCPScan/2013/pywork/vcphw2.txt
409361 hws written to sanhw1.txt

The output of a typical line in sanhw1.txt is:

afRin:AP90,GST,MW,MW72,PW,PWG,SHS,STC,VCP,WIL

whose form is
<slp-headword>:CODE1,CODE2,...

The lines of sanhw1.txt are sorted in Sanskrit alphabetical order. For those headwords occurring in more than 1 dictionary, the dictionary codes are sorted in English Alphabetical order.
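Given the line format above, reading sanhw1.txt back is straightforward. The following Python sketch parses a line and, as one possible answer to the earlier question about uniqueness, counts the headwords that occur in only one dictionary. The function names `parse_line` and `unique_counts` are illustrative and are not part of sanhw1.py.

```python
from collections import Counter


def parse_line(line):
    """Split '<slp-headword>:CODE1,CODE2,...' into (headword, [codes])."""
    headword, codes = line.rstrip("\n").split(":", 1)
    return headword, codes.split(",")


def unique_counts(lines):
    """Count, per dictionary code, the headwords found in that dictionary only."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        _, codes = parse_line(line)
        if len(codes) == 1:
            counts[codes[0]] += 1
    return counts
```

Calling `unique_counts(open("sanhw1.txt", encoding="utf-8"))` would give a rough per-dictionary measure, subject to the spelling differences noted below.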

No 'Adjustment' to the headwords of Xhw2.txt was done. So, for instance, these two are undoubtedly different headword spellings of the same word:

aMkuwaH:AP90
   ...
aNkuwa:GST,MW,MW72,PD,PW,PWG,SHS,WIL

drdhaval2785 commented 9 years ago

Great. Now we have a file which can be used as reference to check whether the suspect errors occur in any other dictionaries or not.
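As a rough sketch of that use, the following Python builds a lookup from sanhw1.txt and reports which other dictionaries attest a suspect headword. `build_index` and `attested_elsewhere` are hypothetical names, and an exact string match is assumed, so variant spellings such as aMkuwaH versus aNkuwa would not be linked.

```python
def build_index(lines):
    """Map each headword to the set of dictionary codes that contain it."""
    index = {}
    for line in lines:
        line = line.strip()
        if line:
            headword, codes = line.split(":", 1)
            index[headword] = set(codes.split(","))
    return index


def attested_elsewhere(word, own_code, index):
    """Sorted codes of the other dictionaries that also list `word`."""
    return sorted(index.get(word, set()) - {own_code})
```

A suspect for which `attested_elsewhere` returns an empty list occurs in no other dictionary under that exact spelling, which makes it a stronger candidate for manual checking.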

gasyoun commented 9 years ago

Jim, a lovely python! 409361 words. One day we will get rid of the duplicates caused by the different orthographic standards and the Indian versus European ways of writing the same words. Still, the number will be very large due to PD, at least 100k bigger than my biggest list before.

drdhaval2785 commented 9 years ago

Great. Finally I configured my windows to run PHP using https://www.youtube.com/watch?v=neBVQBL_2P0.

Running faultfinder1.php from the command line is fun. @funderburkjim Can we eliminate two more arguments?

php faultfinder1.php dict1 dict2 dictref output

Here dict1 and dict2 are non-negotiable, but we can remove dictref and output:

dictref = dict2 and output = dict2.'vs'.dict1.'.html'

So these two are superfluous. We decide that we will store the headword lists with the code+slp.txt e.g. MWslp.txt, VCPslp.txt.

If I want to compare MW (dict2) against VCP (dict1), in the current code I would have to write -> php faultfinder1.php VCP MW MW MWvsVCP.html

In the proposed alteration we would have to write -> php faultfinder1.php VCP MW

The remaining two we can derive from the first two arguments, right?

gasyoun commented 9 years ago

@drdhaval2785 that sounds like an excellent optimization. I would propose that you then make a .bat file that runs the command-line commands as well, so all you have to do is double-click.

funderburkjim commented 9 years ago

@drdhaval2785 Sure, that seems readily doable. gasyoun's comment reminds me that there are (at least) two ways to do it:

(a) It could be done by modifying faultfinder1.php.

(b) It could be done by making a batch file that takes command-line arguments: faultfinder1.bat VCP MW

The batch file then constructs arguments 3 and 4 (arg3 = arg2, arg4 = <arg2>vs<arg1>.html) and invokes the current faultfinder1.php.
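For readers more comfortable with Python than with batch-file parameters, method (b) could be sketched as a small wrapper script. This is an assumption-laden illustration: `build_command` and `main` are made-up names, and it presumes that `php` is on the PATH and that faultfinder1.php is in the current directory.

```python
import subprocess


def build_command(dict1, dict2, script="faultfinder1.php"):
    """Fill in arguments 3 and 4 from the first two."""
    dictref = dict2                          # arg3 = arg2
    output = dict2 + "vs" + dict1 + ".html"  # arg4 = <arg2>vs<arg1>.html
    return ["php", script, dict1, dict2, dictref, output]


def main(argv):
    """argv holds the two dictionary codes, e.g. ["VCP", "MW"]."""
    if len(argv) != 2:
        return "usage: <dict1> <dict2>"
    # Assumes php is on PATH and faultfinder1.php is in the current directory.
    return subprocess.call(build_command(argv[0], argv[1]))
```

A script file would end with `sys.exit(main(sys.argv[1:]))` after `import sys`, so that the wrapper takes just the two codes `VCP MW` and expands them to `php faultfinder1.php VCP MW MW MWvsVCP.html`.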

If I were doing this mod, I would probably choose method (a), since I am less familiar with dealing with batch file parameters.

You could also do method (a) so that the 3rd and 4th args are optional: Starting at lines 32,33 of faultfinder1.php, the following code probably works:

// Arguments 3 and 4 are optional; isset() avoids an undefined-offset notice.
$dictref = isset($argv[3]) ? $argv[3] : null;
$output = isset($argv[4]) ? $argv[4] : null;
if (!$output) {
 // If there are NOT 4 arguments, compute
 // default values for dictref and output from dicta and dictb
 $dictref = $dictb;
 $output = $dictb . "vs" . $dicta . ".html";
}

drdhaval2785 commented 9 years ago

@Jim and @gasyoun After the discussion on batch processing options, I gave it a second thought.

If properly done, all arguments can be done away with. We have this magic array from Jim, which maps each dictionary abbreviation to the year used in the href.

$hrefyeardict=array("ACC" => "2014","AE" => "2014","AP" => "2014","AP90" => "2014",
       "BEN" => "2014","BHS" => "2014","BOP" => "2014","BOR" => "2014",
       "BUR" => "2013","CAE" => "2014","CCS" => "2014","GRA" => "2014",
       "GST" => "2014","IEG" => "2014","INM" => "2013","KRM" => "2014",
       "MCI" => "2014","MD" => "2014","MW" => "2014","MW72" => "2014",
       "MWE" => "2013","PD" => "2014","PE" => "2014","PGN" => "2014",
       "PUI" => "2014","PWG" => "2013","PW" => "2014","SCH" => "2014",
       "SHS" => "2014","SKD" => "2013","SNP" => "2014","STC" => "2013",
       "VCP" => "2013","VEI" => "2014","WIL" => "2014","YAT" => "2014");

Proposed method to do batch processing.

  1. Jim provides a list of all headwords in text files 'dictname'.'slp.txt' -> MWslp.txt, VCPslp.txt, etc.
  2. We draw the dict $dicta from the keys of the array given above, e.g. MW from "MW" => "2014".
  3. We run a loop and pass $dictb from the rest of the array, e.g. MW is compared against all files except MW. (This way we can get ACCvsMW.html, AEvsMW.html, ..., YATvsMW.html.)
  4. We run a loop which passes $dicta also from the array. So we will be passing ACC, AE, etc. as the base file and testing.

Once both of these loops are over, we get a plethora of files to check. Volunteers may try testing them. This script would make our life easier.
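The two loops can be sketched in Python as a generator of comparison jobs. This is only an outline of the proposal, not existing code: `comparison_jobs` is a hypothetical name, and the output naming follows the dictb."vs".dicta.".html" convention used earlier in the thread.

```python
from itertools import permutations


def comparison_jobs(codes):
    """Yield (dicta, dictb, output) for every ordered pair of distinct codes."""
    for dicta, dictb in permutations(codes, 2):
        # Same naming convention as the two-argument default: <dictb>vs<dicta>.html
        yield dicta, dictb, dictb + "vs" + dicta + ".html"
```

Each job corresponds to one run of `php faultfinder1.php dicta dictb`. With all 36 codes in $hrefyeardict this would produce 36 × 35 = 1260 output files, provided a headword file exists for every code.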

Jim would be able to add these two loops for sure.

gasyoun commented 9 years ago

Sounds magic. Is the magician still around?

funderburkjim commented 9 years ago

I've been focusing on changing the base coding from HK to SLP1, which is very tedious due to the peculiarities of the digitizations for each dictionary. Thus far I've been dealing with the dictionaries that come up in the hiatus list, and have handled GST, PW, PWG, BEN, CAE, CCS, MW72, and GRA (no SLP1 conversion required here, since no HK). There are still many dictionaries to go.

Regarding the idea of generating a 0-parameter function to generate a 'plethora of files to check': sure, could be done but I am doubtful of the utility. Better to do 'meaningful' comparisons one by one.

I am not sure how you have obtained MWslp.txt, VCPslp.txt, etc. I have some thoughts on that (making use of 'sanhw1.txt'), but would first like to understand how you've gotten your Xslp.txt files.

Also, @drdhaval2785 , are you looking to me to do programming here? I thought you had the programming under control.

gasyoun commented 9 years ago

A 'plethora of files' does sound fun. If we can figure out how to do it once, it could come in handy later, for example in finding out how unique each list is compared to the others. 'Meaningful comparisons' are out of fashion nowadays, Jim. You opened Pandora's box with bulk CMD operations, and now we are spoiled. HK to SLP1 sounds like a lot of joy to the world, so we will not disturb you in the coming days. If only you would make a few more videos; it has been a long time with no updates. Just the way you work with files and the logic behind it would do. Thanks for considering.

drdhaval2785 commented 9 years ago

faultfinder2.php has only two arguments. @funderburkjim's code at https://github.com/drdhaval2785/SanskritSpellCheck/issues/7#issuecomment-62490204 works fine. So now php faultfinder2.php PW VCP created VCPvsPW.html file.

Now only https://github.com/drdhaval2785/SanskritSpellCheck/issues/7#issuecomment-62495815 remains. That can wait, because I don't have SLP files for most of the dictionaries.

drdhaval2785 commented 9 years ago

A partial version of this is done in faultfinder3.php (https://github.com/sanskrit-lexicon/CORRECTIONS/issues/37), using sanhw1.txt instead of separate dictionaries and comparing with MW data.