Differences in verbs from $verbdata and generatedforms.xml

drdhaval2785 commented 8 years ago

https://github.com/funderburkjim/elispsanskrit/blob/master/pysanskritv1/roots/sanverb_conjtab_cp.txt shows a total of 74 entries that are in $verbdata in excess of generatedforms, and 29 cases which appear only in $verbdata. See this issue.

@funderburkjim The majority of these were deleted manually from wrongformfinder.sh because of some generation issues:

Verbs with upasargas, like Akrand, ASAs, etc.
Verbs which had different padas in the same gaNa. These gave errors and may not have been derived in generatedforms.xml.

A detailed analysis is needed.

funderburkjim commented 8 years ago

Maybe you can regenerate the generated forms?

This file shows 19 cases where the information in the generated forms could not be matched by the current verbdata. I think verbdata may be ahead of generated forms, i.e., generated forms is out of sync with some recent changes in verbdata.
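A minimal sketch of the two-way comparison being discussed, assuming each source has already been reduced to a collection of verb identifiers. The function name and the sample identifiers are hypothetical; the actual formats of $verbdata and generatedforms.xml are not shown in this thread.

```python
def compare_verb_sets(verbdata_ids, generated_ids):
    """Return (only_in_verbdata, only_in_generated) as sorted lists."""
    verbdata_ids = set(verbdata_ids)
    generated_ids = set(generated_ids)
    # Entries that verbdata has but the generated forms lack.
    only_in_verbdata = sorted(verbdata_ids - generated_ids)
    # Entries in the generated forms with no match in verbdata.
    only_in_generated = sorted(generated_ids - verbdata_ids)
    return only_in_verbdata, only_in_generated


# Hypothetical identifiers, for illustration only.
verbdata_ids = ["BU:01:0001", "eDa:01:0002", "saBAja:10:0429"]
generated_ids = ["BU:01:0001", "eDa:01:0002", "spardDa:01:0003"]

only_vd, only_gen = compare_verb_sets(verbdata_ids, generated_ids)
print("only in verbdata:", only_vd)
print("only in generated forms:", only_gen)
```

If the two sources are in sync, both lists come back empty, which is the reason for wanting them regenerated together.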

funderburkjim commented 8 years ago

Also, separately, I noticed a likely misspelling in verbdata: I think verbwithoutanubandha should be saBAja rather than samAja:

saBAja:prItidarSanayoH prItisevanayorityeke:samAja:10:0429:u:sew:स॑भा॑ज॑::::saBAja1_saBAja_curAxiH+MISSING:
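A small sketch of how the suspect field in the record above could be inspected and corrected. It assumes the record is a plain colon-separated line and that the third field (index 2) is verbwithoutanubandha; that position is inferred from the value samAja in the record, not from any documented schema. The helper name replace_field is hypothetical.

```python
record = (
    "saBAja:prItidarSanayoH prItisevanayorityeke:samAja:10:0429:u:sew:"
    "स॑भा॑ज॑::::saBAja1_saBAja_curAxiH+MISSING:"
)

# Assumption: index 2 holds verbwithoutanubandha (inferred, not documented here).
VERB_WITHOUT_ANUBANDHA = 2


def replace_field(line, index, new_value):
    """Return the colon-separated line with one field replaced."""
    fields = line.split(":")
    fields[index] = new_value
    return ":".join(fields)


fields = record.split(":")
print("current value:", fields[VERB_WITHOUT_ANUBANDHA])

corrected = replace_field(record, VERB_WITHOUT_ANUBANDHA, "saBAja")
print("corrected value:", corrected.split(":")[VERB_WITHOUT_ANUBANDHA])
```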

drdhaval2785 commented 8 years ago

@funderburkjim

I would propose the following. You work with the current version of generatedforms and note down the errors. Once you have completed one round of comparison of pysan with SanskritVerb, I will regenerate. When I regenerate, or modify the SanskritVerb algorithm or database, other verbs may be affected. It is better to regenerate generatedforms.xml only after one round of corrections has been incorporated. So let us treat the current generatedforms.xml as version 1. Once all your corrections (based on the current version) are incorporated in SanskritVerb / pysan, I will generate version 2.

Then you rerun your comparison statistics and we start the game again. E.g. the 'Apa' issue: right now only a rare form in SanskritVerb has this appendage. Once I change the algorithm, there would be a lot more new 'Apa' ending forms like 'kaTApayati' etc. Whether this would be useful or counterproductive, only time will tell. Right now, I tend to keep generatedforms stationary and make changes only in $verbdata. Once both of us are satisfied that the first round of corrections has been made, I will rerun and generate version 2.

drdhaval2785 commented 8 years ago

"saBAja rather than samAja": Done.

And one piece of good news. With PHP7 in an Ubuntu environment, each verb takes roughly 1 sec (all 10 tenses / moods). So the time needed to generate generatedforms.xml has come down from 1 day to 1 hour. Now it is much more amenable to changes in $verbdata.

funderburkjim commented 8 years ago

@drdhaval2785 From your comments above, I understand that you are reluctant to keep the generated forms always in sync with verbdata.

However, I think it would simplify the comparison process for these to be in sync.

One suggestion would be for me to do interim regenerations of forms (when you've made changes to verbdata). I could do this on a local branch, which would have no impact on the GitHub SanskritVerb.

I looked in the 'scripts' folder of SanskritVerb, but was not sure how to recreate generatedforms.xml.

So the process would be:

I sync with SanskritVerb (to get new copies containing your adjustments to function.php (verbdata, etc.))
I run a script locally to remake generatedforms.xml (in a separate branch)
I rerun my comparison stuff in elispsanskrit
etc.

drdhaval2785 commented 8 years ago

I generated new forms and will upload them tomorrow. As I noted, the time taken was large earlier; now it is quite small. So I will keep it in sync, let's say weekly.

funderburkjim commented 8 years ago

How do you generate new forms? Inquiring minds want to know :)

drdhaval2785 commented 8 years ago

sh wrongformfinder.sh

This generates two files generatedforms.xml and suspectforms.txt

The suspectforms.txt file is the file where fishy forms are stored.
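Putting this together with the local-branch process proposed earlier in the thread, a rough sketch of the regeneration steps might look like the following. The branch name, the repo_dir value, and the assumption that wrongformfinder.sh is run from that directory are all hypothetical; only the command sh wrongformfinder.sh comes from this thread. The sketch defaults to a dry run that just prints the commands.

```python
import subprocess


def regeneration_steps(branch="local-regen"):
    """Commands for one local regeneration pass (branch name is hypothetical)."""
    return [
        ["git", "pull"],                    # sync with SanskritVerb
        ["git", "checkout", "-B", branch],  # work on a separate local branch
        ["sh", "wrongformfinder.sh"],       # writes generatedforms.xml and suspectforms.txt
    ]


def run_steps(steps, repo_dir, dry_run=True):
    """Print each command; execute it in repo_dir only when dry_run is False."""
    for cmd in steps:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, cwd=repo_dir, check=True)


# Assumption: repo_dir is wherever wrongformfinder.sh lives in a local clone.
run_steps(regeneration_steps(), repo_dir="SanskritVerb")
```

Calling run_steps with dry_run=False would actually execute the commands; the comparison step in elispsanskrit is left out because its entry point is not named here.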

drdhaval2785 commented 8 years ago

Now there is a script, verblistredo.sh (run with sh verblistredo.sh), which regenerates all the verb lists based on changes to $verbdata in the verbdata.php file.

drdhaval2785 / SanskritVerb

Differences in verbs from $verbdata and generatedforms.xml #961