duplicates in verbdata.txt

funderburkjim commented 8 years ago

In preparation for comparing the pysan conjugation algorithms with those of SanskritVerb, I analyzed the $verbdata element of the SanskritVerb program's function.php.

This data is extracted as file verbdata.txt.

I checked verbdata for duplicates in two ways.

In both cases, the significance (if any) of these duplicates is unclear to me.

funderburkjim commented 8 years ago

duplicate verb without anubandha

Each of the verbdata.txt records has the root spelled in two ways:

with anubandha(s)
without anubandha(s)

Refer to verbdata_dupnorm.txt.

It was found that in 19 cases, a given spelling-with-anubandha was associated with more than one spelling-without-anubandha.

These duplicates are an obstacle to matching the records of verbdata.txt with the roots listed in dictionaries such as that of Monier-Williams.
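A minimal sketch of this check, for illustration only. It assumes that each record is colon-separated, with the spelling-with-anubandha in field 0 and the spelling-without-anubandha in field 2; that layout is inferred from the records quoted later in this thread, and the sample lines are those records cut to their first seven fields. The function name `find_dupnorm` is hypothetical and is not the script that produced verbdata_dupnorm.txt.

```python
from collections import defaultdict


def find_dupnorm(lines):
    """Return roots-with-anubandha that map to more than one spelling without anubandha."""
    norms = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(":")
        # Assumed positions: 0 = root with anubandha, 2 = root without anubandha.
        root, norm = fields[0], fields[2]
        norms[root].add(norm)
    return {root: sorted(ns) for root, ns in norms.items() if len(ns) > 1}


sample = [
    "DUpa!:Dupa!' BAzArTaH:Dup:10:0303:u:sew",
    "DUpa!:BAzArTaH:DUp:10:0303:u:sew",
    "vraRa!:SabdArTaH:vraR:01:0519:pa:sew",
]
print(find_dupnorm(sample))
```

Only DUpa! is reported here, because it is the only sample root paired with two different spellings (Dup and DUp).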

funderburkjim commented 8 years ago

duplicate `sutra` numbers

Refer to verbdata_dupsutra.txt.

Sutra number in this discussion means an amalgam of the gana (conjugation class) and (sequence) number of verbdata.txt records. (This file identifies the fields that comprise a record of verbdata.txt.)

There were 43 cases where a given sutra number appears in more than one record of verbdata.txt.

I presume that some set of parameters derived from the fields of verbdata.txt should identify a particular entity which we call a root. A priori, I expected that the gana-sequence number (sutra number) would be such a parameter, but the presence of duplicates shows that it is not. However, the fact that the number of duplicates (43) is quite small (2% of 2213 records of verbdata.txt) indicates that the sutra number is almost an identifier.
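The duplicate count and the 2% figure can be reproduced along these lines. This is a sketch under the assumption that the gana is field 3 and the sequence number is field 4 of a colon-separated record (inferred from records quoted later in this thread, shown here cut to seven fields); `find_dupsutra` is a hypothetical name, not the script behind verbdata_dupsutra.txt.

```python
from collections import Counter


def find_dupsutra(lines):
    """Count records per gana.number and keep the numbers that occur more than once."""
    counts = Counter()
    for line in lines:
        fields = line.strip().split(":")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        gana, number = fields[3], fields[4]  # assumed field positions
        counts[f"{gana}.{number}"] += 1
    return {key: n for key, n in counts.items() if n > 1}


sample = [
    "granTa!:banDane:granT:10:0362:u:sew",
    "granTa!:sandarBe:granT:10:0362:u:sew",
    "vraRa!:SabdArTaH:vraR:01:0519:pa:sew",
]
dups = find_dupsutra(sample)
print(dups)

# Share of duplicated sutra numbers, using the counts reported above.
share = 100 * 43 / 2213
print(f"{share:.1f}% of records")
```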

On the other hand, the verb-with-anubandha is also not a unique identifier of the cases of verbdata.txt.

Is there a generally accepted dhatu identifier? Is this identifier present in verbdata.txt?

drdhaval2785 commented 8 years ago

Good to see $verbdata being cleaned up. I did some cleanup manually as and when I came across errors, but a systematic study like this will definitely clean it up in a big way.

If you do any additional analysis, keep me posted. I will make the corrections in my data as well.

drdhaval2785 commented 8 years ago

It was found that in 19 cases, a given spelling-with-anubandha was associated with more than one spelling-without-anubandha.

The majority of them seem to be errors. I will keep you posted when I correct these entries in function.php. You can regenerate later on.

drdhaval2785 commented 8 years ago

Is there a generally accepted dhatu identifier?

From my experience, it would be gana,sutranumber,pada,iDAgama,meaning. The meaning is included because there are places where different commentators assign a different meaning to the same verb. If we cannot have a separate verb entry with a separate verb number, the sutranumber may be identical while the meaning differs.
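As an illustration of such a composite identifier, here is a sketch that builds the five-part key from one record. The field positions (meaning in field 1, gana in 3, number in 4, pada in 5, iDAgama in 6) are assumptions inferred from records quoted later in this thread, and `DhatuKey` and `dhatu_key` are hypothetical names.

```python
from typing import NamedTuple


class DhatuKey(NamedTuple):
    gana: str
    number: str
    pada: str
    idagama: str
    meaning: str


def dhatu_key(record: str) -> DhatuKey:
    """Build the gana,sutranumber,pada,iDAgama,meaning key from a colon-separated record."""
    f = record.split(":")
    # Assumed positions: 1 = meaning, 3 = gana, 4 = number, 5 = pada, 6 = set/anit.
    return DhatuKey(gana=f[3], number=f[4], pada=f[5], idagama=f[6], meaning=f[1])


a = dhatu_key("granTa!:banDane:granT:10:0362:u:sew")
b = dhatu_key("granTa!:sandarBe:granT:10:0362:u:sew")
print(a)
print(a == b)          # the full keys differ
print(a[:4] == b[:4])  # the first four parts agree; only the meaning differs
```

With the meaning in the key, two records that share a gana and number but carry different meanings remain distinct entries.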

There can be genuine tagging errors of the database maker also. Need to examine these entries individually.

Is this identifier present in verbdata.txt?

Yes, the identifier seems to be present in verbdata.txt.

drdhaval2785 commented 8 years ago

https://github.com/funderburkjim/elispsanskrit/issues/32#issuecomment-243000152
Corrections started. Changes are noted here.
For your reference, the base dhAtupATha on which Mihail has based his numbers seems to be the following: dhatupatha_svara.pdf
The majority of root numbers tally with this.

YimidA!:Bid,mid:Changed to mid. Bid was error.
kaWi!:utkaRW,kaRW:Removed ut. It was upasarga.
o!laqi!:olaRq,laRq:Removed o.
vella!:vell,vehl:Separate verbs vella! and vehla!. Correction to vehla!
pasi!:paMS,paMs:Separate verbs pasi! and paSi!. Correction to paSi!
barha!:barh,varh:Separate verbs barha! and varha!. Correction to varha!
bfhi!:bfMh,vfMh:Separate verbs bfhi! and vfhi!. Correction to vfhi!
DUpa!:Dup,DUp:Tricky. There are two verbs on the same number. Alternate forms. See image. Right now changing to Dupa!. Will have to take a call.
vehf!:beh,veh:Same as above. Changed to behf!
DU:Du,DU:Same as above. Changed to Du.
mana!:man,mAn:Changed to mAna!
IKi!:IK,INK:Changed to IKa!
pelf!:pall,pel:Changed to palla!
taqa!:taq,taRq:Changed to taq
vasa!:vas,vasa:Changed to vas
aqqa!:aqq,adq:Changed to adqa!
visa!:bis,vis:Changed to vis
bisa!:bis,vis:Changed to bis
mAna!:man,mAn:Changed to mAn

drdhaval2785 commented 8 years ago

The 19 entries of 'duplicate verb without anubandha' are corrected in $verbdata now. https://github.com/funderburkjim/elispsanskrit/issues/32#issuecomment-243000513 is pending. @funderburkjim will you please regenerate the statistics after this first round of corrections?

I am sure some increase will be seen in the 'duplicate sutra number' lot after the first round of corrections.

funderburkjim commented 8 years ago

@drdhaval2785 Regenerated

verbdata.txt (from revised SanskritVerb/scripts/function.php)
verbdata_dupnorm.txt. Now has 2 cases, 17 removed
verbdata_dupsutra.txt. Same number (42). But some revision of data for these, per the verbdata changes.

funderburkjim commented 8 years ago

dhatupatha_svara.pdf

The link to this is a new form to me for GitHub.
https://github.com/funderburkjim/elispsanskrit/files/448794/dhatupatha_svara.pdf When I clicked, it downloaded the file. This must be some GitHub service. Is there a link on how to use the 'files' service?

Dhatupathas are generally associated with some scholar's name, as I understand it. For instance, there are the mADavIyaDAtupAWa and the Westergaard Dhatupatha, and probably many others associated with Sanskrit scholars both modern and from antiquity. To which scholar do we attribute dhatupatha_svara.pdf?

drdhaval2785 commented 8 years ago

When I clicked, it downloaded the file. This must be some GitHub service. Is there a link on how to use the 'files' service?

Drag and drop in the issue text box. Nothing further.

To which scholar do we attribute dhatupatha_svara.pdf?

I seriously do not know. It is available on sanskritdocuments.org I guess. No metadata in the file.

gasyoun commented 8 years ago

To which scholar do we attribute dhatupatha_svara.pdf?

When I last met Mihas in Moscow, he told me that there were 3 sources. The main source is Katre (https://yadi.sk/i/4kO_OF81uhGer and https://yadi.sk/i/klN3jLERuhGh9); the other two, used for reference, I do not remember, but one could ask Mihas by mail. We are no longer in contact, as he is on Ukraine's side (being in Belarus) and I am on Russia's.

funderburkjim commented 8 years ago

So dhatupatha_svara.pdf was produced by 'Mihas'?

Sad about the Ukraine issue.

funderburkjim commented 8 years ago

Here are four cases that might need correction in verbdata (from sanverb_cp_log.txt):

case 1 of duplicate verbdata key: vraRa!.01.0519.P
   vraRa!:SabdArTaH:vraR:01:0519:pa:sew:व्र॑णँ॑:277:290:293:vraN1_vraNaz_BvAxiH+SabxArWaH:
   vraRa!:SabdArTaH:vraR:01:0519:pa:sew:व्र॑णँ॑:277:290:293:vraN1_vraNaz_BvAxiH+SabxArWaH:
case 2 of duplicate verbdata key: kaWi!.10.0385.U
   kaWi!:Soke prAyeRotpUrva utkaRWAvacanaH:kaRW:10:0385:u:sew:क॑ठिँ॑:1362:1378:1415:kaNT2_kaTiz_curAxiH+Soke:
   kaWi!:Soke prAyeRotpUrva utkaRWAvacanaH:kaRW:10:0385:u:sew:क॑ठिँ॑:1362:1378:1415:kaNT2_kaTiz_curAxiH+Soke:
case 3 of duplicate verbdata key: DUpa!.10.0303.U
   DUpa!:Dupa!' BAzArTaH:Dup:10:0303:u:sew:धू॑पँ॑:1321::1374:XUp2_XUpaz_curAxiH+BARArWaH:
   DUpa!:BAzArTaH:DUp:10:0303:u:sew:धू॑पँ॑:1321::1374:XUp2_XUpaz_curAxiH+BARArWaH:
case 4 of duplicate verbdata key: granTa!.10.0362.U
   granTa!:banDane:granT:10:0362:u:sew:ग्र॑न्थँ॑:1342,1353:1368:1395,1406:granW3_granWaz_curAxiH+sanxarBe:261
   granTa!:sandarBe:granT:10:0362:u:sew:ग्र॑न्थँ॑:1342,1353:1368:1395,1406:granW3_gran
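The four cases are of two kinds: records that are identical copies, and records that share a key but differ in some field. A rough sketch of how they could be told apart follows. The key format root.gana.number.PADA is modeled on the log lines above; deriving the last part by upper-casing the first letter of the pada field is an assumption, as are the field positions. The sample records are cut to their first seven fields, and `classify_duplicates` is a hypothetical name, not the code that wrote sanverb_cp_log.txt.

```python
from collections import defaultdict


def classify_duplicates(records):
    """For each key seen more than once, list the field indexes in which its records differ."""
    groups = defaultdict(list)
    for rec in records:
        f = rec.strip().split(":")
        # Assumed key: root.gana.number.PADA, e.g. vraRa!.01.0519.P
        key = f"{f[0]}.{f[3]}.{f[4]}.{f[5][:1].upper()}"
        groups[key].append(f)
    report = {}
    for key, rows in groups.items():
        if len(rows) < 2:
            continue
        width = min(len(r) for r in rows)
        # An empty list means the records are identical in every compared field.
        report[key] = [i for i in range(width) if len({r[i] for r in rows}) > 1]
    return report


VRARA = "vraRa!:SabdArTaH:vraR:01:0519:pa:sew"
sample = [
    VRARA,
    VRARA,
    "DUpa!:Dupa!' BAzArTaH:Dup:10:0303:u:sew",
    "DUpa!:BAzArTaH:DUp:10:0303:u:sew",
]
print(classify_duplicates(sample))
```

An empty list marks a plain duplicate that can simply be dropped, while a non-empty list points at the fields that need a decision (for DUpa!, the meaning and the spelling without anubandha).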

drdhaval2785 commented 8 years ago

vraRa! and kaWi!

Duplicates - removed one entry.

DUpa!

This is typical. There are two verbs under the same number. There are some such cases. I propose to number it 10.0303a. @funderburkjim what is your take?

granTa!

granTa! banDane is 10.0362. granTa! sandarBe is 10.0375. Corrected in function.php

funderburkjim commented 8 years ago

Re: Drag and drop in the issue text box. Nothing further.

Thanks. Useful idea.

funderburkjim commented 8 years ago

Regarding the DUpa! case, where a given sutra number has two root forms:

DUpa!:Dupa!' BAzArTaH:Dup:10:0303:u:sew:धू॑पँ॑:1321::1374:XUp2_XUpaz_curAxiH+BARArWaH:
DUpa!:BAzArTaH:DUp:10:0303:u:sew:धू॑पँ॑:1321::1374:XUp2_XUpaz_curAxiH+BARArWaH:

Maybe change these to

Dupa!:BAzArTaH:Dup:10:0303:u:sew:धु॑पँ॑:1321::1374:XUp2_XUpaz_curAxiH+BARArWaH:
DUpa!:BAzArTaH:DUp:10:0303:u:sew:धू॑पँ॑:1321::1374:XUp2_XUpaz_curAxiH+BARArWaH:

I would hold off on distinguishing these further by using 10:0303a for one of them, since 10.0303 is probably a key into a printing of the dhatupatha, and adding an 'a' would confuse the construction of this key. There is also the fact that 0303 is a sequence number.

funderburkjim commented 8 years ago

Further comment/question re DUpa!

In looking at dhatupatha_svara.pdf, there are many cases like 10:0303, in the sense of having the form

gana.number root (root1)

10:0305  cIva! (cIba!)
cIva!:BAzArTaH:cIv:10:0305:u:sew:ची॑वँ॑:1321::1374:cIv2_cIvaz_curAxiH+BARArWaH:

01:0105  zvaska! (zvazka!)
zvaska!:gatyarTaH:svazk:01:0105:A:sew:ष्व॑स्कँ॒:::::

(Numerous other examples)

So, if DUpa! were handled in verbdata like those other two instances, then there would be only ONE record for it in verbdata.
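To make the formal comparison concrete, entries written in the form shown above could be parsed as follows. This is only a sketch: the pattern matches the notation used in this comment (gana:number, root, optional alternate in parentheses), not necessarily the layout of dhatupatha_svara.pdf itself, and `parse_entry` is a hypothetical name.

```python
import re

# Two-digit gana, four-digit number, root, optional alternate root in parentheses.
ENTRY = re.compile(r"^(\d{2}):(\d{4})\s+(\S+)(?:\s+\((\S+)\))?$")


def parse_entry(line):
    """Split an entry such as '10:0305  cIva! (cIba!)' into its parts."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    gana, number, root, alternate = m.groups()
    return {"gana": gana, "number": number, "root": root, "alternate": alternate}


print(parse_entry("10:0305  cIva! (cIba!)"))
print(parse_entry("01:0105  zvaska! (zvazka!)"))
```

Keeping the alternate form as an attribute of a single entry mirrors how cIva! and zvaska! each have just one record in verbdata.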

This is just an observation regarding some formal comparisons. I don't know the significance of all the pieces, so I do not have a definite opinion.

drdhaval2785 commented 8 years ago

@funderburkjim,

It turns out that there are many such cases in dhatupatha_svara.pdf, and not all of them were given separate headword status; e.g., there are no zvazka! or cIba! verbs in the database. So it is best to remove Dup from the database for consistency.

drdhaval2785 commented 8 years ago

Regenerated the data.
