How to redo the algo on another list?

gasyoun commented 3 years ago

Dear @funderburkjim, I would like to try your scripts on other word lists, but I do not understand where to start. Where do I put the input list, and where do I launch the redo script? Please advise. Here is the top of the list I want to see split according to your rules:

saṁbhūyasamutthāna
jalaukāvacāraṇīya
kṛcchrātikṛcchra
pañcacatvāriṁśat
pratyutpannamati
sāmānādhikaraṇya
tattvākhyānopamā
saptacatvāriṁśat
pratyabhiprasthā
bhasmaśuddhikara
aṣṭacatvāriṁśat
prāthamakalpika
asaṁbhāvitopamā
citraśravastama
pratyavarodhana
samādhānarūpaka
caritrabandhaka
prajñāpāramitā
sīmantonnayana

13039-words-to-be-split.txt

funderburkjim commented 3 years ago

Sorry for the delay in responding.

I don't think my scripts will be applicable to a given list of words.

The logic depends HEAVILY on:

the sequence of headwords in MW,
their 'H1/2/3/4' relations,
the 'key2' splittings with dashes and @.

So I don't have a general 'samAsa' splitter.

funderburkjim commented 3 years ago

I looked again at two of your examples: caritrabandhaka and samādhānarūpaka.

They are already headwords of MW!

So one splitting should already be present in 'key2': cari/tra—banDaka and sam-ADAna—rUpaka.

That's one kind of solution.
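A minimal sketch of that lookup approach, under stated assumptions: it takes a hypothetical in-memory table mapping each MW headword to its 'key2' string (how that table is loaded from the MW data is not shown), assumes that '/' is only an accent mark and that '-' and '—' both mark component boundaries, and does not handle the '@' marker. The function names are illustrative and are not part of the MWderivations scripts. The input words are assumed to be in the same transliteration as the table keys.

```python
import re


def key2_parts(key2):
    """Split a key2-style string into the components it marks.

    Assumption: '/' is an accent mark to be dropped, and both '-' and '—'
    separate components. The '@' marker is left untouched.
    """
    cleaned = key2.replace("/", "")
    return [part for part in re.split(r"[-—]", cleaned) if part]


def split_known_headwords(words, key2_by_headword):
    """Look up each word; return its key2 components, or None if it is not a headword."""
    result = {}
    for word in words:
        key2 = key2_by_headword.get(word)
        result[word] = key2_parts(key2) if key2 is not None else None
    return result


# Toy table holding only the two key2 values quoted in this thread.
toy_table = {
    "caritrabanDaka": "cari/tra—banDaka",
    "samADAnarUpaka": "sam-ADAna—rUpaka",
}

splits = split_known_headwords(
    ["caritrabanDaka", "samADAnarUpaka", "jambukAma"], toy_table
)
for word, parts in splits.items():
    print(word, parts)
```

Words that come back as None are the ones that would still need a real splitter.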

funderburkjim commented 3 years ago

Perhaps we can discuss other possibilities further.
It should be possible to take a list of 'padas' (such as all MW headwords which are substantives), and have a program that, for a given word X, finds all pada-sequences P1,...,Pn such that X = P1 + P2 + ... + Pn (where '+' means either concatenation or sandhi-joining).

With such a program, the samAsa analysis would not be limited just to samAsas that happen to occur as MW headwords.

For example, jambukAma (not an MW headword) could be analyzed as jambu + kAma (wish for an apple).
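The idea of finding every pada sequence P1,...,Pn for a word X can be sketched as a small recursive search. This is only an illustration, not an existing script: the pada set and the sandhi table below are toy assumptions, the table covers just two vowel-fusion cases, and words are written in SLP1. Each sandhi entry maps one surface letter to the (end of left pada, start of right pada) pairs that could have produced it.

```python
def segmentations(word, padas, sandhi):
    """Return every list of padas that joins to `word`, either by plain
    concatenation or by one of the single-letter sandhi fusions in `sandhi`."""
    results = []

    def walk(rest, parts):
        if not rest:
            results.append(parts)
            return
        # Case 1: plain concatenation, the next pada is a literal prefix.
        for i in range(1, len(rest) + 1):
            head = rest[:i]
            if head in padas:
                walk(rest[i:], parts + [head])
        # Case 2: the letter at position i is a fusion of two sounds.
        # Starting at i = 1 guarantees the remainder gets shorter each step.
        for i in range(1, len(rest)):
            for left_end, right_start in sandhi.get(rest[i], ()):
                left = rest[:i] + left_end
                if left in padas:
                    walk(right_start + rest[i + 1:], parts + [left])

    walk(word, [])
    return results


# Toy data (assumptions for illustration only).
toy_padas = {"jambu", "kAma", "tattva", "AKyAna", "upamA"}
toy_sandhi = {
    "A": [("a", "a"), ("a", "A"), ("A", "a"), ("A", "A")],
    "o": [("a", "u"), ("a", "U"), ("A", "u"), ("A", "U")],
}

print(segmentations("jambukAma", toy_padas, toy_sandhi))
print(segmentations("tattvAKyAnopamA", toy_padas, toy_sandhi))
```

With a full headword list and a realistic sandhi table, a search like this will return many candidate sequences for a long word, so some ranking or filtering step would be needed on top of it.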

gasyoun commented 3 years ago

> they are already headwords of MW!

All the better. DSC has them unsplit, although it was all based on MW initially.

> That's one kind of solution.

It would be, if you could tell me how to feed in the list. We are analyzing the structure of the dictionary together with the frequency of samāsas; everything else is done, but these 4000 unsplit long words are still waiting.

> Perhaps we can discuss other possibilities further.

Please take a look at https://github.com/kmadathil/sanskrit_parser/issues/164#issuecomment-805455039 and https://kmadathil.github.io/sanskrit_parser/build/html/sanskrit_parser_doc.html#algorithm-for-sandhi-split - a comment from you there might bring a fresh perspective.

> With such a program, the samAsa analysis would not be limited just to samAsas that happen to occur as MW headwords.

That would be more than enough, because in real-life batch processing we have never come across a word that is not in MW.

kmadathil commented 3 years ago

> It should be possible to take a list of 'padas' (such as all MW headwords which are substantives), and have a program that, for a given word X, finds all pada-sequences P1,...,Pn such that X = P1 + P2 + ... + Pn (where '+' means either concatenation or sandhi-joining).

Bingo - that's exactly what sanskrit_parser does. We take in a phrase or sentence, not just a word, and split at all pada boundaries (sandhi or samasa), and if you wish, parse into a grammatical dependency graph.

gasyoun commented 3 years ago

> Bingo - that's exactly what sanskrit_parser does.

In that case, why return 10 results if only 1-2 should be possible? Over-generation remains an issue, @kmadathil.

kmadathil commented 3 years ago

Please point out any overgeneration on that project's issues list. Only legitimate pada sequences are generated.

gasyoun commented 3 years ago

Dhaval's script, run on MD, splits 10% of the words into single letters (overgeneration), which is rather strange: https://github.com/drdhaval2785/samasasplitter/issues/9

gasyoun commented 2 years ago

> With such a program, the samAsa analysis would not be limited just to samAsas that happen to occur as MW headwords.

Can we have another take on this? It remains in my top 5. I'm stuck.

funderburkjim / MWderivations

How to redo the algo on another list? #14