Open gasyoun opened 3 years ago
Sorry for the delay in responding.
I don't think my scripts will be applicable to a given list of words.
The logic depends HEAVILY on
So I don't have a general 'samAsa' splitter.
I looked again at two of your examples: caritrabandhaka and samādhānarūpaka.
They are already headwords of MW!
So one splitting should be already present in 'key2': cari/tra—banDaka and sam-ADAna—rUpaka
That's one kind of solution.
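For illustration only, here is a minimal Python sketch of reading the compound members off a 'key2' value. It assumes, based solely on the two examples above, that the long dash separates compound members, that '-' marks a lower-level division inside a member, and that '/' is an accent mark to be dropped. The function name `split_key2` is hypothetical and is not part of any existing script.

```python
from typing import List

# Assumption: the long dash (U+2014) separates compound members in key2.
MEMBER_SEP = "\u2014"


def split_key2(key2: str) -> List[str]:
    """Return the top-level compound members of a key2 string."""
    # Assumption: '/' is an accent mark, so it is removed before splitting.
    cleaned = key2.replace("/", "")
    members = cleaned.split(MEMBER_SEP)
    # Assumption: '-' marks a division inside a member (e.g. prefix + stem),
    # so it is removed to recover the plain member.
    return [m.replace("-", "") for m in members if m]


if __name__ == "__main__":
    for k in ("cari/tra\u2014banDaka", "sam-ADAna\u2014rUpaka"):
        print(k, "->", " + ".join(split_key2(k)))
```

This only helps for words that are MW headwords and whose key2 actually carries the division.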
Perhaps we can discuss other possibilities further.
It should be possible to take a list of 'padas' (such as all MW headwords which are substantives), and have a program that, for a given word X, finds all pada-sequences P1,...,Pn
such that X = P1 + P2 + ... + Pn. (where '+' means either concatenation or sandhi-joining).
With such a program, the samAsa analysis would not be limited just to samAsas that happen to occur as MW headwords.
For example, jambukAma (not an MW headword) could be analyzed as jambu + kAma (wish for an apple).
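A rough Python sketch of that idea, not an existing script: it enumerates every pada sequence whose members join, by plain concatenation or by a sandhi rule, to give X. The pada set and the sandhi table are toy placeholders (a handful of SLP1 vowel-sandhi rules); a real run would load the MW substantive headwords and a much fuller rule set. The names `PADAS`, `SANDHI` and `split_padas` are hypothetical.

```python
# Hypothetical toy pada list (SLP1); a real run would load MW headwords.
PADAS = {"jambu", "kAma", "deva", "indra"}

# Toy sandhi table: surface character -> possible
# (final of left pada, initial of right pada). Far from complete.
SANDHI = {
    "A": [("a", "a"), ("a", "A"), ("A", "a"), ("A", "A")],
    "e": [("a", "i"), ("a", "I"), ("A", "i"), ("A", "I")],
    "o": [("a", "u"), ("a", "U"), ("A", "u"), ("A", "U")],
}


def split_padas(word, padas=PADAS):
    """Return all pada sequences P1..Pn that join to give `word`."""
    results = []

    def walk(rest, acc):
        # The whole remainder is itself a pada: one complete analysis.
        if rest in padas:
            results.append(acc + [rest])
        for i in range(1, len(rest)):
            # Case 1: plain concatenation, cut before position i.
            if rest[:i] in padas:
                walk(rest[i:], acc + [rest[:i]])
            # Case 2: the character at position i is the result of sandhi;
            # undo it by restoring the left final and the right initial.
            for left_end, right_start in SANDHI.get(rest[i], []):
                left = rest[:i] + left_end
                if left in padas:
                    walk(right_start + rest[i + 1:], acc + [left])

    walk(word, [])
    return results


if __name__ == "__main__":
    print(split_padas("jambukAma"))  # concatenation
    print(split_padas("devendra"))   # a + i -> e
```

With a large pada list this returns many candidate sequences for one word, so some ranking or filtering step would be needed on top; the search is also exponential in the worst case.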
they are already headwords of MW!
All the better. DSC has them unsplit, although it was all based on MW initially.
That's one kind of solution.
It would be, if you could tell me how to feed in the list. We are analyzing the structure of the dictionary together with the frequency of samāsas, but these 4000 unsplit long words are still waiting.
Perhaps we can discuss other possibilities further.
Please take a look at https://github.com/kmadathil/sanskrit_parser/issues/164#issuecomment-805455039 and https://kmadathil.github.io/sanskrit_parser/build/html/sanskrit_parser_doc.html#algorithm-for-sandhi-split - a comment from you there might bring a fresh perspective.
With such a program, the samAsa analysis would not be limited just to samAsas that happen to occur as MW headwords.
That would be more than enough, because in real-life batch processing we have never come across a word that is not in MW.
It should be possible to take a list of 'padas' (such as all MW headwords which are substantives), and have a program that, for a given word X, finds all pada-sequences P1,...,Pn such that X = P1 + P2 + ... + Pn. (where '+' means either concatenation or sandhi-joining).
Bingo - that's exactly what sanskrit_parser does. We take in a phrase or sentence, not just a word, and split at all pada boundaries (sandhi or samasa), and if you wish, parse into a grammatical dependency graph.
Bingo - that's exactly what sanskrit_parser does.
In that case, why are there 10 results if only 1-2 should be possible? Over-generation remains an issue, @kmadathil.
Please point out any overgeneration on that project's issues list. Only legitimate pada sequences are generated.
Dhaval's script with MD splits 10% of the words into single letters (overgeneration), which is rather strange: https://github.com/drdhaval2785/samasasplitter/issues/9
With such a program, the samAsa analysis would not be limited just to samAsas that happen to occur as MW headwords.
Can we have another take on this? It remains in my top 5. I'm stuck.
Dear @funderburkjim, I want to try your scripts on several other lists, but I do not understand where to start. Where should I put the input list, and from where do I launch the redo script? Please advise. Here is the top of the list I want to see split according to your rules:
13039-words-to-be-split.txt