chchch / upama

A PHP library for comparing two or more Sanskrit TEI XML files and generating an apparatus with variants
GNU General Public License v2.0

Last character of the tree display is getting truncated #3

Closed wujastyk closed 4 years ago

wujastyk commented 4 years ago

The last character of the tree display is getting truncated. See the reading of the R manuscript here (dadasva me). The reading is correctly shown in the apparatus.

This doesn't always happen, but I haven't worked out the trigger condition.

chchch commented 4 years ago

Hmm... this one is slightly complicated. It occurs when you have a daṇḍa attached to the end of a word (e.g., "varānane|"). The sequence alignment algorithm (Needleman-Wunsch) then attempts to align "varānane|" and "dadasva me". It gets aligned like this:

varān-ane|
dadasva me

So if you don't include the daṇḍa in your highlighted selection, the final "e" of "me" gets chopped off too, since that "e" is aligned with the daṇḍa. I tried reducing the gap penalty, which gives the correct alignment:

varān-an-e|
dadasva me-
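To make the role of the gap penalty concrete, here is a minimal Needleman-Wunsch sketch in Python. It is not upama's implementation: the function name and the `match`, `mismatch` and `gap` values are illustrative assumptions, and ties in the traceback are broken arbitrarily (diagonal first), so the exact alignments it prints may differ from the ones shown above.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b.

    Returns (aligned_a, aligned_b, score), with "-" marking a gap.
    All scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


if __name__ == "__main__":
    # Compare a larger and a smaller gap penalty on the readings from this issue.
    for gap in (-2, -1):
        top, bottom, total = needleman_wunsch("varānane|", "dadasva me", gap=gap)
        print(f"gap={gap} score={total}")
        print(top)
        print(bottom)
```

A weaker (less negative) gap penalty makes gaps cheaper relative to mismatches, so the algorithm is more willing to insert them to line up matching characters such as the two final "e"s.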

But I think it might mess up other alignments... usually you want the gap penalty to be pretty big. This case is also particularly difficult because "varānane" and "dadasva me" are so different that I don't know if alignment really makes sense.

Anyway, there are a few options to fix this, and I'm trying to figure out what would be best. The easiest thing would be to ignore certain characters (like daṇḍas, and maybe even spaces?). Another improvement might be to consider a consonant conjunct as a single character; then "sv" could align with "n", which would allow the two "e"s to be aligned at the end without needing to add two extra gaps. But ultimately, it would be great to have a scoring matrix to tell the algorithm what kinds of substitutions should score higher (e.g., "v" and "b"). I think we need more data before we can come up with a good scoring matrix though.
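As a rough illustration of the last two ideas, the Python sketch below groups runs of consonants in IAST-transliterated text into single tokens and scores a "v"/"b" substitution more leniently than an arbitrary mismatch. Everything here is hypothetical: the names `tokenize` and `substitution_score`, the vowel list, and the 0.5 score are assumptions for illustration, not anything upama defines, and aspirates, diphthongs and similar details are not treated specially.

```python
import unicodedata
from itertools import groupby

# Letters NOT treated as consonants: vowels plus anusvāra and visarga.
# (Assumed list for IAST input; normalised so precomposed forms compare equal.)
NON_CONSONANTS = set(unicodedata.normalize("NFC", "aāiīuūṛṝḷeoṃḥ"))

# Hypothetical table of "close" substitutions; the value is a placeholder,
# not derived from any data.
SIMILAR = {frozenset(("v", "b")): 0.5}


def is_consonant(ch):
    return ch.isalpha() and ch not in NON_CONSONANTS


def tokenize(text):
    """Split text into tokens, keeping each consonant cluster as one token."""
    text = unicodedata.normalize("NFC", text)
    tokens = []
    for cluster, group in groupby(text, key=is_consonant):
        chars = list(group)
        if cluster:
            tokens.append("".join(chars))  # e.g. "s" + "v" becomes "sv"
        else:
            tokens.extend(chars)           # vowels, spaces, punctuation stay single
    return tokens


def substitution_score(x, y, match=1.0, mismatch=-1.0):
    """Score aligning token x with token y, favouring listed similar pairs."""
    if x == y:
        return match
    return SIMILAR.get(frozenset((x, y)), mismatch)


if __name__ == "__main__":
    print(tokenize("varānane"))
    print(tokenize("dadasva me"))
    print(substitution_score("v", "b"), substitution_score("v", "k"))
```

With this tokenization "sv" occupies one position, so an aligner that works on tokens rather than characters can pair it with a single consonant such as "n".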

In the meantime, long story short: if you include the trailing daṇḍa in your highlight, this problem won't come up. (I usually put a space between a word and a daṇḍa, so I didn't notice this before). I'll look into refining the algorithm so that it makes more sensible alignments.

wujastyk commented 4 years ago

I think ignoring daṇḍas makes sense. They're really not integral to text transmission, in most cases anyway.

I don't have a pre-daṇḍa space because in the Sanskrit 2003 font, the design is to have the daṇḍa represented by a non-spaced forward slash: "... yuyutsavaḥ/" and I've got used to typing like that. But it's easy to change the input style on this.

Thanks for thinking about this! It's much more subtle than I had imagined.

Best for 2020!

Dominik

chchch commented 4 years ago

It should be fine to have no space between a word and a daṇḍa, since they get ignored during collation anyway (unless you specify otherwise). I've been using spaces around daṇḍas as a way to indicate how I interpret them though; if I put spaces around the daṇḍa, it means that I think it functions as a daṇḍa, but if I don't put a space, I'm indicating that I think it should be read as a scribal error or something. For example, I get things like "paryāyāḥ|s tattvam" or "niḥ|sattāsattam" where it looks like the scribe just added a daṇḍa after a visarga by default.

wujastyk commented 4 years ago

ah, I see. That's quite nice. Let me know in the longer run what usage you decide is best.

chchch commented 4 years ago

Fixed for now; the matrix and stemma views now filter out daṇḍas and other extraneous characters when they align partial lemmata. I'll work on a better collation algorithm for the future.
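The general idea of filtering before alignment can be sketched as follows in Python. This is not the actual fix in upama: the function name `filter_with_offsets` and the set of ignored characters are assumptions, since the comment above only says "daṇḍas and other extraneous characters". The sketch also keeps each surviving character's original index, which is one way a highlighted selection could be mapped back onto the unfiltered text.

```python
# Characters dropped before alignment (assumed set: ASCII stand-ins for the
# daṇḍa plus the Devanagari daṇḍa and double daṇḍa).
IGNORED = set("|/\u0964\u0965")


def filter_with_offsets(text, ignored=IGNORED):
    """Remove ignored characters from text.

    Returns (filtered_text, offsets), where offsets[k] is the index in the
    original text of the k-th character of filtered_text.
    """
    kept = [(i, ch) for i, ch in enumerate(text) if ch not in ignored]
    filtered = "".join(ch for _, ch in kept)
    offsets = [i for i, _ in kept]
    return filtered, offsets


if __name__ == "__main__":
    for reading in ("varānane|", "niḥ|sattāsattam"):
        filtered, offsets = filter_with_offsets(reading)
        print(repr(reading), "->", repr(filtered), offsets)
```

Aligning the filtered strings means a trailing daṇḍa can no longer be paired with a real character of another witness, which is what caused the truncation reported in this issue.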