byztxt / byzantine-majority-text

Byzantine Majority Greek New Testament text edited by Robinson and Pierpont, with morphological parsing tags and Strong's numbers
The Unlicense
55 stars 13 forks source link

Fix bug that made final sigmas appear as medials #6

Closed normansimonr closed 2 years ago

normansimonr commented 3 years ago

Thanks to @jcuenod it was detected that the BETA to CSV Unicode converter failed to convert final sigmas when the BETA code was immediately before a closing square bracket (],) which happens with some variants (issue #5). This is what Jude 5 looked like before this fix:

{Ν ὑμᾶς ἅπαξ τοῦτο ὅτι ὁ κύριος > [ὑμᾶσ] πάντα ὅτι [ὁ] κύριος ἅπαξ }

After this fix, this is what it looks like:

{Ν ὑμᾶς ἅπαξ τοῦτο ὅτι ὁ κύριος > [ὑμᾶς] πάντα ὅτι [ὁ] κύριος ἅπαξ }

Notice that the medial sigma in [ὑμᾶσ] was replaced by the correct form of the letter.

The strategy we used to fix this unwanted behaviour was to add a space between each word and any closing square brackets when the final letter is a sigma (an s in BETA code) before the conversion to Unicode. After the conversion is done, we remove the space.

We additionally include a commit that reformats the converter.py script with the black utility, to improve the readability and uniformity of the code.

Closes #5

normansimonr commented 3 years ago

@emg Hey Ulrik, I'm bumping this in case the notification got buried in your inbox again :joy:

normansimonr commented 2 years ago

@emg Hey Ulrik! Did you get a chance to review this PR?

countingpine commented 2 years ago

Hi. I was trying to merge #11 into this, and by sheer chance, both our patches touch Mark 16:8.

https://github.com/byztxt/byzantine-majority-text/blob/master/csv-unicode/with-variants/MR.csv#L667

Unfortunately, it seems this patch slightly breaks the formatting in this one place: [[ ""σηορτερ ενδινγ"" ]] -> [[ ""σηορτερ ενδινγ""]] - a space is lost before the ] at the end.

I think a workaround for this would be to replace ']' with '  ]' (two spaces before).

'  ]' doesn't occur anywhere in the beta code, so two spaces can be added and removed without breaking anything.

normansimonr commented 2 years ago

I can take a look at this next week and update the pull request. Thanks a lot @countingpine for your brilliant contributions!

emg commented 2 years ago

Yes, @countingpine – thanks a lot for your work. I appreciate it, too.

And thanks for your attitude, @normansimonr – thanking @countingpine and praising his contributions. That is the spirit of encouragement. I commend you for it.

Thanks, @normansimonr – I look forward to your looking into it, and I look forward to your PR.

countingpine commented 2 years ago

Thanks for your kind words everyone.

On another note, I've just realised: the English characters - "shorter ending" - probably shouldn't be translated at all!

I'm not sure of the best way to represent that in the final text. But to simply keep the English words, the best way would probably be to swap it for something unique and unchangeable, and swap it back again after conversion.

emg commented 2 years ago

Thanks for catching this, @countingpine. Excelent work:

On another note, I've just realised: the English characters - "shorter ending" - probably shouldn't be translated at all!

I'm not sure of the best way to represent that in the final text. But to simply keep the English words, the best way would probably be to swap it for something unique and unchangeable, and swap it back again after conversion.

Yes, that is one of the techniques for doing this. Both transformations must be done in code, of course, without the intervntion of human hands. I think that's what you mean, right?

Could I ask you to find out whether there are any other instances of English characters that need this kind of transformation, and document your findings here?

Then we can figure out who writes the code.

Many thanks, everyone! I am enjoying the conversation with you. And I highly appreciate your help.

countingpine commented 2 years ago

Thanks, I'm enjoying this conversation too.

I myself am only really starting to learn Greek (which is how I stumbled upon this project), but I'm happy to help where I can.

I've done a search for " marks, and they appear in three places:

It looks like the 2005 PDF keeps all three unconverted, so I would say the quotes are supposed to indicate "literal" sections of text.

I've also found these strings appearing in Revelation: _AX, _DAD, _IB, _KD, _MB, _RMD, _XC. The PDF simply converts these to Greek like everything else, so that might not need any special handling?

I can open one or two bug reports for them if that would be helpful. I think fixing either (if needed) would be too non-trivial to tag onto this Pull Request.

I think now the best way to conclude this particular PR is to fix the sole, minor regression (the removal of the space between ending"" and ]]), and merge with the main branch, and maybe open a bug report or something to track things that shouldn't be translated.

I can do the merge and the minor fix if that would be helpful, but of course this is @normansimonr's good work, and I don't want to presume to step on his toes.

normansimonr commented 2 years ago

@countingpine Awesome, great research! Yeah, feel free to update this PR. I agree that the issue about the stuff that must be left untranslated should take its own PR. BTW, we're thinking of creating a TEI-XML version of the ByzText as well (see #12.) Would you like to join us in giving it a shot?

emg commented 2 years ago

@countingpine Yes, awesome research, dude! You PE/TRA! :-). (Or perhaps, being a verb, one would say "SU\ PE/TREIS". Although I am not sure that the noun PE/TRA was ever made into an actual Greek verb in spoken parlance, though doubtless it could be done ;-) .)

Yes, I agree with @normansimonr – @countingpine, please add new issues for the things to leave untranslated. Many thanks.

I also think we should add the missing documentation :-). Things like the literal strings and the _XC abbreviations should be documented somewhere. I don't think they are documented at present. I will open an issue for this.

countingpine commented 2 years ago

Hi. I've uploaded a patch at https://github.com/countingpine/byzantine-majority-text/tree/pr6 .

It applies black to the master branch (incorporating my last patch) to bring it in line with the PR here.

It adds a fix for the ] issue to this PR.

Then it merges both, so it is a child of both master and this PR.

I don't know if it can be applied from here - maybe if @normansimonr pushes it to his master - but otherwise I can just make a new PR, and obsolete this one.

normansimonr commented 2 years ago

@countingpine Brilliant! Yeah, I guess the most hassle-free way to do it is just deprecate this one and make a new PR from your fork. Then we can review and @emg can merge afterwards.

Let me know when you open your PR to close this one.

normansimonr commented 2 years ago

Closed in favour of #14