adyeths / u2o

USFM to OSIS bible format converter.
The Unlicense

README-orefs.md - Please add something about punctuated abbreviations, etc. #45

Closed (DavidHaslam closed this issue 6 years ago)

DavidHaslam commented 6 years ago

The example config file lists book abbreviations that have no full-stop.

Some translations terminate abbreviations with a full-stop. Many do not.

Though this should not matter to the script, it matters to users making their own config file.

Particular care is required not to add a full-stop to short book names such as Amos and Job, where the abbreviation used in references is identical to the name.

It should go without saying that no two book abbreviations should be the same.

Include a reminder about this risk for the abbreviations of Judges, Judith & Jude or Philippians & Philemon.

By the way, although the example abbreviations are all 3 characters long, this is not a requirement! I have encountered a set of abbreviations in which the shortest was the single character J for John.

adyeths commented 6 years ago

The example config file is just that, an example. It doesn't dictate how the abbreviations are supposed to be composed, how many entries there should be, or what characters are allowed in the abbreviations. People should use whatever abbreviations they used in the text, including punctuation. So if, for example, they used Judg. for Judges, then that's what belongs in the config file for orefs. And if they used multiple abbreviations for a particular book, they should add multiple abbreviations to the config file. (Like I did for Wisdom of Solomon in the example config file.)
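The two points above (several abbreviations per book are fine, but no abbreviation may belong to two books) can be checked mechanically before running orefs. The following is only a minimal sketch: the dictionary layout, the sample abbreviations and the function name are assumptions made for illustration, and this is not the actual orefs config syntax.

```python
from collections import defaultdict

# Hypothetical mapping of osisID -> abbreviations exactly as they appear in
# the text, punctuation included. NOT the real orefs config format.
ABBREVIATIONS = {
    "Judg": ["Judg."],
    "Jdt": ["Jdt."],
    "Jude": ["Jude"],          # short name, no full-stop added
    "Phil": ["Phil."],
    "Phlm": ["Phlm."],
    "Wis": ["Wis.", "Wisd."],  # more than one abbreviation for one book
    "Amos": ["Amos"],
    "Job": ["Job"],
}


def find_duplicate_abbreviations(mapping):
    """Return {abbreviation: [books]} for abbreviations claimed by several books."""
    owners = defaultdict(list)
    for book, abbrs in mapping.items():
        for abbr in abbrs:
            owners[abbr].append(book)
    return {abbr: books for abbr, books in owners.items() if len(books) > 1}


if __name__ == "__main__":
    clashes = find_duplicate_abbreviations(ABBREVIATIONS)
    for abbr, books in clashes.items():
        print(f"{abbr!r} is used for more than one book: {', '.join(books)}")
```

An empty result means every abbreviation identifies exactly one book; a clash such as using Jud. for both Judges and Jude would be reported.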

I would be very cautious running orefs on a text that uses a single character abbreviation! That could produce very unpredictable results. (Most likely causing many references to fail to be converted to osisRef attributes.)

DavidHaslam commented 6 years ago

The following tab delimited table may satisfy your curiosity with regard to my reported encounter with single letter abbreviations.

Name    Number  ID  Abbr.   osisID  Polish  Verbose book title  pl-utf8.conf    PolGdansk book title    Year    MM-DD
Genesis 01  GEN Rdz Gen Ks. Rodzaju Księga Rodzaju  Rodzaju 1 Mojżeszowa    2017    04-08
Exodus  02  EXO Wj  Exod    Ks. Wyjścia Księga Wyjścia  Wyjścia 2 Mojżeszowa    2017    04-08
Leviticus   03  LEV Kpł Lev Ks. Kapłańska   Księga Kapłańska    Kapłańska   3 Mojżeszowa    2017    04-03
Numbers 04  NUM Lb  Num Ks. Liczb   Księga Liczb    Liczb   4 Mojżeszowa    2017    04-08
Deuteronomy 05  DEU Pwt Deut    Ks. Powt. Prawa Księga Powtórzonego Prawa   Powtórzonego Prawa  5 Mojżeszowa    2017    04-08
Joshua  06  JOS Joz Josh    Ks. Jozuego Księga Jozuego  Jozuego Jozuego 2017    03-28
Judges  07  JDG Sdz Judg    Ks. Sędziów Księga Sędziów  Sędziów Sędziów 2017    03-28
Ruth    08  RUT Rt  Ruth    Ks. Rut Księga Rut  Rut Ruty    2017    03-28
1 Samuel    09  1SA 1Sm 1Sam    I Ks. Samuela   I Księga Samuela    1 Samuela   1 Samuelowa 2017    04-08
2 Samuel    10  2SA 2Sm 2Sam    II Ks. Samuela  II Księga Samuela   2 Samuela   2 Samuelowa 2017    04-08
1 Kings 11  1KI 1Krl    1Kgs    I Ks. Królewska I Księga Królewska  1 Królewska 1 Królewska 2017    03-28
2 Kings 12  2KI 2Krl    2Kgs    II Ks. Królewska    II Księga Królewska 2 Królewska 2 Królewska 2017    04-08
1 Chronicles    13  1CH 1Krn    1Chr    I Ks. Kronik    I Księga Kronik 1 Kronik    1 Kronik    2017    04-04
2 Chronicles    14  2CH 2Krn    2Chr    II Ks. Kronik   II Księga Kronik    2 Kronik    2 Kronik    2017    04-08
Ezra    15  EZR Ezd Ezra    Ks. Ezdrasza    Księga Ezdrasza Ezdrasza    Ezdraszowa  2017    03-28
Nehemiah    16  NEH Ne  Neh Ks. Nehemiasza  Księga Nehemiasza   Nehemiasza  Nehemijaszowa   2017    04-08
Esther  17  EST Est Esth    Ks. Estery  Księga Estery   Estery  Estery  2017    03-28
Job 18  JOB Hi  Job Ks. Hioba   Księga Hioba    Hioba   Ijobowa 2017    04-08
Psalms  19  PSA Ps  Ps  Ks. Psalmów Księga Psalmów  Psalmów Psalmów 2017    04-08
Proverbs    20  PRO Prz Prov    Ks. Przysłów    Księga Przysłów Przysłów    Przypowieści Salomonowych   2017    03-28
Ecclesiastes    21  ECC Kaz Eccl    Ks. Kaznodziei  Księga Kaznodziei   Koheleta    Kaznodziei Salomona 2017    04-08
Song of Songs   22  SNG Pnp Song    Pieśń nad Pieśniami Pieśń nad Pieśniami Pieśni nad pieśniami    Pieśń Salomona  2017    03-28
Isaiah  23  ISA Iz  Isa Ks. Izajasza    Księga Izajasza Izajasza    Izajasz 2017    04-08
Jeremiah    24  JER Jr  Jer Ks. Jeremiasza  Księga Jeremiasza   Jeremiasza  Jeremijasz  2017    04-08
Lamentations    25  LAM Lm  Lam Lamentacje  Lamentacje  Lamentacje  Treny Jeremijaszowe 2017    03-28
Ezekiel 26  EZK Ez  Ezek    Ks. Ezechiela   Księga Ezechiela    Ezechiela   Ezechyjel   2017    03-30
Daniel  27  DAN Dn  Dan Ks. Daniela Księga Daniela  Daniela Danijel 2017    04-08
Hosea   28  HOS Oz  Hos Ks. Ozeasza Księga Ozeasza  Ozeasza Ozeasz  2017    03-28
Joel    29  JOL Jl  Joel    Ks. Joela   Księga Joela    Joela   Joel    2017    03-28
Amos    30  AMO Am  Amos    Ks. Amosa   Księga Amosa    Amosa   Amos    2017    03-28
Obadiah 31  OBA Ab  Obad    Ks. Abdiasza    Księga Abdiasza Abdiasza    Abdyjasz    2017    03-28
Jonah   32  JON Jon Jonah   Ks. Jonasza Księga Jonasza  Jonasza Jonasz  2017    03-28
Micah   33  MIC Mi  Mic Ks. Micheasza   Księga Micheasza    Micheasza   Micheasz    2017    03-28
Nahum   34  NAM Na  Nah Ks. Nahuma  Księga Nahuma   Nahuma  Nahum   2017    03-28
Habakkuk    35  HAB Ha  Hab Ks. Habakuka    Księga Habakuka Habakuka    Abakuk  2017    03-28
Zephaniah   36  ZEP So  Zeph    Ks. Sofoniasza  Księga Sofoniasza   Sofoniasza  Sofonijasz  2017    03-28
Haggai  37  HAG Ag  Hag Ks. Aggeusza    Księga Aggeusza Aggeusza    Aggieusz    2017    03-28
Zechariah   38  ZEC Za  Zech    Ks. Zachariasza Księga Zachariasza  Zachariasza Zacharyjasz 2017    03-28
Malachi 39  MAL Ml  Mal Ks. Malachiasza Księga Malachiasza  Malachiasza Malachyjasz 2017    03-28
Matthew 41  MAT Mt  Matt    Ew. Mateusza    Ewangelia według świętego Mateusza  Mateusza    Mateusza    2017    03-29
Mark    42  MRK Mk  Mark    Ew. Marka   Ewangelia według świętego Marka Marka   Marka   2017    04-08
Luke    43  LUK Łk  Luke    Ew. Łukasza Ewangelia według świętego Łukasza   Łukasza Łukasza 2017    03-29
John    44  JHN J   John    Ew. Jana    Ewangelia według świętego Jana  Jana    Jana    2017    03-29
Acts    45  ACT Dz  Acts    Dzieje Apostolskie  Dzieje Apostolskie  Dzieje  Dzieje Apostolskie  2017    03-30
Romans  46  ROM Rz  Rom List do Rzymian List świętego Pawła apostoła do Rzymian Rzymian Rzymian 2017    03-29
1 Corinthians   47  1CO 1Kor    1Cor    I List do Koryntian Pierwszy list świętego Pawła apostoła do Koryntian  1 Koryntian 1 Koryntów  2017    04-08
2 Corinthians   48  2CO 2Kor    2Cor    II List do Koryntian    Drugi list świętego Pawła apostoła do Koryntian 2 Koryntian 2 Koryntów  2017    04-08
Galatians   49  GAL Ga  Gal List do Galatów List świętego Pawła apostoła do Galacjan    Galatów Galatów 2017    03-29
Ephesians   50  EPH Ef  Eph List do Efezjan List świętego Pawła apostoła do Efezjan Efezjan Efezów  2017    03-29
Philippians 51  PHP Flp Phil    List do Filipian    List świętego Pawła apostoła do Filipian    Filipian    Filipensów  2017    03-29
Colossians  52  COL Kol Col List do Kolosan List świętego Pawła apostoła do Kolosan Kolosan Kolosensów  2013    03-29
1 Thessalonians 53  1TH 1Tes    1Thess  I List do Tesaloniczan  Pierwszy list świętego Pawła apostoła do Tesaloniczan   1 Tesaloniczan  1 Tesalonicensów    2017    03-29
2 Thessalonians 54  2TH 2Tes    2Thess  II List do Tesaloniczan Drugi list świętego Pawła apostoła do Tesaloniczan  2 Tesaloniczan  2 Tesalonicensów    2017    03-29
1 Timothy   55  1TI 1Tm 1Tim    I List do Tymoteusza    Pierwszy list świętego Pawła apostoła do Tymoteusza 1 Tymoteusza    1 Tymoteusza    2013    03-29
2 Timothy   56  2TI 2Tm 2Tim    II List do Tymoteusza   Drugi list świętego Pawła apostoła do Tymoteusza    2 Tymoteusza    2 Tymoteusza    2017    03-29
Titus   57  TIT Tt  Titus   List do Tytusa  List świętego Pawła apostoła do Tytusa  Tytusa  Tytusa  2017    03-29
Philemon    58  PHM Flm Phlm    List do Filemona    List świętego Pawła apostoła do Filemona    Filemona    Filemona    2017    03-29
Hebrews 59  HEB Hbr Heb List do Hebrajczyków    List świętego Pawła apostoła do Hebrajczyków    Hebrajczyków    Żydów   2017    03-29
James   60  JAS Jk  Jas List Jakuba List świętego Jakuba apostoła   Jakuba  Jakóba  2017    03-29
1 Peter 61  1PE 1P  1Pet    I List Piotra   Pierwszy list świętego Piotra apostoła  1 Piotra    1 Piotra    2017    03-29
2 Peter 62  2PE 2P  2Pet    II List Piotra  Drugi list świętego Piotra apostoła 2 Piotra    2 Piotra    2017    03-29
1 John  63  1JN 1J  1John   I List Jana Pierwszy list świętego Jana apostoła    1 Jana  1 Jana  2017    03-29
2 John  64  2JN 2J  2John   II List Jana    Drugi list świętego Jana apostoła   2 Jana  2 Jana  2017    03-29
3 John  65  3JN 3J  3John   III List Jana   Trzeci list świętego Jana apostoła  3 Jana  3 Jana  2017    03-30
Jude    66  JUD Jud Jude    List Judy   List świętego Judy apostoła Judy    Judasa  2017    03-29
Revelation  67  REV Obj Rev Objawiene Jana  Objawienie świętego Jana apostoła   Apokalipsa  Objawienie Jana 2017    03-29

In this environment, the tabs soon go askew, so you may find it more convenient to view the data in a spreadsheet.

DavidHaslam commented 6 years ago

UBG2017 Booknames.zip

adyeths commented 6 years ago

Yeah, that would definitely cause problems with orefs. Single letter abbreviations are a very bad idea. (And it's very unlikely that I would be able to make orefs work properly for something like this.)

DavidHaslam commented 6 years ago

Is that because the one-letter abbreviation J is the first character of at least one two-letter abbreviation, Jk?

Or is it because of something much more general in scope?

One other difficulty I had to tackle was that 2 abbreviations had characters with diacritics: Kpł and Łk.

I'm guessing that your script is bulletproof for the whole of Unicode?

adyeths commented 6 years ago

Yes, that's exactly why. The J being the first character of other abbreviations is what will cause problems.

Yes, it should be bulletproof for the whole of Unicode. Diacritics shouldn't matter.
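To illustrate the prefix problem, here is a small sketch of what plain substring replacement does when one abbreviation starts another. It is not the actual orefs code; the function name, the bracket markup and the sample reference string are made up for the example.

```python
def naive_tag(text, abbreviations):
    """Mark up each abbreviation using plain str.replace (substring matching)."""
    for abbr, osis in abbreviations.items():
        text = text.replace(abbr, f"[{osis}]")
    return text


# Abbreviations taken from the Polish table: J (John), Jk (James), Łk (Luke).
ABBRS = {"J": "John", "Jk": "Jas", "Łk": "Luke"}

result = naive_tag("J 3,16; Jk 1,5; Łk 2,1", ABBRS)
print(result)
```

Because J is replaced first, the J inside Jk is consumed as well, so the James reference is never recognised as Jk. The abbreviation with a diacritic (Łk) is handled like any other string, since Python 3 strings are Unicode.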

DavidHaslam commented 6 years ago

Why not just use an option for whole-word matching (if there is such a thing in Python)?

Then J could never match Jk – this is a rather basic concept, surely?

adyeths commented 6 years ago

I used the fastest method possible in orefs, which was a simple string replacement rather than a regular expression. To match whole words I would have to switch to regular expressions, as that would be the only way to do it. I'm not sure how complicated that will be yet. (And I prefer not to use regular expressions if at all possible. They're slower for modifying text in Python. It's also far too easy to mess them up and create additional problems that are hard to track down.)

adyeths commented 6 years ago

I changed the code to use a simple regular expression to match the book abbreviations. It might allow this particular single character abbreviation to work without causing others to fail. Further testing would be needed. I only confirmed that switching to a regular expression didn't break anything that already worked for me. (Single character abbreviations are still a very bad idea.)
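For readers wondering what whole-word matching of abbreviations can look like in Python, here is one possible sketch. It is an assumption about the general technique, not the regular expression that was actually committed to orefs; the function names and the bracket markup are hypothetical. It uses lookarounds instead of `\b` so that abbreviations ending in a full-stop still match, and tries longer abbreviations first.

```python
import re


def build_pattern(abbreviations):
    """Compile one regex that matches any abbreviation as a whole token."""
    # Longest first, so that "Jk" is tried before "J".
    alternatives = sorted(abbreviations, key=len, reverse=True)
    joined = "|".join(re.escape(abbr) for abbr in alternatives)
    # No word character may directly precede or follow the abbreviation.
    return re.compile(rf"(?<!\w)(?:{joined})(?!\w)")


def tag(text, abbreviations):
    """Replace each whole-token abbreviation with a bracketed osisID."""
    pattern = build_pattern(abbreviations)
    return pattern.sub(lambda m: f"[{abbreviations[m.group(0)]}]", text)


ABBRS = {"J": "John", "Jk": "Jas", "1J": "1John", "Łk": "Luke", "Judg.": "Judg"}

result = tag("J 3,16; Jk 1,5; 1J 2,1; Łk 2,1; Judg. 4,4", ABBRS)
print(result)
```

With this approach the single character J no longer matches inside Jk, 1J or an ordinary word such as Jan, and `re.escape` keeps a full-stop in an abbreviation from being read as a regex wildcard. A reference written without a space after the full-stop (for example Judg.4) would not match this particular pattern.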

DavidHaslam commented 6 years ago

A single character abbreviation is reality for one book in this Polish version.

It's completely unambiguous for the human reader.

It's not so bad an idea for anyone but a computer programmer.

From the translators' viewpoint, it's a very efficient abbreviation scheme with the minimum average number of characters.

DavidHaslam commented 6 years ago

What would Samuel Finley Breese Morse have said? :)

adyeths commented 6 years ago

I'm not sure what he would have said, but it would no doubt involve a lot of dots and dashes. ;)