adyeths / u2o

USFM to OSIS bible format converter.
The Unlicense

Orefs.py doesn't convert special refs #77

Open LAfricain opened 5 years ago

LAfricain commented 5 years ago

I have a very particular type of reference in swe1917. All book names end with a ".", for example:

Gen = Första Mosebok., 1 Mos.
Exod = Andra Mosebok., 2 Mos.
Lev = Tredje Mosebok., 3 Mos.
Num = Fjärde Mosebok., 4 Mos.
Deut = Femte Mosebok., 5 Mos.
Josh = Josua., Jos.
Judg = Domarboken., Dom.

But the character that separates multiple references is also a ".". For example: 1 Mos. 17,1. 26,24. 35,11. As a result, orefs.py does not convert the reference using the correct book, but using the current book. For example: <reference osisRef="Exod.78.44 Exod.105.29">Ps. 78,44. 105,29.<!-- orefs - unprocessed reference --> Here the current book is Exodus but the target of the reference is Psalms, and orefs converts the reference as Exodus. It also adds this warning: <!-- orefs - unprocessed reference -->, I think because the reference line ends with a ".".

adyeths commented 5 years ago

I will look into trying to fix this over the next few days.

That warning is added any time orefs is not able to process a reference. That's to make it easier to locate where references were not processed so that manual fixes can be applied afterwards.

LAfricain commented 5 years ago

> That warning is added any time orefs is not able to process a reference.

Yes, I know, but in this case it should not be there, because orefs was able to process the reference.

adyeths commented 5 years ago

It put the warning there because right now it thinks Ps. is a separate reference. It sees 3 references. It's entirely expected.
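To illustrate why the abbreviation ends up as a reference of its own, here is a minimal sketch. It is not the actual orefs code: `naive_split` is a hypothetical helper that assumes references are separated purely by splitting on the period.

```python
def naive_split(text, sep="."):
    """Split a reference string on the separator and trim each fragment.

    Hypothetical helper for illustration only; it is not part of orefs.
    """
    return [part.strip() for part in text.split(sep)]


# The period after the abbreviation "Ps" is treated as a reference separator,
# so "Ps" becomes a fragment with no chapter or verse, and the trailing period
# leaves an empty fragment at the end.
print(naive_split("Ps. 78,44. 105,29."))
# ['Ps', '78,44', '105,29', '']
```

Under this assumption there are three non-empty fragments, and the first one carries no chapter or verse, which would explain both the warning and the fallback to the current book.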

LAfricain commented 5 years ago

Ok I understand.

adyeths commented 5 years ago

This should be fixed now. Try it and let me know.

LAfricain commented 5 years ago

No it's not working:

<verse sID="Zech.12.10" osisID="Zech.12.10" n="10"/>Men över Davids hus och över Jerusalems invånare skall jag utgjuta en nådens och bönens ande, så att de se upp till mig, och se vem de hava stungit. Och de skola hålla dödsklagan efter honom, såsom man håller dödsklagan efter ende sonen, och skola bittert sörja honom, såsom man sörjer sin förstfödde.<note type="crossReference"><reference osisRef="Zech.6.26 Zech.39.29 Joel.2.28 Rev.8.10 Rev.19.37">Jer. 6,26. Hes. 39,29. Joel 2,28. Am. 8,10. Joh. 19,37.<!-- orefs - unprocessed reference --></reference>

It continues to read the "." only as a separator between references (only books without a "." are built correctly, see Joel above):

WARNING: Reference not processed… Jer
WARNING: Reference not processed… 52,8 f
WARNING: Reference not processed… Lam 
WARNING: Reference not processed… Ps
WARNING: Reference not processed… Jer
WARNING: Reference not processed… 25,15, 21
WARNING: Reference not processed… Lam 
WARNING: Reference not processed… Jes
WARNING: Reference not processed… Lam 
WARNING: Reference not processed… 5 Mos
WARNING: Reference not processed… 28,30 f
WARNING: Reference not processed… Lam 
WARNING: Reference not processed… 2 Mos

If you want to test, see https://gitlab.com/crosswire-bible-society/swe1917/tree/master/osis but please ignore the difference in how some references are written: in the deuterocanonical books they are written like Mark. 3:17., and in the other books, as I said already, like Mark. 3,17.

adyeths commented 5 years ago

I will look at the source so I can investigate this further. Note that inconsistency in the characters used to separate parts of the references is not something that orefs can handle.

adyeths commented 5 years ago

Looking at the source, I can see why it's not working. There are far too many inconsistencies in how the references are written for orefs to reliably process them.

LAfricain commented 5 years ago

> Note that inconsistency in the characters used to separate parts of the references is not something that orefs can handle.

Yes, I plan to standardize that. But orefs should at least process the references that are well written, and that does not work either. Is issue #68 related?

LAfricain commented 5 years ago

I tested with just one file (so all the references use the same form) and the problem is the same. orefs doesn't recognize references that end with a ".".

adyeths commented 5 years ago

orefs thought there was an additional reference since the . was used to separate multiple references. I just added an adjustment to ignore empty references so that shouldn't be marked any longer.
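A minimal sketch of what ignoring empty references could look like, assuming references are split on the period. `split_references` is a hypothetical name and this is not the actual change made in orefs.

```python
def split_references(text, sep="."):
    """Split on the separator, trim whitespace, and drop empty fragments.

    Hypothetical helper for illustration only; it is not part of orefs.
    """
    parts = (part.strip() for part in text.split(sep))
    return [part for part in parts if part]


# The trailing period no longer produces an empty reference, but the
# abbreviation "Am" is still cut off from its chapter and verse.
print(split_references("Joel 2,28. Am. 8,10."))
# ['Joel 2,28', 'Am', '8,10']
```

Dropping empty fragments removes the spurious reference caused by the final period. It does not help with abbreviations that end in a period, which still become fragments of their own.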

DavidHaslam commented 5 years ago

@adyeths

Are abbreviations of compound book names that contain more than one period catered for?

Example: (Latvian Bibles) Dāv.dz. for Dāvida dziesmu grāmata (Ps.)

adyeths commented 5 years ago

@DavidHaslam The only reason the period was problematic here is that it was also used to separate multiple references. That has been corrected. There shouldn't be any issues with abbreviations that contain more than one period.

LAfricain commented 5 years ago

I'm sorry, I posted this on the wrong issue by mistake; the message was meant for this issue. Even now, the problem persists: <reference osisRef="1Kgs.29.45 1Kgs.26.11">2 Mos. 29,45. 3 Mos. 26,11.<!-- orefs - unprocessed reference --></reference> If it's not possible to handle this, should I change all the references instead?

adyeths commented 5 years ago

There are just too many inconsistencies in how the references are written in the swe1917 text for orefs to reliably process them. And the fragments I'm seeing posted here just aren't enough for me to figure out if I can even address the problem in orefs. I will have to wait until the inconsistencies are corrected before I can proceed further with this. (And if they can't be corrected, then the references will have to be processed manually.)

LAfricain commented 5 years ago

> There are just too many inconsistencies in how the references are written in the swe1917

The only inconsistency I see is the chapter and verse separator. I can fix this. Do you see other inconsistencies? You can already have a look at the OSIS file. But we are almost sure the problem is the period that ends the book name. And it's a very common habit among translators. I have already seen it in three modules.

DavidHaslam commented 5 years ago

It is also seen in the Latvian Glück module that I have been working on somewhat last week to help Jānis V.

All the book abbreviations end with a period.

But the period is also used for other purposes.

  1. Between verse and another verse
  2. Between other partial references.
  3. Between complete references.
  4. Within several compound abbreviations.

Comma is used between chapter and verse.

Aside: Most confusing to read for an Englishman.

IMHO. There ought to be a rule for translators that if a book “abbreviation” is not really an abbreviation then there should be no period.

So none after Job, Amos, Joel (e.g.). But this rule is often ignored.

Another quirk is whether or not there’s a space after the period at the end of the book abbreviation.

Messy it can be....

DavidHaslam commented 5 years ago

I should have added

  1. At the end of a reference.

DavidHaslam commented 5 years ago

I typed a 5 but CodeHub or GutHub changed it to a 1.

LAfricain commented 5 years ago

> IMHO. There ought to be a rule for translators that if a book “abbreviation” is not really an abbreviation then there should be no period.

Yes! This is the difficulty of standardization...

adyeths commented 5 years ago

Unless all of the characters used for separators in the references are different, orefs will not be able to process the references. There is no way around this requirement for orefs. orefs will never be able to handle all possible ways a reference can be written. It's just not possible.

I have changed orefs so it doesn't break multiple references apart before processing book abbreviations. This means it will process the references with book abbreviations that include a period even when a period is used elsewhere in the reference.
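The idea of recognizing book abbreviations before separating the references can be sketched as follows. This is only an illustration under stated assumptions, not the orefs implementation: the `BOOKS` table, `TOKEN_RE`, and `to_osis_refs` are hypothetical names, and the sketch assumes a comma between chapter and verse, as in the examples from this thread.

```python
import re

# Hypothetical abbreviation table, for illustration only.
BOOKS = {
    "1 Mos.": "Gen",
    "2 Mos.": "Exod",
    "3 Mos.": "Lev",
    "Ps.": "Ps",
    "Joel": "Joel",
}

# One alternative per book abbreviation (longest first), then "chapter,verse".
_book_alternatives = "|".join(
    re.escape(abbr) for abbr in sorted(BOOKS, key=len, reverse=True)
)
TOKEN_RE = re.compile(
    rf"(?P<book>{_book_alternatives})|(?P<chapter>\d+),(?P<verse>\d+)"
)


def to_osis_refs(text, current_book):
    """Scan the text for tokens instead of splitting it on the period.

    A book abbreviation switches the book for the references that follow it;
    until one is seen, the current book is used.
    """
    book = current_book
    refs = []
    for match in TOKEN_RE.finditer(text):
        if match.group("book"):
            book = BOOKS[match.group("book")]
        else:
            refs.append(f"{book}.{match.group('chapter')}.{match.group('verse')}")
    return " ".join(refs)


print(to_osis_refs("2 Mos. 29,45. 3 Mos. 26,11.", "1Kgs"))
# Exod.29.45 Lev.26.11
```

Because the abbreviation, including its final period, is matched as one token, the period that separates references never has to be told apart from the period that ends a book name. The sketch ignores verse ranges, verse lists, and abbreviations that appear inside longer words.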

I have looked at the osis file for the swe1917 module. There is no consistency with the references in that file. Some are written one way, others are written another. It's extremely messy. orefs expects consistency in how the references are written. Without that consistency it will not be able to process the references.

LAfricain commented 5 years ago

> I have changed orefs so it doesn't break multiple references apart before processing book abbreviations. This means it will process the references with book abbreviations that include a period even when a period is used elsewhere in the reference.

Currently this doesn't work:

osisRef="1Kgs.29.45 1Kgs.26.11">2 Mos. 29,45. 3 Mos. 26,11.<!-- orefs - unprocessed reference --></reference></note> <verse eID="1Kgs.6.13"/></p>

Is it possible for orefs.py to "understand" that the same character can be used for two different kinds of separation? For example, in Ndebele a period is used both to separate chapter and verse and to separate references. To be exact, the period is used when the reference is to another book; if the reference is to the same book, Ndebele uses a semicolon. Another question: would it be possible to add an option in orefs for the characters that stand for the word "following"? In French these are s., or ss. when more than one verse follows; in Swedish, f. and ff. Or is the only way to fix it with a USFM tag?

adyeths commented 5 years ago

> Is it possible for orefs.py to "understand" that the same character can be used for two different kinds of separation?

No, it's not possible with orefs. All separator characters have to be different.

> Another question: would it be possible to add an option in orefs for the characters that stand for the word "following"?

I will look into this. It's not something that I will be able to do quickly, though.
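As a rough illustration of what such an option might involve, here is a minimal sketch. It is not an existing orefs feature: `parse_chapter_verse` and its `following` parameter are hypothetical, and the sketch assumes a comma between chapter and verse.

```python
import re


def parse_chapter_verse(text, following=("f", "ff")):
    """Parse "chapter,verse" with an optional "following" marker.

    Returns (chapter, verse, marker), where marker is None when absent,
    or None when the text does not have this shape. Hypothetical helper
    for illustration only; it is not part of orefs.
    """
    # Longest marker first so "ff" is not read as "f" plus leftover text.
    markers = "|".join(
        re.escape(m) for m in sorted(following, key=len, reverse=True)
    )
    pattern = re.compile(rf"^\s*(\d+),(\d+)(?:\s*({markers})\b\.?)?\s*$")
    match = pattern.match(text)
    if match is None:
        return None
    chapter, verse, marker = match.groups()
    return int(chapter), int(verse), marker


print(parse_chapter_verse("52,8 f"))
# (52, 8, 'f')
print(parse_chapter_verse("3,17 ss.", following=("s", "ss")))
# (3, 17, 'ss')
```

The sketch only recognizes the marker. Turning "f" or "ff" into an actual verse range would need a decision about how many verses "following" covers, which the marker alone does not say.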

DavidHaslam commented 5 years ago

How is it that humans can still make sense of references where the same punctuation mark is used for different purposes?