adyeths / u2o

USFM to OSIS bible format converter.
The Unlicense

bad conversion of footnote references #49

Closed LAfricain closed 6 years ago

LAfricain commented 6 years ago

The conversions of \f + \fr Est.1,1\f* and \f + \fr Est.1.1\f* don't produce a real link in the OSIS reference. The output is <reference type="annotateRef">Est.1.1</reference>. Should it be <reference osisRef="Est.1.1"</reference> instead?

adyeths commented 6 years ago

The conversions of \f + \fr Est.1,1\f* should produce <reference type="annotateRef">Est.1,1</reference> and the conversion of \f + \fr Est.1.1\f* should produce <reference type="annotateRef">Est.1.1</reference> by my u2o script. It looks like that's exactly what's happening.

orefs should be able to add the appropriate osisRef attributes to those tags afterwards. That is why I wrote it. None of the references created by u2o produce actual reference links. Additional processing is needed to fix the references.

This <reference osisRef="Est.1.1"</reference> is definitely not correct.
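To make the difference concrete, here is a minimal Python sketch of the kind of post-processing step described above: it takes a tag as u2o emits it and adds an osisRef attribute next to the existing type attribute. This is not the orefs implementation. The function name `add_osisref`, the tiny `BOOKS` table and both regular expressions are assumptions made for illustration, and the sketch only handles a single book/chapter/verse reference.

```python
import re

# Hypothetical abbreviation table; a real run would need every book.
BOOKS = {"Est": "Esth", "Esth": "Esth"}

# Matches a reference tag as produced by u2o, e.g.
# <reference type="annotateRef">Est.1,1</reference>
REFTAG = re.compile(r'<reference type="annotateRef">([^<]+)</reference>')

# Book, chapter and verse separated by spaces, dots or commas.
SIMPLE = re.compile(r"^\s*(\w+)[\s.]+(\d+)[.,]\s*(\d+)\s*$")


def add_osisref(line):
    """Add an osisRef attribute to simple annotateRef tags in one line."""
    def repl(match):
        text = match.group(1)
        parsed = SIMPLE.match(text)
        if not parsed or parsed.group(1) not in BOOKS:
            return match.group(0)  # leave anything unrecognized untouched
        book, chapter, verse = parsed.groups()
        osisref = f"{BOOKS[book]}.{chapter}.{verse}"
        return (
            f'<reference type="annotateRef" osisRef="{osisref}">'
            f"{text}</reference>"
        )
    return REFTAG.sub(repl, line)
```

The visible text of the reference stays exactly as it was in the USFM; only the attribute is added.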

LAfricain commented 6 years ago

OK, thank you for this information. If I understand correctly, after using u2o I need to run the orefs script on the OSIS file to link all the cross-references. Is that right?

LAfricain commented 6 years ago

I tried orefs, this is the result:

 <note placement="foot">
          <reference type="annotateRef" osisRef="">Est.1,1<!-- orefs - unprocessed reference --></reference>
        </note>
        <verse eID="Esth.11.1-12"/>
        <chapter eID="Esth.11"/>
      </p>
      <p>
        <chapter sID="Esth.12" osisID="Esth.12" n="12"/>
        <verse sID="Esth.12.1-5" osisID="Esth.12.1 Esth.12.2 Esth.12.3 Esth.12.4 Esth.12.5" n="1-5"/>
        <note placement="foot">
          <reference type="annotateRef" osisRef="">Est.1.1<!-- orefs - unprocessed reference --></reference>

That was with this command: ./orefs.py -v -i ../osis/lxx.osis.xml -o ../osis/lxx.osis_ref.xml

And this USFM:

\v 1-12 \f + \fr Est.1,1\f*

\c 12
\p
\v 1-5 \f + \fr Est.1.1\f*
\c 13
\p
\v 1-7 \f + \fr  Est 3.13\f*
\v 8-18 \f + \fr Esth.4.17\f*

We are close to the goal.

adyeths commented 6 years ago

You are correct. First you run u2o to create an osis file. Then you run orefs to process the osis file and add proper osisRef attributes.

In order to process the references above you will need to use a config file. The readme for the orefs utility tries to explain this. orefs can also generate one for you automatically, which you can then edit as needed, so you don't have to create it manually.

LAfricain commented 6 years ago

OK, I generated the CONFIGFILE and saved it in the same folder as orefs.py (with the USFM). I only have references for Esther, but the result is the same. Is the name CONFIGFILE correct? Or do I need to add it in the orefs.py script?

adyeths commented 6 years ago

The config file can be named anything you like. You just tell orefs the name of the config file to use. If you named it CONFIGFILE then you would tell orefs to use it something like this:

orefs.py -i inputfile.osis -o outputfile.osis -c CONFIGFILE

This way it will use CONFIGFILE (or whatever you choose to name it instead) for processing the references instead of trying to do it automatically using the default settings.
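For anyone scripting this step, the command above can be assembled in Python. This is only a sketch: it uses the -v, -i, -o and -c flags exactly as they appear in this thread, the helper name `build_orefs_command` is made up for illustration, and it assumes orefs.py is executable and on the PATH.

```python
import subprocess


def build_orefs_command(infile, outfile, configfile=None, verbose=False):
    """Return the argument list for an orefs run, using flags seen in this thread."""
    cmd = ["orefs.py"]
    if verbose:
        cmd.append("-v")
    cmd += ["-i", infile, "-o", outfile]
    if configfile is not None:
        # Without -c, orefs falls back to its default settings.
        cmd += ["-c", configfile]
    return cmd


if __name__ == "__main__":
    command = build_orefs_command("inputfile.osis", "outputfile.osis", "CONFIGFILE")
    print(" ".join(command))
    # Uncomment to actually run it (assumes orefs.py is on PATH):
    # subprocess.run(command, check=True)
```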

LAfricain commented 6 years ago

Ok it works perfectly!

LAfricain commented 6 years ago

Hello, I ran a new test on konvb (USFM for Kikongo). Now it is the \r marker that is converted to OSIS. Sometimes the marker is followed by a "(", or contains text, for instance: "See also the reference...". I get this error:

cyrille@W54:~/Documents/gitlab/konvb$ orefs.py -v -i osis/konvb.osis.xml -o ../osis/konvb_ref.osis.xml -c osis/CONFIGKONVB 
Reading input file osis/konvb.osis.xml ...
Getting book names and abbreviations...
Using config file for abbreviations...
Processing cross references...
WARNING: Reference not processed… Luke  23-38)
WARNING: Reference not processed… John  19-23)
WARNING: Reference not processed… Luke  7-9)
WARNING: Reference not processed… John  24-28)
WARNING: Reference not processed… John  29-34)
WARNING: Reference not processed… Luke  1-13)
WARNING: Reference not processed… Luke  14-15)
WARNING: Reference not processed… Luke  1-11)
Traceback (most recent call last):
  File "/home/cyrille/.bin/orefs.py", line 237, in vrschk
    rval = str(int(num))
ValueError: invalid literal for int() with base 10: ''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cyrille/.bin/orefs.py", line 517, in <module>
    main()
  File "/home/cyrille/.bin/orefs.py", line 513, in main
    processfile(args)
  File "/home/cyrille/.bin/orefs.py", line 443, in processfile
    text = processreferences(text, bookabbrevs, bookabbrevs2)
  File "/home/cyrille/.bin/orefs.py", line 208, in processreferences
    lines[i] = reftag.sub(simplerepl, lines[i], 0)
  File "/home/cyrille/.bin/orefs.py", line 180, in simplerepl
    osisrefs, oreferror = getosisrefs(text, currentbook, abbr, abbr2)
  File "/home/cyrille/.bin/orefs.py", line 405, in getosisrefs
    tmp = vrschk(j)
  File "/home/cyrille/.bin/orefs.py", line 240, in vrschk
    if num[-1] in "ABCDabcd":
IndexError: string index out of range

If you need files for testing, you can find them in the GitLab repo: here. The correct config file (for the references) can be found: here.

adyeths commented 6 years ago

Thanks for pointing me to the files that are causing the problems. That will be very helpful in fixing the issues with processing references.

adyeths commented 6 years ago

The bug that caused orefs to crash is now corrected. Parentheses that may surround references are now handled and no longer cause problems.
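For readers following the traceback: int('') raised a ValueError, and the handler then indexed num[-1] on the same empty string, which raised the IndexError. The sketch below shows one defensive way to avoid both, assuming a verse string may be empty or carry a part letter. It is not the actual fix committed to orefs; the names `strip_parens` and `check_verse`, the choice to return None, and the choice to drop the part letter are all assumptions.

```python
def strip_parens(ref):
    """Remove surrounding whitespace and parentheses from a reference string."""
    return ref.strip().strip("()").strip()


def check_verse(num):
    """Normalize a verse such as '12' or '12a'; return None if it is unusable."""
    num = num.strip()
    if not num:
        # An empty string is what triggered the crash in the traceback.
        return None
    if num[-1] in "ABCDabcd":
        # Assumption: a trailing part letter is simply dropped.
        num = num[:-1]
    try:
        return str(int(num))
    except ValueError:
        return None
```

Checking for the empty string before indexing means the caller can mark the reference as unprocessed instead of crashing.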

Text preceding a reference is already ignored when processing references with orefs. Text following a reference may not be. I'm unsure if I will ever be able to handle that particular situation, which is one of the reasons I made sure to have orefs mark where references were not processed, so that manual fixes can be made afterwards.
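Since orefs marks what it could not process, the places needing manual attention can be listed with a short script. The sketch below searches for the marker comment exactly as it appears in the output quoted earlier in this thread. The function name `find_unprocessed` is made up, and it assumes each reference element sits on a single line and contains no nested tags.

```python
import re

MARKER = "<!-- orefs - unprocessed reference -->"

# A reference element whose text ends with the marker comment.
UNPROCESSED = re.compile(
    r"<reference\b[^>]*>([^<]*)" + re.escape(MARKER) + r"</reference>"
)


def find_unprocessed(osis_text):
    """Return (line_number, reference_text) pairs for marked references."""
    found = []
    for lineno, line in enumerate(osis_text.splitlines(), start=1):
        for match in UNPROCESSED.finditer(line):
            found.append((lineno, match.group(1).strip()))
    return found
```

Running this over the orefs output file gives a checklist of references to fix by hand.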

LAfricain commented 6 years ago

It runs well now, but a few errors can still be noticed. For instance, this reference is not completely converted:

\r (Mt 24, 42 ; 25, 13-15 ; Lk 12, 36-38 ; 19, 12-13)

It gives this:

<reference type="parallel" osisRef="Luke.12.36-Luke.12.38">(Mt 24,42 ; 25, 13-15 ; Lk 12, 36-38 ; 19, 12-13)<!-- orefs - unprocessed reference --></reference>

And another:

<reference type="parallel" osisRef="">(Mk 1, 39 ; Lk 4,44 ye 6, 17-18)<!-- orefs - unprocessed reference --></reference>

And:

<reference type="parallel" osisRef="Luke.14.34-Luke.14.35">(Mk 9, 50 ; 4, 21 ; Lk 14, 34-35 ; 8, 16 ; 11, 33)<!-- orefs - unprocessed reference --></reference>

adyeths commented 6 years ago

Yes. I'm not sure why it's not handling many of those. I'm still investigating.

adyeths commented 6 years ago

Some of the references were not being processed because of whitespace that orefs was not properly handling. I have fixed that issue.
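As an illustration of the kind of whitespace involved, the sketch below normalizes a reference string before parsing: it collapses runs of spaces (including no-break spaces, which some typographic conventions place before punctuation) and removes the space after a chapter/verse comma. This is an assumption about what such a clean-up could look like, not the change made in orefs; `normalize_spaces` is a hypothetical name.

```python
import re

# Regular whitespace plus no-break (U+00A0) and narrow no-break (U+202F) spaces.
SPACES = re.compile(r"[\s\u00a0\u202f]+")

# A comma between two digits followed by whitespace, e.g. "24, 42".
COMMA_GAP = re.compile(r"(\d),\s+(\d)")


def normalize_spaces(ref):
    """Collapse whitespace runs and tighten 'chapter, verse' to 'chapter,verse'."""
    collapsed = SPACES.sub(" ", ref).strip()
    return COMMA_GAP.sub(r"\1,\2", collapsed)
```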

Most of the other references that aren't being processed are because of abbreviations that are not in the config file. (Mc, Lc, and Jn for example.) Add the additional abbreviations where they are needed in the config file and most of the unprocessed references will be handled.

LAfricain commented 6 years ago

Most of the other references that aren't being processed are because of abbreviations that are not in the config file. (Mc, Lc, and Jn for example.)

Yes, this was my error; I have corrected it. The abbreviations were in French and I changed them to Kikongo. Some issues still remain, like this one:

<reference type="parallel" osisRef="Mark.9.50 Luke.14.34-Luke.14.35">(Mk 9, 50 ; 4, 21 ; Lk 14, 34-35 ; 8, 16 ; 11, 33)

I don't know if it helps, but in French (and also in Kikongo, because it is used in a French-speaking country) we put a non-breaking space before a ";". There are also some errors with the word "ye" (which means "and"). What should be done with these? Fix them manually? See this example:

<reference type="parallel" osisRef="Mark.6.7-Mark.6.11">(Mk 6,7-11 ; Lk 9,2-5 ye 10,3-12)

Thank you already for improving orefs.py!

LAfricain commented 6 years ago

I still noticed this error:

<reference type="parallel" osisRef="Matt.5.15 Mark.10.26 Luke.8.16-Luke.8.17 Mark.11.33">(Mt 5,15 ; 10,26 ; Lk 8,16-17 ; 11,33)

Mark.10.26 should be Matt.10.26, because it follows Matt.5.15. In the same way, Mark.11.33 should be Luke.11.33.

adyeths commented 6 years ago

References such as those containing ye will have to be fixed manually. I can't keep orefs generic and still handle those situations.

Regarding the other error, part of the problem lies in the way orefs handles multiple references, and references within a book when the book is not specified. Making orefs handle this particular situation will likely be difficult.
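The behavior the reporter expects, where a reference without a book inherits the most recently named book, can be sketched in isolation. This is not how orefs works and not a proposed patch: the `BOOKS` table, the `PART` pattern and the function name `to_osisrefs` are assumptions, and the sketch only understands the "chapter,verse" and "chapter,verse-verse" forms used in the example.

```python
import re

# Hypothetical abbreviation table; in practice this would come from the config file.
BOOKS = {"Mt": "Matt", "Mk": "Mark", "Lk": "Luke"}

# Optional book abbreviation, then chapter,verse with an optional verse range.
PART = re.compile(r"^(?:([A-Za-z]+)\s+)?(\d+),\s*(\d+)(?:-(\d+))?$")


def to_osisrefs(ref):
    """Convert e.g. '(Mt 5,15 ; 10,26)' to OSIS refs, reusing the last named book."""
    book = None
    out = []
    for part in ref.strip().strip("()").split(";"):
        match = PART.match(part.strip())
        if not match:
            raise ValueError(f"cannot parse {part!r}")
        abbr, chapter, first, last = match.groups()
        if abbr is not None:
            book = BOOKS[abbr]  # remember the book for later book-less parts
        if book is None:
            raise ValueError(f"no book known for {part!r}")
        start = f"{book}.{chapter}.{first}"
        out.append(f"{start}-{book}.{chapter}.{last}" if last else start)
    return " ".join(out)
```

The difficulty in a generic tool is that a book-less reference could just as well mean the book currently being processed, which this sketch does not try to decide.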

When I wrote orefs, I tried to make it handle situations where multiple verses and verse ranges could be specified. For this I use the SEPP separator rather than the SEPM separator. SEPM allows multiple different references to be specified, but the book always has to be specified or it will default to whatever book is currently being processed. (In the case above, the book is Mark, which is why it says Mark in the osisRef.) The SEPP separator, by contrast, allows multiple verses and verse ranges to be specified without having to repeat the book name.

To illustrate... the reference being processed currently says this:

(Mt 5,15 ; 10,26 ; Lk 8,16-17 ; 11,33)

Since the SEPP character in your config file is ".", if the above were changed to this:

(Mt 5,15 . 10,26 ; Lk 8,16-17 . 11,33)

then it would be processed correctly by orefs. I hope the explanation makes sense. Manual fixes will have to be done if an appropriate character for SEPP can't be used here... at least for now until I can figure out a better way to do things.
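The manual change described above can be partly automated on the reference text. The sketch below replaces the separator in front of any segment that does not start with a letter, on the assumption that such a segment has no book name. The function name `use_sepp_for_bookless` is hypothetical, ";" as the SEPM character is inferred from the examples in this thread, and books whose abbreviation starts with a digit would be misclassified.

```python
import re

# True when the first non-space character of a segment is a letter.
STARTS_WITH_BOOK = re.compile(r"^\s*[^\W\d_]")


def use_sepp_for_bookless(ref, sepm=";", sepp="."):
    """Replace sepm with sepp in front of segments that do not name a book."""
    parts = ref.split(sepm)
    out = parts[0]
    for part in parts[1:]:
        joiner = sepm if STARTS_WITH_BOOK.match(part) else sepp
        out += joiner + part
    return out
```

Applied to the example from this thread, it yields the rewritten form shown above; any output should still be reviewed by hand before committing it to the USFM.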

LAfricain commented 6 years ago

OK, your explanation makes sense. I will change the USFM file manually. Thank you very much!