christos-c / bible-corpus

A multilingual parallel corpus created from translations of the Bible.
Creative Commons Zero v1.0 Universal

Jointed verses in the Japanese-tok Bible #26

Closed: morethanbooks closed this issue 6 months ago

morethanbooks commented 6 months ago

That's a new kind of issue :) In the Japanese-tok Bible, some elements seem to contain the text of more than one verse. The verse id in these elements is something like "b.NUM.15.4 15:5". Although this is not necessarily an error, it definitely makes it harder to compare these verses with the rest of the corpus. I would consider splitting the text of these verses into separate elements, if that's possible.

morethanbooks commented 6 months ago

Here is a list of the verses with this problem.

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.NUM.15.4 15:5'
Match:  
Start location: 12955:28
Offset: 680041
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.1CH.16.12 16:13'
Match:  
Start location: 33670:29
Offset: 1844723
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.PSA.49.8 49:9'
Match:  
Start location: 45676:28
Offset: 2439432
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.PSA.58.4 58:5'
Match:  
Start location: 46081:28
Offset: 2457967
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.PSA.63.5 63:6'
Match:  
Start location: 46276:28
Offset: 2467049
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.PSA.65.2 65:3'
Match:  
Start location: 46333:28
Offset: 2469602
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.PSA.76.8 76:9'
Match:  
Start location: 47044:28
Offset: 2501466
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.PSA.89.50 89:51'
Match:  
Start location: 47941:29
Offset: 2540746
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.PSA.105.5 105:6'
Match:  
Start location: 48691:29
Offset: 2572713
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.PSA.106.21 106:22'
Match:  
Start location: 48874:30
Offset: 2580305
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.PSA.132.3 132:4 132:5'
Match:  
Start location: 50395:29
Offset: 2643528
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.PSA.132.3 132:4 132:5'
Match:  
Start location: 50395:35
Offset: 2643534
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.PRO.26.18 26:19'
Match:  
Start location: 53539:29
Offset: 2774202
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.ISA.4.3 4:4'
Match:  
Start location: 55363:27
Offset: 2862917
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.ROM.1.9 1:10'
Match:  
Start location: 87127:27
Offset: 4586703
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.ROM.2.19 2:20'
Match:  
Start location: 87253:28
Offset: 4593364
Length: 1

System ID: P:\bible-corpus\TEI\Japanese-tok.xml
Description: xml:id='b.ROM.16.25 16:26'
Match:  
Start location: 88432:29
Offset: 4656797
Length: 1
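
For anyone who wants to reproduce a list like the one above without an XML editor, here is a minimal sketch in Python. It assumes the verse id sits in either an `id` or an `xml:id` attribute and that a merged verse is any id containing whitespace; the function name `find_merged_ids` and the sample markup are hypothetical and only illustrate the id pattern reported here.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_merged_ids(xml_text):
    """Return every id value that contains whitespace (i.e. a merged verse)."""
    root = ET.fromstring(xml_text)
    merged = []
    for elem in root.iter():
        # Assumption: the id may be stored as plain `id` or as `xml:id`.
        verse_id = elem.get("id") or elem.get(XML_ID)
        if verse_id and len(verse_id.split()) > 1:
            merged.append(verse_id)
    return merged


# Hypothetical sample; the element and attribute names are assumptions.
sample = """<body>
<seg id="b.NUM.15.3" type="verse">...</seg>
<seg id="b.NUM.15.4 15:5" type="verse">...</seg>
<seg id="b.PSA.132.3 132:4 132:5" type="verse">...</seg>
</body>"""

print(find_merged_ids(sample))
```

For a full file, the same function could be applied to the file's contents read as text; each id is reported once, even when it contains several spaces.
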

christos-c commented 6 months ago

These verses are merged in the original version so it's the same problem we had in Galela (#24). I've used the same solution (kept only the first index) and will add a note to this effect once we finalise the placement.
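
The "keep only the first index" approach can be sketched as follows. This is only an illustration of the idea, not the script actually used for the corpus: the helper names `first_index` and `covered_verses` are hypothetical, and the id layout (`b.BOOK.CHAPTER.VERSE` followed by space-separated `CHAPTER:VERSE` entries) is inferred from the ids listed in this thread.

```python
def first_index(verse_id):
    """Keep only the first index of a merged id, e.g. 'b.NUM.15.4 15:5' -> 'b.NUM.15.4'."""
    parts = verse_id.split()
    return parts[0] if parts else verse_id


def covered_verses(verse_id):
    """Expand a merged id into the full ids of every verse it covers.

    Useful for a note recording which verses were folded into the first index.
    """
    parts = verse_id.split()
    if not parts:
        return []
    # 'b.NUM.15.4' -> 'b.NUM' (drop the chapter and verse fields)
    prefix = parts[0].rsplit(".", 2)[0]
    ids = [parts[0]]
    for extra in parts[1:]:
        chapter, verse = extra.split(":")
        ids.append(f"{prefix}.{chapter}.{verse}")
    return ids


print(first_index("b.NUM.15.4 15:5"))
print(covered_verses("b.PSA.132.3 132:4 132:5"))
```

Ids that are not merged pass through `first_index` unchanged, so the function could be applied to every id in the file without first filtering for the affected verses.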