Isidore-Guild / statenvertaling

OSIS Statenvertaling (Dutch) with apocrypha
Creative Commons Zero v1.0 Universal

Counted word lists supplied for comparison and analysis with a view to possible further corrections #12

Open DavidHaslam opened 3 years ago

DavidHaslam commented 3 years ago

@lemtom

For the SWORD module DutSVV (version 1.5) and my test-build module DutSTV, made from a lightly edited copy of your OSIS file STV.xml, I performed the following steps:

  1. Exported the full module as plain text using the SWORD utility diatheke.
  2. For the diatheke file from DutSTV, I removed the spaces on either side of every hyphen.
  3. Using the Tools menu in BabelPad, I generated a full word frequency analysis for both text files.
  4. The Count column was moved to be the first column while still in the BabelPad UI.
  5. The results were sorted on the Word column while still in the BabelPad UI.
  6. Both analyses were copied to a text file and saved from Notepad++.
  7. I began to use the Compare plugin for Notepad++ to examine the differences.
  8. Both text files were compressed to Zip so that I could upload them here for your own comparative analysis.
  9. Any difference between the two outputs is a case for further investigation using search in a text editor.
  10. Each such location is a candidate for checking against the selected "dead tree" edition that you define as a reference.
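
For anyone who would rather script steps 2 to 9 than use BabelPad and Notepad++, here is a minimal sketch in Python. It is not the procedure I actually ran: the tokenising rule (runs of letters, optionally joined by a hyphen or apostrophe) is an assumption and may not match BabelPad's word boundaries exactly, and the function names are only illustrative.

```python
import re
from collections import Counter

# Assumed word rule: letters, optionally joined by "-", "'" or U+2019.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'\u2019][^\W\d_]+)*")

def normalize_hyphens(text):
    """Step 2: remove the spaces on either side of every hyphen."""
    return re.sub(r" *- *", "-", text)

def word_counts(text):
    """Step 3: count how often each word occurs in the text."""
    return Counter(WORD_RE.findall(text))

def differing_words(a, b):
    """Step 9: list (word, count_a, count_b) wherever the counts differ."""
    return [(w, a[w], b[w]) for w in sorted(a.keys() | b.keys()) if a[w] != b[w]]
```

Calling `differing_words(word_counts(old_text), word_counts(normalize_hyphens(new_text)))` on the two diatheke exports would give the list of candidate words to look up in a text editor.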

Notes:

  1. The two diatheke output files can be supplied upon request should you find them useful.
  2. The analysis ignores the diatheke case artefact noted in issue #8 as regards the divineName.
  3. This analysis concentrates on the plain verse text content only. All titles, notes, references and other XML markup are ignored.
  4. This analysis excludes the content of the DC books in file STVA.xml.

DavidHaslam commented 3 years ago

The text exported from my test module DutSTV largely excludes the acrostic titles.

By contrast, in the earlier SWORD module DutSVV these acrostic titles were left as plain words embedded in the verse text rather than being marked up as OSIS title elements.

DavidHaslam commented 3 years ago

The earlier SWORD module DutSVV contains the following words with an ordinary ASCII apostrophe.

2   Elia's
18  Ezau's
55  Farao's
3   Lea's
1   Micha's
1   Mordechai's
10  Salomo's
1   Uria's
1   Zedekia's

Examining just one of these instances in STV.xml reveals a spurious space, similar to the one I reported for hyphens in issue #9.

Mordechai' <w lemma="strong:H4782">s</w>

Found in Esth.2.22

This also shows that regular expression matching was misapplied when the w elements for Strong's numbers were placed.

It casts serious doubt on the placement accuracy of the Strong's numbers throughout the file.
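
A rough way to list every place where a possessive "s" has been split off into its own w element is sketched below. The pattern is an assumption modelled only on the Esth.2.22 example above (apostrophe, whitespace, then a w element whose entire content is "s"); other malformed variants in STV.xml would need their own patterns.

```python
import re

# Name, apostrophe (ASCII or U+2019), whitespace, then <w ...>s</w>.
SPLIT_POSSESSIVE = re.compile(r"\w+['\u2019]\s+<w\b[^>]*>s</w>")

def find_split_possessives(osis_text):
    """Return every matched fragment, e.g. for review in a text editor."""
    return SPLIT_POSSESSIVE.findall(osis_text)
```

Each hit still needs checking by hand, since the Strong's number attached to the stray "s" presumably belongs to the whole name.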

DavidHaslam commented 3 years ago

IMHO, the use of the ordinary ASCII apostrophe character \x27 for possessives should be deprecated in favour of U+2019 RIGHT SINGLE QUOTATION MARK as we use in the KJV module for the typographical apostrophe.
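
As an illustration only, the substitution for possessives could look like the sketch below. It assumes it is applied to text content rather than to raw XML (where single-quoted attribute values could be affected), and it only touches an apostrophe that sits between a letter and a final "s".

```python
import re

# ASCII apostrophe preceded by a letter and followed by a word-final "s".
POSSESSIVE = re.compile(r"(?<=[^\W\d_])'(?=s\b)")

def typographic_possessives(text):
    """Replace the possessive ASCII apostrophe with U+2019."""
    return POSSESSIVE.sub("\u2019", text)
```

Other uses of the apostrophe, such as elisions at the start of a word, are deliberately left alone here and would need separate handling.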

lemtom commented 3 years ago

The latest changes should resolve the issue with the apostrophes. I have also replaced the ASCII apostrophe with the typographical apostrophe.

DavidHaslam commented 3 years ago

Here's one with a spurious space just after the apostrophe.

Matthew 13:32: Hetwelk wel het minste is onder al de zaden, maar wanneer het opgewassen is, dan is ’ t het meeste van de moeskruiden, en het wordt een boom, alzo dat de vogelen des hemels komen en nestelen in zijn takken.
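
To look for further cases like this in the exported plain text, something along these lines might help. The set of single-letter clitics (t, s, n, k) is my assumption about which short forms occur after an apostrophe in this translation, and the function name is only illustrative.

```python
import re

# Apostrophe (ASCII or U+2019), one or more spaces, then a lone clitic letter.
DETACHED_CLITIC = re.compile(r"['\u2019] +[tsnk]\b")

def detached_clitics(text):
    """Return each apostrophe-plus-space-plus-letter fragment found."""
    return DETACHED_CLITIC.findall(text)
```

Because the letter must stand alone as a word, a closing quotation mark followed by an ordinary word is not reported.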