MarjorieBurghart / VulgateGlaire

A TEI XML version of the French translation of the Vulgate (Latin Bible) by Abbé Glaire (†1879)

A counted words list to assist with proof reading #16

Open DavidHaslam opened 7 years ago

DavidHaslam commented 7 years ago

The attached file is a tab-delimited text file that counts every word found in the verse text of the VulgateGlaire.

merged.words.count.txt

The output file is automatically sorted on the word field, though the collation order used probably does not match the rules that apply to French.
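To illustrate the collation point, here is a minimal Python sketch of a sort key that ignores accents and case, which is closer to dictionary order for French than a plain code point sort. It is only an approximation of French collation, not the sort used to produce the attached file, and the name `french_like_key` is hypothetical.

```python
import unicodedata

def french_like_key(word):
    """Approximate sort key: compare base letters first, ignoring accents and case."""
    # Decompose accented letters, then drop the combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Fall back to the raw word so that ties are broken deterministically.
    return (base.casefold(), word)

words = ["zèle", "Zacharie", "être", "eau"]
print(sorted(words))                       # plain code point order
print(sorted(words, key=french_like_key))  # accent- and case-insensitive order
```

A full solution would use a proper Unicode collation library; this key is just enough to keep accented and capitalised forms next to their unaccented neighbours.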

This list is provided to assist with proofreading. Scanning a counted word list is a powerful way to detect typos and spelling mistakes.

Punctuation marks other than hyphen/minus and the right single quotation mark (used as the typographical apostrophe) were removed.
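For anyone who wants to reproduce a list of this kind, the counting step could look roughly like the Python sketch below. It is not the script that generated the attachment. It assumes the verse text is already available as plain strings, that punctuation is deleted rather than turned into a separator, and that counts are written as five zero-padded digits followed by a tab. The names `count_words` and `format_counts` are hypothetical.

```python
import re
from collections import Counter

# Anything that is not a word character, whitespace, the right single
# quotation mark (U+2019) or a hyphen-minus is treated as punctuation.
PUNCT = re.compile(r"[^\w\s\u2019-]")

def count_words(verses):
    """Count whitespace-separated tokens after deleting punctuation."""
    counts = Counter()
    for verse in verses:
        counts.update(PUNCT.sub("", verse).split())
    return counts

def format_counts(counts):
    """Return 'NNNNN<TAB>word' lines sorted on the word field (code point order)."""
    return [f"{n:05d}\t{word}" for word, n in sorted(counts.items())]

verses = [
    "Alors le Seigneur me dit : Donne-lui pour nom",
    "l’Esprit-Saint, le Seigneur.",
]
for line in format_counts(count_words(verses)):
    print(line)
```

Because the hyphen and the typographical apostrophe are kept, forms such as Donne-lui and l’Esprit-Saint are counted as single words.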

Browsing through the list, take particular note of the hapax legomena, of which there are 12564. That's a staggering 40% of the total number of listed words.

Although every Bible contains many words that occur only once, some of these instances may be errors.
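The hapax figure can be checked from the counted list itself. The sketch below assumes each line has the form count, whitespace, word, as in the attachment; the function name `hapax_share` is hypothetical.

```python
def hapax_share(lines):
    """Return (number of hapaxes, number of listed words, hapax fraction)."""
    entries = [line.strip().split(None, 1) for line in lines if line.strip()]
    total = len(entries)
    # A hapax legomenon is a listed word whose count is exactly 1.
    hapaxes = sum(1 for count, _word in entries if int(count) == 1)
    return hapaxes, total, (hapaxes / total if total else 0.0)

sample = [
    "00001\tAsason-Thamar",
    "00219\tJésus-Christ",
    "00002\tBaal-Hermon",
    "00001\tz-vous",
]
print(hapax_share(sample))
```

Run over the whole file, the first value should be the number of hapax legomena and the third their share of all listed words.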

DavidHaslam commented 7 years ago

Of course, some of the unexpected "words" will turn out to be artefacts of the use of parentheses for alternate renderings. I expect this will turn out to be the case for the last item in the list: 00001 z-vous

On the other hand, some items will turn out to be real typos.

DavidHaslam commented 7 years ago

FYI. The next file is just a character frequency count of the counted words list.

merged.words.count.character.frequency.txt
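A character frequency count of this kind takes only a few lines of Python. The sketch below assumes that only the word column is counted, not the numeric count column, which may differ from how the attached file was made. The name `char_frequency` is hypothetical.

```python
from collections import Counter

def char_frequency(lines):
    """Count each character that appears in the word column of the list."""
    freq = Counter()
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) == 2:          # skip blank or malformed lines
            freq.update(parts[1])    # a string updates the counter per character
    return freq

sample = ["00001\tBaal-Salisa", "00002\tBaal-Hermon"]
for char, n in sorted(char_frequency(sample).items()):
    print(f"{n:05d}\t{char}")
```

Rare characters at the bottom of such a table are good candidates for a closer look, since a stray symbol often marks a typo.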

DavidHaslam commented 7 years ago

FYI. Here are the 52 search results for the regexp [A-Z]\w+(-[A-Z]\w+)+ from the counted words list.

00001   Asason-Thamar
00001   Assieds-Toi
00001   Astaroth-Carnaïm
00002   Ataroth-Addar
00001   Azanoth-Thabor
00001   Baalath-Béer
00002   Baal-Hermon
00002   Baal-Pharasim
00001   Baal-Salisa
00002   Beth-Araba
00003   Beth-Hagla
00002   Beth-Maacha
00003   Cariath-Arbé
00001   Cariath-Baal
00004   Cariath-Sépher
00001   Carioth-Hesron
00001   d’Asor-Haddan
00002   Esprit-Saint
00001   Es-Tu
00002   Etes-Vous
00001   Evil-Mérodach
00001   Grande-Ourse
00001   Hammoth-Dor
00002   Havoth-Jaïr
00001   Homme-Beau
00219   Jésus-Christ
00007   Jabès-Galaad
00015   Jean-Baptiste
00079   l’Esprit-Saint
00002   L’Esprit-Saint
00001   Lésem-Dan
00007   Marie-Madeleine
00001   Néphat-Dor
00001   Nathan-Mélech
00030   Notre-Seigneur
00006   Phahath-Moab
00001   Rabbath-Ammon
00001   Ramathaïm-Sophim
00016   Ramoth-Galaad
00018   Saint-Esprit
00001   Savé-Cariathaïm
00018   Simon-Pierre
00001   Sochoth-Bénoth
00002   Suis-Moi
00001   Suivez-Moi
00003   Théglath-Phalasar
00001   Thamnath-Saraa
00003   Thelgath-Phalnasar
00001   Tob-Adonias
00045   Tout-Puissant
00001   Très-Fort
00091   Très-Haut

Most of these are hyphenated proper names.
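The same search can be repeated with Python's `re` module, as in the sketch below. The pattern is the one quoted above; the function name `hyphenated_capitalised` is hypothetical. Since the search is not anchored, entries with an elided article such as d’Asor-Haddan are included, and because `[A-Z]` is ASCII only, a name starting with an accented capital would be missed.

```python
import re

# Two or more capitalised parts joined by hyphens, e.g. Ramoth-Galaad.
PATTERN = re.compile(r"[A-Z]\w+(-[A-Z]\w+)+")

def hyphenated_capitalised(lines):
    """Return the lines of the counted list that contain a match."""
    return [line for line in lines if PATTERN.search(line)]

sample = [
    "00219\tJésus-Christ",
    "00001\td’Asor-Haddan",
    "00001\tz-vous",
    "00001\tDonne-lui",
]
for line in hyphenated_capitalised(sample):
    print(line)
```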

DavidHaslam commented 7 years ago

Some hyphenated proper names have been translated rather than transliterated from the Hebrew.

The notable one with 3 hyphens in Isaiah 8:3 is one such example: \v 3 et je m’approchai de la prophétesse, et elle conçut et enfanta un fils. Alors le Seigneur me dit : Donne-lui pour nom : Hâtez-vous (Hâte-toi) de saisir les dépouilles, pille(z) promptement ;

Cf. the many Bibles that have Maher-shalal-hash-baz here, with the meaning of the name given in a footnote.

DavidHaslam commented 7 years ago

Aside: Producing the counted words list was much simpler now that issue #4 has been fixed, because I did not have to treat \x27 as a special case.