bdenckla / MAM-for-Sefaria

Miqra According to the Masorah for Sefaria
Other

Want to use your text as codex option for TorahBibleCodes software #1

Closed TorahBibleCodes closed 4 months ago

TorahBibleCodes commented 6 months ago

Want to use your text as codex option for TorahBibleCodes software

Do you have plain text available without diacritic niqqud?

Niqqud is OK if plain text is not available.

We want to incorporate this codex into our free open source program for Hebrew Bible Research:

Github.com/TorahBibleCodes/TorahBibleCodes

bdenckla commented 6 months ago

There is currently no edition (version) (export) of MAM without niqqud.

Can you describe to me the type of format you want? I might be able to easily create an export with that format.

I'm guessing you want a very plain format. Some questions:

I'm guessing you want only ketiv (no qere)?

I'm guessing you want only Torah?

I'm guessing you want no representation of "paragraph" separations (setumah/petuḥah)?

I'm guessing you want no notes?

I'm guessing you want no encoding of special letters (small/large/hung)?

In general I'm guessing you want only the letters א thru ת and space? (E.g. not even sof pasuq.)

TorahBibleCodes commented 6 months ago

בע״ה

Our program can parse out all the niqqud.

We want entire Tanach.

Let's see what you have, and we can deal with it like we did with Sefaria's Leningrad Codex and the Michigan-Claremont transliteration of the Koren (since Koren won't cooperate to share their actual digital Hebrew codex with anyone).

Some of the things you mentioned would be nice to have, so please tell us how to obtain them.

GitHub.com/TorahBibleCodes

Thanks.

bdenckla commented 6 months ago

MAM is available in various formats; this GitHub repository happens to have the Sefaria format, which is CSV at the top level and contains HTML in the main cell of each row. See the out/csv directory of this repo.

For many more details of that HTML, see the GitHub Pages site associated with this repo.

Other notable formats are MAM-XML and MAM-parsed; links are provided in this repo's README.

bdenckla commented 6 months ago

Tell me more about the Michigan-Claremont (M-C) transliteration of the Koren, BTW. I never heard of that. My first reaction is that it poses some interesting copyright issues. For instance, I've always guessed that M-C transcription/encoding of BHS (now called WLC but called different things in the past) was done with permission of the German Bible Society, but that's just a guess.

TorahBibleCodes commented 6 months ago

I read that your MAM version is freely licensed. Correct? Is this the Sefaria version you point me to?

TorahBibleCodes commented 6 months ago

Here is Michigan Claremont for Koren: https://users.cecs.anu.edu.au/~bdm/dilugim/StatSci/data.html

The BHS is Leningrad = WLC

TorahBibleCodes commented 6 months ago

I read that your text is not yet finalized and changes are expected. Is this true? I am looking for a version that is ready and error free.

bdenckla commented 6 months ago

I read that your MAM version is freely licensed. Correct? Is this the Sefaria version you point me to?

MAM is licensed free in the sense of "free of charge": you do not have to pay to use MAM. But there are some conditions of its license, the CC-BY-SA license. The two most important such conditions are the "BY" (Attribution) part and the "SA" (ShareAlike) part.

All versions of MAM, because of the "SA" (ShareAlike) part, "inherit" that license. So yes, the Sefaria version in this GitHub repository has the CC-BY-SA license.

bdenckla commented 6 months ago

I read that your text is not yet finalized and changes are expected. Is this true? I am looking for a version that is ready and error free.

We have no plans to finalize or "finish" MAM. The text is always being improved and corrected, and we hope it always will be, even beyond our stewardship of it. In this respect it is like WLC, whose first version came out, I think, in 1987 but which continues to be revised to this day (2024).

We view this continuous process as a feature, not a bug.

All that having been said, MAM is ready to be used, and is being used. By Sefaria, by JPS, by AlHaTorah, and others. In software terms, MAM is not "in beta." I.e. it is not in the beta phase of development. It is officially released. But unlike most paper books, which never have a revised edition, it is being revised constantly, like most software.

As to "error free"... I would never make that claim for it. That having been said, with regard to the ketiv letter-text, which is I think your only concern, changes are very rare and will only get rarer. It is quite possible that there will never be another change to the letter-text.

Though unlikely, if a letter-text change were to happen, my guess is that the most likely form it would take would be where we or future stewards (maintainers) decide to combine two words together or split two words apart. I.e. what is today two ketiv words (maybe separated by a maqaf in the qere, maybe separated by just a space) might become one word, i.e. might be changed to have no separation at all. Or vice versa: what is today one ketiv word might become two ketiv words, separated in the qere by either a maqaf or just a space.

TorahBibleCodes commented 5 months ago

Hi, I looked at the text, but the HTML parsing is unnecessarily complex.

Do you have just the Hebrew text, with no HTML?

JSON is OK; CSV is OK.

bdenckla commented 5 months ago

See if the CSV-AJF format suits your needs better.

TorahBibleCodes commented 4 months ago

Hi, Hag Pesach Kasher ve-Sameach.

Sorry for the late reply.

I have found something interesting in the text that you provide.

Your MAM has a Numbers 25:19, which is the exact same mistake found in the Claremont Michigan Transliteration of the Koren codex.

I published findings with data in the following paper and data repository that prove that the Claremont Michigan Transliteration of the Koren is flawed with its Numbers 25:19 verse division.

I confirmed with my own copies of Koren bibles, that there is no Numbers 25:19, yet I see it in the CMT as well as in your MAM - so I wonder if there is a common source to these two codices?

Therefore, I wonder if your MAM takes its Torah from the same Claremont Michigan Transliteration of the Koren Codex. And I wonder whether the NACH section of your MAM is also taken from the CMT of the entire Koren Codex.

https://www.academia.edu/104334275

https://github.com/TorahBibleCodes/ResearchData_KorenCodexVsLeningradCodex

YOUR MAM-AJF: https://github.com/bdenckla/MAM-for-Sefaria/blob/main/out/csv-ajf/Numbers.csv

bdenckla commented 4 months ago

There is no mistake in any edition of MAM regarding where Numbers 25 ends (and where Numbers 26 starts). It is just one of those unfortunate places where there are different traditions as to where to place the verse numbers.

The AJF edition of MAM uses one tradition, let's call it the BHS tradition, for where it places the boundary between Numbers chapters 25 and 26. Other editions of MAM use another tradition.

I suppose the פסקא באמצע פסוק is what gives rise to the tradition where Numbers 25 ends in the middle of a chanted verse. I.e. I suppose the פסקא באמצע פסוק is what gives rise to the tradition where Numbers 25 ends at an atnaḥ rather than at a silluq/sof pasuq.

If the Michigan-Claremont transcription of the Koren diverges from the Koren, and does not document its divergence, then yes, that is an error in that transcription. MAM has nothing to do with that transcription.

TorahBibleCodes commented 4 months ago

Hi,

You misunderstood.

The Koren and Leningrad (BHS) both do not have Numbers 25:19 - they have Numbers 25:18 followed by Numbers 26:1.

However, the CMT does make the mistake of putting the SOF PSUK after the first few words of 25:18 and makes a new verse: 25:19.

However, the actual KOREN does not have this error or verse 25:19.

So if your text has this same error, it leads me to reason that your MAM/AJF is from the flawed CMT.

Both actual Hebrew Koren and Leningrad (BHS) do not have this error.

After I get your text parsed and integrated into the TorahBibleCodes program as a choice of codices, I will be able to confirm the precise letter comparison of MAM vs. the Koren vs. Leningrad to see how it compares, and if indeed it is the same text as the Koren which also claims to be from the Masorti Tradition - yet Koren was only published in the 1950s, so I wonder what is the source text of the Koren: could it be the Aleppo Codex?

bdenckla commented 4 months ago

The Koren and Leningrad (BHS) both do not have Numbers 25:19 - they have Numbers 25:18 followed by Numbers 26:1.

BHS absolutely has a few words that are identified as Numbers 25:19. I can post a picture in here if you like.

I don't know what you mean by referring to Leningrad in this context since of course the manuscript has no verse numbers.

And just to be clear, we are not talking about any editions missing (or adding) any CONTENT (words) here, we are just talking about where chapter and verse numbers are placed within the same words.

bdenckla commented 4 months ago

So if your text has this same error, it leads me to reason that your MAM/AJF is from the flawed CMT.

I see no error in any edition of MAM regarding this issue of the division between chapters 25 and 26 of Numbers. What do you think the error is (or might be)?

I can say with confidence that MAM is unrelated to the Michigan-Claremont transcription of the Koren.

MAM is distantly related to the famous Michigan-Claremont transcription of the BHS body text. (That transcription is now known as WLC but it has been called other things in its long past. Its first version came out in 1987 I believe.) Although MAM is only distantly related to WLC, we do still occasionally find errors in MAM inherited from WLC. (Or, sometimes we find things that are not exactly errors, but they are things that are not appropriate to MAM.)

So, MAM does inherit stuff from WLC. But there's no way MAM can inherit an error (or anything else) from the Michigan-Claremont transcription of the Koren because MAM is unrelated to that text.

TorahBibleCodes commented 4 months ago

For our TorahBibleCodes bible search software, we offer a choice of codices:

  1. The Leningrad Codex (BHS) from Sefaria of the entire Tanach, i.e. the WLC.
  2. (Since Koren does not agree to share their text for research) - The Claremont Michigan Transliteration of the Koren Torah (only) text that was used by Witztum, Rips, and Rosenberg (WRR) as well as their critics McKay et al. to run their Torah Codes experiments.

We would like to find the original Koren and/or other Masorti/Traditional Codex(ices) to offer the users the choice of Bible Codices to conduct their research.

Your MAM is suitable, and from the evidence, it seems that the MAM Torah is either the same as Koren (CMT), or taken from the same source that divided verses in exactly this way. I am curious to know the origin of the error, curious to know if your MAM is the same as Koren CMT as well as Koren (actual Hebrew) in letters and verse count.

I am working on parsing your MAM-AJF files, and using Numbers as my prototype to integrate your codex texts into the program.

TorahBibleCodes commented 4 months ago

Do you have another format perhaps of only the text letters, i.e. without the diacritic marks of NIQUD pronunciation marks?

TorahBibleCodes commented 4 months ago

It sounds from your previous answer above that you reckon that maybe your MAM is indeed from the WLC (BHS) Leningrad codex.

In the link above to the research data of Koren (CMT) vs. Leningrad (WLC / BHS), I publish the exact letter differences between the two codices of Leningrad vs. Koren (CMT).

Leningrad does not have this error of Numbers 25:19 that is within the Koren (CMT) so therefore the MAM code would suggest that your MAM is close to the Koren (CMT) codex - i.e. not Leningrad, but rather Koren.

I am wondering about the research question if the Koren / MAM (Masorti/Traditional) version of the Masoretic Text that is used by Jews is indeed a copy of the LOST ALEPPO CODEX... It was lost at about the same time that Koren first published (1950s).

The Leningrad has some differences in spelling from the Koren, but they are very minor, involving only matres lectionis, which shows how faithful ancient Jewish scribes were in their copies: Leningrad vs. Koren differs mostly in YUDs and VAVs, plus a handful of ALEFs and HEYs.

TorahBibleCodes commented 4 months ago

If you open your personal copy of the Koren Tanach, and open to Numbers 25, you will see that it ends at 25:18, and no 25:19 exists in the actual physical book.

It also does not exist in the Leningrad codex.

This proves to me that the Numbers 25:19 in Koren (CMT) is an error since otherwise the Koren vs. Leningrad agree in exact number of verses for the Torah.

bdenckla commented 4 months ago
  1. The Leningrad Codex (BHS) from Sefaria of the entire Tanach, i.e. the WLC.

Be careful, most editions derived closely from WLC inherit from WLC two verses in Joshua that are not present in the Leningrad Codex or in any of the codices of its (Tiberian Masoretic) tradition. So if you want your text to be as close to Leningrad as possible, take out those two verses.

TorahBibleCodes commented 4 months ago

While reading the CSV files for this MAM-AJF version, it is importing a LIST of LISTs.

i.e. each row of cells in the CSV is a LIST of 4 list elements (columns from the CSV file).

The CSV file contains first column of Book/Chapter/Verse, second column of string of text, and third and fourth columns that are blank/empty - so I have to parse these out by looping.

Why the empty columns in the data file?

It would be less parsing work if the file just contained the two columns of data.

And perhaps you would be interested in adopting the mathematical ID system to identify Bible Book, Chapter, Verse?

3-INTEGER TUPLE for BOOK, CHAPTER, VERSE:

(4, 25, 18) = Numbers 25:18
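
As a rough illustration of the tuple scheme above, the sketch below converts a reference string such as "Numbers 25:18" into a 3-integer tuple. The names `TORAH_BOOK_NUMBERS` and `to_tuple` are hypothetical, and only the five Torah books are listed, because the thread confirms only that Numbers is book 4.

```python
# Hypothetical lookup table; only Numbers = 4 is confirmed by the example above.
TORAH_BOOK_NUMBERS = {
    "Genesis": 1, "Exodus": 2, "Leviticus": 3, "Numbers": 4, "Deuteronomy": 5,
}

def to_tuple(reference: str) -> tuple[int, int, int]:
    """Convert e.g. 'Numbers 25:18' into (4, 25, 18)."""
    # Split off the trailing 'chapter:verse' part; the rest is the book name.
    book, chapter_verse = reference.rsplit(" ", 1)
    chapter, verse = chapter_verse.split(":")
    return (TORAH_BOOK_NUMBERS[book], int(chapter), int(verse))

print(to_tuple("Numbers 25:18"))  # (4, 25, 18)
```

A conversion like this works in either direction, so a file keyed by "Book Chapter:Verse" strings can be re-keyed by tuples without changing the file itself.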

TorahBibleCodes commented 4 months ago

Which verses in Joshua?

We use the Leningrad Codex (BHS) provided by Sefaria.

bdenckla commented 4 months ago

If you open your personal copy of the Koren Tanach, and open to Numbers 25, you will see that it ends at 25:18, and no 25:19 exists in the actual physical book.

It also does not exist in the Leningrad codex.

What does not exist in the Leningrad Codex? Surely you don't mean verse numbers since it has no verse numbers. What words are you claiming are different, in any of these editions? I haven't looked at the M-C transcription of Koren, but as far as I know there is no difference between any editions of the words here. There is just a difference in where verse and chapter numbers are placed.

This proves to me that the Numbers 25:19 in Koren (CMT) is an error since otherwise the Koren vs. Leningrad agree in exact number of verses for the Torah.

I don't think that is a reliable check. Differences in verse numbering can easily cancel out. Again, there is no error here, just a difference in verse numbering. I suppose if the M-C transcription of Koren claimed to have Koren verse numbering, then that is an error. Or if it has a sof pasuq, as I think you mentioned, that is an error.
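
To make the "cancel out" point concrete, here is a toy sketch with made-up tokens (`w1` to `w5` are placeholders, not real text). Two hypothetical numbering schemes divide the same five words into the same number of verses, yet place the verse boundaries differently, so an equal verse count alone does not show that two editions agree on verse division.

```python
# Toy data: the same five words divided into verses in two hypothetical ways.
scheme_a = [["w1", "w2"], ["w3"], ["w4", "w5"]]
scheme_b = [["w1"], ["w2", "w3"], ["w4", "w5"]]

def boundaries(verses):
    """Return the set of word offsets at which a verse ends."""
    ends, offset = set(), 0
    for verse in verses:
        offset += len(verse)
        ends.add(offset)
    return ends

same_count = len(scheme_a) == len(scheme_b)
same_boundaries = boundaries(scheme_a) == boundaries(scheme_b)
print(same_count, same_boundaries)  # True False
```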

TorahBibleCodes commented 4 months ago

By extrapolating this mathematical ID system to letters and words, it is possible to give each letter and word in the TANACH a unique ID, and this makes possible exact letter-to-letter comparisons between codices, e.g. Leningrad vs. Koren (CMT):

This mathematical ID system as a standard way to enable scientific biblical research is introduced here: https://www.academia.edu/104334275

TorahBibleCodes commented 4 months ago

Numbers 25:19 does not exist in the Leningrad or the Koren (actual Hebrew) - but it does exist in the Koren (CMT).

Please download the Koren and Leningrad files for Numbers here, and you can see the letter by letter comparison that our program enables - here are the exact differences for all letters in the Book of Numbers (and entire Torah):

https://github.com/TorahBibleCodes/ResearchData_KorenCodexVsLeningradCodex

TorahBibleCodes commented 4 months ago

I haven't yet found a difference between the words, but I do see minor differences in spelling only of matres lectionis that happened no doubt in copying.

I just go by the math (please see the data in the appendix in the paper) that shows there are differences in the number of letters and words between the two codices.

The difference of only one (1) verse is explained by this error found in the Koren (CMT) of Numbers 25:19 which does not exist in the Koren (actual Hebrew book) or the Leningrad (BHS / WLC - provided by Sefaria).

bdenckla commented 4 months ago

Numbers 25:19 does not exist in the Leningrad or the Koren (actual Hebrew) - but it does exist in the Koren (CMT).

What do you mean by Numbers 25:19? The number "19" or the words that the M-C transcription of Koren has for that verse?

I can't speak for the M-C transcription of Koren, but the words of BHS Numbers 25:19 certainly are present in all reliable Masoretic editions. The only thing that varies is that those words MAY OR MAY NOT BE LABELLED with that verse number!
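
The distinction between words and labels can be sketched as follows. The letter text is the unpointed text quoted elsewhere in this thread; the dictionary names and the `running_text` helper are hypothetical. One numbering labels the first three words as 25:19, the other folds them into 26:1, and the running text comes out the same either way.

```python
FIRST_WORDS = "ויהי אחרי המגפה"
REST = "ויאמר יהוה אל משה ואל אלעזר בן אהרן הכהן לאמר"

# Keys are (book, chapter, verse); only the labels differ between the two dicts.
with_25_19 = {(4, 25, 19): FIRST_WORDS, (4, 26, 1): REST}
without_25_19 = {(4, 26, 1): FIRST_WORDS + " " + REST}

def running_text(verses):
    """Join verse texts in key order, ignoring where the verse labels fall."""
    return " ".join(verses[key] for key in sorted(verses))

same_words = running_text(with_25_19) == running_text(without_25_19)
print(len(with_25_19), len(without_25_19), same_words)  # 2 1 True
```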

TorahBibleCodes commented 4 months ago

Please look at this file which contains all verses that contain a difference of even one letter between the Koren (CMT) and the Leningrad (WLC / BHS).

I did this for each book in the Torah, so in that folder are the exact differences between these two codices. As soon as I get your MAM-AJF version parsed and integrated into the program, I will run the script again that compares each codex letter by letter and provides a set of verses that are different between the two - otherwise, they are exactly the same letter by letter.

https://github.com/TorahBibleCodes/ResearchData_KorenCodexVsLeningradCodex/blob/main/USER_FILE_Analysis4_Numbers_KorenVsLeningrad.csv

Leningrad Codex;;There is no verse (4, 25, 19) in the Leningrad Codex.;1492;(4, 25, 19) (4, 25, 19);Koren Codex;13;ויהי אחרי המגפה;1495;(4, 25, 19) (4, 26, 1);Leningrad Codex;49;ויהי אחרי המגפה ויאמר יהוה אל משה ואל אלעזר בן אהרן הכהן לאמר;1541;ויהי אחרי המגפה (4, 26, 1);Koren Codex;36;ויאמר יהוה אל משה ואל אלעזר בן אהרן הכהן לאמר;1531;ויהי אחרי המגפה

TorahBibleCodes commented 4 months ago

I just examined (and compared) a few verses of your file for Numbers with my file of the Koren vs. Leningrad Codices in the link above, and it is evident that for at least 3 verses, your MAM-AJF version is identical with the Koren (CMT) in both letters and verses.

The Koren (CMT) is likely exactly correct in the letters; it is only the strange addition of Numbers 25:19 which appears as Numbers 26:1 in the Leningrad (WLC/BHS) as well as in the Koren (actual Hebrew book).

The working theory is that your MAM-AJF (hopefully for the entire Tanach beyond just Torah) is exactly the same as Koren Codex which claims to be the Masorti Traditional version of what Jews have used for centuries.

I know that the Aleppo Codex was the standard for Rambam and perhaps Jews of the entire region here in Eastern Hemisphere Israel, Africa, Middle-East, Europe...

I know that the Aleppo Codex was lost after the State of Israel was established, and then it resurfaced with the Torah section missing.

I know that Koren's first publication from Israel as the first edition published in Eretz Israel was in the 1950s.

...so I wonder if the Koren edition is a copy of the Aleppo Codex... ?

TorahBibleCodes commented 4 months ago

It would be great if you could adopt and use the proposed, standard mathematical ID system for each data object in your repos and how you think of Letter Objects mainly, but also basic data objects of lists and dictionaries, unique mathematical ID keys, etc.:

  1. (Book, Chapter, Verse), e.g. (4, 25, 19) = Numbers 25:19; (4, 26, 1) = Numbers 26:1
  2. (Book, Chapter, Verse, LetterInVerse, LetterInText)
  3. (Book, Chapter, Verse, WordInVerse, WordInText)

These standard mathematical ID keys give each book, chapter, verse, letter, and word a unique ID that is also human friendly to easily scan the first 3 integers of the tuple(s); this allows exact letter-by-letter (and word-by-word) comparison between any codex(ices) that use this mathematical key to ID each letter/word/verse/chapter/book.
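
A minimal sketch of how keys of form 2 might be generated is shown below. It assumes that both letter counters are 1-based and that spaces are not counted; neither detail is stated above. The function name `letter_ids` is hypothetical, and "letter in text" here counts only within the two sample verses, not within a whole book.

```python
def letter_ids(verses):
    """Yield (book, chapter, verse, letter_in_verse, letter_in_text, letter).

    Both counters are 1-based (an assumption); spaces are skipped.
    """
    letter_in_text = 0
    for (book, chapter, verse), text in verses:
        letter_in_verse = 0
        for char in text:
            if char == " ":
                continue
            letter_in_verse += 1
            letter_in_text += 1
            yield (book, chapter, verse, letter_in_verse, letter_in_text, char)

# Unpointed sample text as quoted elsewhere in this thread.
sample = [
    ((4, 25, 19), "ויהי אחרי המגפה"),
    ((4, 26, 1), "ויאמר יהוה אל משה ואל אלעזר בן אהרן הכהן לאמר"),
]
ids = list(letter_ids(sample))
print(ids[0][:5])   # (4, 25, 19, 1, 1)
print(ids[-1][:5])  # (4, 26, 1, 36, 49)
```

Note that the per-verse counter depends on where verse labels are placed, while the running counter does not, which is why the same letter can receive different keys under different verse-numbering traditions.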

bdenckla commented 4 months ago

It would be great if you could adopt and use the proposed, standard mathematical ID system for each data object in your repos and how you think of Letter Objects mainly, but also basic data objects of lists and dictionaries, unique mathematical ID keys, etc.:

  1. (Book, Chapter, Verse), e.g. (4, 25, 19) = Numbers 25:19; (4, 26, 1) = Numbers 26:1

All our editions have some notion of book, chapter, and verse and all these notions are trivially convertible amongst each other, modulo the differences due to verse numbering traditions, a problem which your scheme does not address anyway.

So, sorry to say, we have no interest in switching existing editions to your proposal or creating one or more new editions using your proposal.

  1. (Book, Chapter, Verse, LetterInVerse, LetterInText)

Your interest in letter-level structure is specific to your application and applications like it and thus, due to lack of generality, we have no interest in supporting it.

  1. (Book, Chapter, Verse, WordInVerse, WordInText)

We have experimented with word-level representations and indeed use them to some extent, informally, since many MAM templates apply only to words or phrases. And indeed, we use word-level representations to a complete extent in some of our applications based on MAM. (We use them "on the fly" (in memory) not recorded in files.)

But "word," as well-defined as it may seem, turns out to be a somewhat application-specific and complex notion. So, we have no interest in making an edition available that provides word segmentation. My current work on phonetic transcription may, in the future, provide some word-segmented output, with a definition of "word" suitable to that limited application. But it would not be a full representation of MAM's contents, i.e. it would be far from a full edition of MAM. (For example, it would have no ketiv words, as they are irrelevant to pronunciation!)

bdenckla commented 4 months ago

I just examined (and compared) a few verses of your file for Numbers with my file of the Koren vs. Leningrad Codices in the link above, and it is evident that for at least 3 verses, your MAM-AJF version is identical with the Koren (CMT) in both letters and verses.

Sure, I would expect pretty good agreement between the letter-text of any two editions of the Torah, except for possibly in the handful-or-so of known differences between scroll (letter-text) traditions. And except for the few known places of different verse numbering schemes, I would also expect agreement. Off the top of my head, in addition to the Numbers 25/26 numbering difference, editions vary widely in their verse-numbering of the Decalogues.

The Koren (CMT) is likely exactly correct in the letters; it is only the strange addition of Numbers 25:19 which appears as Numbers 26:1 in the Leningrad (WLC/BHS) as well as in the Koren (actual Hebrew book).

Again, you keep referring to this ambiguously as if maybe a verse has been added. No verse has been added. Nothing is strange here. Simply a label (a verse number) has been added, according to a well-known verse-numbering scheme (let's call it the BHS scheme, although likely it predates BHS by many decades or even centuries).

The working theory is that your MAM-AJF (hopefully for the entire Tanach beyond just Torah) is exactly the same as Koren Codex which claims to be the Masorti Traditional version of what Jews have used for centuries.

Beyond Torah I think you will find some differences in the letter-text.

Also, be aware that you are using "Masorti" in a slightly strange-sounding way to an English speaker (or at least to this English speaker, i.e. to me). I think you mean a closely-related word (but with different connotations): "Masoretic". The word "Masorti" is mainly used in English to refer to a specific denomination of Judaism, mainly restricted to North America, also known as the Conservative denomination or Conservative movement.

...so I wonder if the Koren edition is a copy of the Aleppo Codex... ?

Modern versions of Koren may be influenced in some places by the scholarship which resulted from studying the Aleppo Codex. But Koren does not purport to stick closely to the Aleppo Codex in its letters or pointing. MAM does purport to stick closely to AC in its letters and pointing, where AC is extant, and attempts to reconstruct AC where AC is missing. There are well-documented exceptions to MAM's adherence to AC, like MAM's use of sheva instead of a ḥataf vowel in certain places. But I think you are only concerned with the letter text so that particular exception would likely be of little or no concern to you.

bdenckla commented 4 months ago

Please look at this file which contains all verses that contain a difference of even one letter between the Koren (CMT) and the Leningrad (WLC / BHS).

I don't find that file easy to understand immediately without some tutorial guidance.

I would urge you to stop referring to your WLC text as "Leningrad"; it causes you to say things like "there is no Numbers 25:19 in Leningrad". Saying things like that is either meaningless or misleading. (It is meaningless to claim anything about verse numbers in Leningrad, since it doesn't have verse numbers; it is misleading to claim that verse contents are missing, since no contents are missing.)

Also "Koren Codex" is a weird term. "Codex" is not usually used to describe modern printed editions.

bdenckla commented 4 months ago

We use the Leningrad Codex (BHS) provided by Sefaria.

As I assume your work aims to be precise about the exact contents of these files, you should be very clear about where these files came from and what they contain. As such, again I urge you to certainly not call this "Leningrad" and probably not BHS. It is good to mention that you got it from Sefaria; I suspect Sefaria refers to it as "the tanach.us text" or similar so that would be one good way to refer to it ("the tanach.us text from Sefaria" or similar). You could also refer to it as WLC. And somewhere you should document a DATE associated with it, as tanach.us has evolved rapidly beyond the version Sefaria has. The version Sefaria has is frozen in time. WLC evolves, too.

Which verses in Joshua [are additions]?

You should study all available documentation about the texts you use, including the following:

https://tanach.us/Pages/Changes.html

Though that mostly documents changes that happened AFTER the Sefaria version was frozen, it contains tidbits which should be of great interest to you such as this one:

The UXLC strives to replicate the text of the Leningrad Codex rather than providing a eclectic edition derived from multiple texts. Previous editions of the UXLC (and the WLC before it) have contained two verses in Joshua, Joshua 21:36 - 37, which are not in the Leningrad Codex. The two verses are related to, but not equal to, 1 Chronicles 6:63 - 64; their origin is unknown. Ben Denckla and Seth (Avi) Kadish have pointed out this discrepancy to the publisher. The verses remain in the UXLC to preserve verse numbering and to be compatible with other texts. However, they are now marked with a new transcription note "X" and the text color has been set to gray to alert the reader.

Also in the header of the main file of WLC 4.22 (which you can get from the Groves Center if you ask):

NOTE: This file includes Joshua 21:36-37, just as previous versions have always done. Those two verses are not found in the Leningrad Codex (or in the Aleppo Codex or in most early codices) but are found in later manuscripts and printed editions of the Hebrew Bible.
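
For anyone who wants a text without those two verses, a filter along the following lines might do. It is only a sketch: the dictionary layout, the string book key, and the name `drop_added_verses` are all hypothetical, and the sample values are placeholders rather than real verse text.

```python
# The two verses that the quoted notes say are absent from the Leningrad Codex.
NOT_IN_LENINGRAD = {("Joshua", 21, 36), ("Joshua", 21, 37)}

def drop_added_verses(verses):
    """Return a copy of a {(book, chapter, verse): text} dict without them."""
    return {key: text for key, text in verses.items()
            if key not in NOT_IN_LENINGRAD}

# Placeholder texts only; real verse contents are not reproduced here.
sample = {
    ("Joshua", 21, 35): "<text of 21:35>",
    ("Joshua", 21, 36): "<text of 21:36>",
    ("Joshua", 21, 37): "<text of 21:37>",
    ("Joshua", 21, 38): "<text of 21:38>",
}
print(sorted(drop_added_verses(sample)))
# [('Joshua', 21, 35), ('Joshua', 21, 38)]
```

The remaining verses keep their original numbers, so 21:38 still follows 21:35 directly; that preserves compatibility with other texts at the cost of a gap in the numbering.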

bdenckla commented 4 months ago

i.e. each row of cells in the CSV is a LIST of 4 list elements (columns from the CSV file).

The CSV file contains first column of Book/Chapter/Verse, second column of string of text, and third and fourth columns that are blank/empty - so I have to parse these out by looping.

Why the empty columns in the data file?

It would be less parsing work if the file just contained the two columns of data.

For future questions/suggestions, please open up separate GitHub issues for separate questions/suggestions about MAM-for-Sefaria. This thread is out of control, so unwieldy that it is becoming unusable.

Yet, I'll make the situation worse by answering this question here.

The third and fourth columns contain the separate cantillations for those few verses (all in Torah) that contain two cantillations. E.g. Exodus 20:2

Exodus 20:2,אָֽנֹכִ֖י֙ יְהֹוָ֣ה אֱלֹהֶ֑֔יךָ אֲשֶׁ֧ר הוֹצֵאתִ֛יךָ מֵאֶ֥רֶץ מִצְרַ֖יִם מִבֵּ֣֥ית עֲבָדִ֑͏ֽים׃,אָֽנֹכִי֙ יְהֹוָ֣ה אֱלֹהֶ֔יךָ אֲשֶׁ֧ר הוֹצֵאתִ֛יךָ מֵאֶ֥רֶץ מִצְרַ֖יִם מִבֵּ֣ית עֲבָדִ֑ים,אָֽנֹכִ֖י יְהֹוָ֣ה אֱלֹהֶ֑יךָ אֲשֶׁ֧ר הוֹצֵאתִ֛יךָ מֵאֶ֥רֶץ מִצְרַ֖יִם מִבֵּ֥ית עֲבָדִֽים׃

This should present no problem to any CSV-reading library. Or are you trying to parse the CSV "manually," without help of a library?
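
As a sketch of what reading such a file with a library could look like, the following uses Python's standard `csv` module. The sample data is a simplified stand-in (unpointed text and placeholder cells), the absence of a header row is an assumption, and `read_rows` is a hypothetical helper name.

```python
import csv
import io

# Simplified stand-in for a csv-ajf file: unpointed text and placeholder cells.
SAMPLE = (
    "Numbers 25:19,ויהי אחרי המגפה,,\n"
    "Exodus 20:2,<combined>,<cantillation A>,<cantillation B>\n"
)

def read_rows(handle):
    """Yield (reference, text, extras) with empty trailing cells dropped."""
    for row in csv.reader(handle):
        reference, text, *rest = row
        extras = [cell for cell in rest if cell]
        yield reference, text, extras

rows = list(read_rows(io.StringIO(SAMPLE)))
for reference, text, extras in rows:
    print(reference, len(extras))
# Numbers 25:19 0
# Exodus 20:2 2
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by something like `open(path, newline="", encoding="utf-8")`.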

bdenckla commented 4 months ago

Do you have another format perhaps of only the text letters, i.e. without the diacritic marks of NIQUD pronunciation marks?

We have no such format, and no plans to introduce such a format.

It is trivial, in any language supporting regular expressions, for the user of any MAM format to create their own such format. For example in Python:

import re

def letters_and_maqafs(string: str) -> str:
    """ Return only the letters and maqaf marks in the given string """
    # I.e. strip out any vowel points, accents, etc.
    # The range א-ת (U+05D0 to U+05EA) includes the final letter forms.
    pattern = r'[^א-ת־]+'
    return re.sub(pattern, '', string)

def letters(string: str) -> str:
    """ Return only the letters in the given string """
    # I.e. strip out any vowel points, accents, maqaf marks, etc.
    # Note that spaces are stripped too, since they are not letters.
    pattern = r'[^א-ת]+'
    return re.sub(pattern, '', string)

bdenckla commented 4 months ago

It sounds from your previous answer above that you reckon that maybe your MAM is indeed from the WLC (BHS) Leningrad codex.

Yes, MAM is derived from WLC, but I would guess that it has hundreds of thousands of differences from WLC. As I said, it is only distantly related to WLC. A distant cousin, not a child. As far as letter-text differences from the ketiv text of WLC, I would guess MAM has several orders of magnitude fewer. Like in the 100 to 1000 range.

bdenckla commented 4 months ago

the CMT does make the mistake of putting the SOF PSUK after the first few words of 25:18 and makes a new verse: 25:19.

I have no idea what you're talking about. I downloaded numbers.koren.gz from the overall page you provided a link to.

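Overcoming its weirdly-backwards numbers, I find that:

  • Numbers 25:18, 25:19, and 26:1 have the expected contents (at least roughly) for an edition with BHS-style verse numbers.
  • No Michigan-Claremont codes for sof pasuq (00) are used in the entire file for Numbers.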

The four lines in question (25:18 is split across two lines) are as follows:

4 52 81 KY CRRYM HM LKM BNKLYHM )$R NKLW LKM (L DBR P(WR W(L DBR KZBY BT N$Y)
4 52 81 MDYN )XTM HMKH BYWM HMGPH (L DBR P(WR
4 52 91 WYHY )XRY HMGPH
4 62 1 WY)MR YHWH )L M$H W)L )L(ZR BN )HRN HKHN L)MR

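These four lines correspond to the following Hebrew letter text (at least they correspond roughly, I haven't checked in detail):

  • 25:18 כי צררים הם לכם בנכליהם אשר־נכלו לכם על־דבר־פעור ועל־דבר כזבי בת־נשיא מדין אחתם המכה ביום־המגפה על־דבר־פעור
  • 25:19 ויהי אחרי המגפה
  • 26:1 ויאמר יהוה אל־משה ואל אלעזר בן־אהרן הכהן לאמר

(Well, above isn't strictly the letter-text since I added maqaf marks.)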
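
The correspondence between the transliterated lines and the Hebrew letters can be sketched in code. The table below is one reading of the Michigan-Claremont consonant codes, checked only against the four lines quoted above (the codes `+`, `S`, and `Q` do not occur in them). Reversing the chapter and verse digits and restoring final letter forms at word ends are both assumptions about this particular file. The names `decode_word` and `decode_line` are hypothetical.

```python
# Consonant codes, as far as they can be checked against the quoted lines.
MC_TO_HEBREW = {
    ")": "א", "B": "ב", "G": "ג", "D": "ד", "H": "ה", "W": "ו", "Z": "ז",
    "X": "ח", "+": "ט", "Y": "י", "K": "כ", "L": "ל", "M": "מ", "N": "נ",
    "S": "ס", "(": "ע", "P": "פ", "C": "צ", "Q": "ק", "R": "ר", "$": "ש",
    "T": "ת",
}
# Assumption: the file does not encode final forms, so restore them here.
FINAL_FORMS = {"כ": "ך", "מ": "ם", "נ": "ן", "פ": "ף", "צ": "ץ"}

def decode_word(word):
    """Map one transliterated word to Hebrew letters."""
    letters = [MC_TO_HEBREW[code] for code in word]
    if letters and letters[-1] in FINAL_FORMS:
        letters[-1] = FINAL_FORMS[letters[-1]]
    return "".join(letters)

def decode_line(line):
    """Return ((book, chapter, verse), hebrew_text) for one line of the file."""
    book, chapter, verse, *words = line.split()
    # Chapter and verse digits appear reversed in the file ("52 91" is 25:19).
    reference = (int(book), int(chapter[::-1]), int(verse[::-1]))
    return reference, " ".join(decode_word(word) for word in words)

print(decode_line("4 52 91 WYHY )XRY HMGPH"))
```

Lines that share a reference, such as the two lines for 25:18, would need to be joined after decoding.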

Here as elsewhere in this comment thread, you are confusing the following two notions of "verse," which are usually the same but NOT ALWAYS THE SAME:

As I have said elsewhere on this comment thread (we're starting to go in loops, rehashing the same points, for which I am partly to blame), it is a separate question as to whether what advertises itself as a transcription of the letter-text of Koren should use a verse numbering other than Koren's. Perhaps they used BHS-style verse numbering in order to be more easily comparable with other texts (notably, WLC!) that use BHS-style verse numbering.

TorahBibleCodes commented 4 months ago

I will integrate your MAM text into the program, and show you the data.

The precise mathematics of each letter and word and verse will clear this up if you think we are going around in loops.

Simply the Leningrad from Sefaria (wlc BHS?) does not have Numbers 25:19. It does not exist there or in the actual Koren Hebrew; I have only seen it in the CMT and now in your MAM, which so far appears to be identical in letters to the actual Koren Hebrew… which I theorize may be a copy of the now-lost Aleppo codex Torah.

On Wed, 24 Apr 2024 at 16:28 Ben Denckla @.***> wrote:

the CMT does make the mistake of putting the SOF PSUK after the first few words of 25:18 and makes a new verse: 25:19.

I have no idea what you're talking about. I downloaded numbers.koren.gz https://users.cecs.anu.edu.au/~bdm/dilugim/StatSci/numbers.koren.gz from the overall page https://users.cecs.anu.edu.au/~bdm/dilugim/StatSci/data.html you provided a link to.

Overcoming its weirdly-backwards numbers, I find that:

  • Numbers 25:18, 25:19, and 26:1 have the expected contents (at least roughly) for an edition with BHS-style verse numbers.
  • No Michigan-Claremont codes for sof pasuq (00) are used in the entire file for Numbers.

The four lines in question (25:18 is split across two lines) are as follows:

4 52 81 KY CRRYM HM LKM BNKLYHM )$R NKLW LKM (L DBR P(WR W(L DBR KZBY BT N$Y)
4 52 81 MDYN )XTM HMKH BYWM HMGPH (L DBR P(WR
4 52 91 WYHY )XRY HMGPH
4 62 1 WY)MR YHWH )L M$H W)L )L(ZR BN )HRN HKHN L)MR

These four lines correspond to the following Hebrew letter (+maqaf) text (at least they correspond roughly, I haven't checked in detail):

  • 25:18 כי צררים הם לכם בנכליהם אשר־נכלו לכם על־דבר־פעור ועל־דבר כזבי בת־נשיא מדין אחתם המכה ביום־המגפה על־דבר־פעור
  • 25:19 ויהי אחרי המגפה
  • 26:1 ויאמר יהוה אל־משה ואל אלעזר בן־אהרן הכהן לאמר

Here as elsewhere in this comment thread, you are confusing the following two notions of "verse," which are usually the same but NOT ALWAYS THE SAME:

  • a CHANTED VERSE, i.e. a span of words from one sof pasuq to the next
  • a NUMBERED VERSE, i.e. a span of words from one verse number to the next

As I have said elsewhere on this comment thread (we're starting to go in loops, rehashing the same points, for which I am partly to blame), it is a separate question as to whether what advertises itself as a transcription of the letter-text of Koren should use a verse numbering other than Koren's. Perhaps they used BHS-style verse numbering in order to be more easily comparable with other texts (notably, WLC!) that use BHS-style verse numbering.

bdenckla commented 4 months ago

Simply the Leningrad from Sefaria (wlc BHS?) does not have Numbers 25:19.

You are not expressing yourself clearly by using vague phrases like "does not have Numbers 25:19."

I have repeatedly encouraged you to make distinctions between CONTENTS (Hebrew text) and LABELS (numbers).

As I have repeatedly said, "does not have Numbers 25:19" could mean (at least) two things:

For the first meaning ("Does not have the three words ..."), your claim is false. Sefaria's tanach.us text has those three words. They just happen to be the first three words after the label for Numbers 26:1 in that edition.

For the second meaning ("Does not have a LABEL, '19' ..."), your claim is true. But presumably this fact is irrelevant to any of your purposes. People disagree as to whether or not Torah was given by G-d. Everyone agrees verse number labels certainly were not. They are a relatively recent and certainly-human invention.

Just choose a verse-numbering convention and apply that convention to any texts you want to compare. It doesn't matter what convention you choose; it only matters that you use it uniformly.
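Here is a rough Python sketch of what I mean by separating CONTENTS from LABELS. The two tiny "editions" hold only the words discussed in this thread (letters only, no maqaf); the variable and function names are hypothetical, and `relabel` assumes both editions contain exactly the same words in the same order.

```python
from itertools import islice

AFTER_PLAGUE = "ויהי אחרי המגפה"
THEN_SAID = "ויאמר יהוה אל משה ואל אלעזר בן אהרן הכהן לאמר"

# BHS-style numbering: the three words carry their own label, 25:19.
bhs_style = [("25:19", AFTER_PLAGUE), ("26:1", THEN_SAID)]
# The other numbering: the same three words open 26:1.
other_style = [("26:1", AFTER_PLAGUE + " " + THEN_SAID)]


def word_stream(edition):
    """CONTENTS only: every word in order, ignoring labels."""
    return [word for _label, text in edition for word in text.split()]


def labels(edition):
    """LABELS only."""
    return [label for label, _text in edition]


def relabel(edition, reference):
    """Re-cut edition's words at reference's verse boundaries."""
    words = iter(word_stream(edition))
    return [
        (label, " ".join(islice(words, len(text.split()))))
        for label, text in reference
    ]


print(word_stream(bhs_style) == word_stream(other_style))  # True
print(labels(bhs_style) == labels(other_style))            # False
print(relabel(other_style, bhs_style) == bhs_style)        # True
```

Once every text is re-cut to one convention like this, per-verse letter and word counts become comparable across editions.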

bdenckla commented 4 months ago

I theorize [that the Koren text] may be a copy of the now-lost Aleppo codex Torah.

There's no evidence to support that speculation, nor will any such evidence likely ever surface. So, there's really no point in making such a speculation except to fuel yet another feverish conspiracy theory surrounding the Aleppo Codex.

The evidence against that speculation is just kind of common sense:

If the Koren publishing house at some point in time had access to the now-missing parts of the Aleppo Codex, don't you think they'd boast about it in their marketing materials?

I suppose you could say that maybe they had access but had to keep that access secret.

In that case, why would they bother to do the work of transcribing, if they had to keep it secret? I.e. why bother to do that work if it doesn't help sell your product?

TorahBibleCodes commented 4 months ago

Yes, 25:19 breaks the uniformity.

https://tanach.us

This placement of sof psuk here is a mistake in wherever it originated; the Tanach on Sefaria’s GitHub ends at 25:18 and 26:1 contains those three words - yes, this is understood.

There are exact numbers of letters, words, and verses - this is important.

Breaking the verse 26:1 into two by creating 25:19 is an editing mistake, an addition/corruption to the original text; just read 26:1 in the Koren to see why someone put a sof psuk there.

If you look at the Koren, you see that they did this because the paragraph / psuk / line has a carriage return (line break) there, but the verse continues until the two dots that mark the official end of a verse.

Verses are from Sinai; only chapter numbers are Christian addition.

bdenckla commented 4 months ago

https://tanach.us

I am not sure what you mean by including this URL. It is a great site; I have devoted a lot of work to it through my contributions to it. But I don't see what point you're trying to make by mentioning it.

This placement of sof psuk here is a mistake in wherever it originated

You are using "sof psuk" in an ambiguous way. I would usually assume sof psuk (more often transliterated sof pasuq) to mean the end of a CHANTED verse, not the end of a NUMBERED verse. In many contexts it specifically means the colon-like mark that (along with silluq) indicates the end of a chanted verse.
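For a text that does include the mark, the chanted-verse notion can be sketched in Python as a split on the Unicode sof pasuq character (U+05C3). This is only an illustration: the words `w1` through `w6` and the labels are placeholders, not a transcription of any edition.

```python
SOF_PASUQ = "\u05C3"  # HEBREW PUNCTUATION SOF PASUQ, the colon-like mark


def chanted_verses(text):
    """Return the non-empty spans of text delimited by sof pasuq."""
    spans = (span.strip() for span in text.split(SOF_PASUQ))
    return [span for span in spans if span]


# Hypothetical input: two verse-number labels but only one sof pasuq.
numbered = [("A:19", "w1 w2 w3"), ("B:1", "w4 w5 w6" + SOF_PASUQ)]
running_text = " ".join(text for _label, text in numbered)

print(len(numbered))                      # 2 numbered verses
print(len(chanted_verses(running_text)))  # 1 chanted verse
```

The two counts differ because the first label ends without a sof pasuq, which is exactly the situation where the two notions of "verse" come apart.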

No editions disagree about the division of this span of text into chanted verses. So, since they all agree, there can be no mistakes in any editions, unless you think they are all wrong.

Some editions disagree about the division of this span of text into numbered verses. Again, there can be no mistakes in any editions, but this time the reason is that the division of the text into numbered verses is a question of taste, not a question of right and wrong.

Perhaps you are narrowly speaking about the Michigan-Claremont-format transcription of the Koren letter-text. Since it does not use the Michigan-Claremont code for sof pasuq (00), perhaps you could say it is in error since perhaps you could say that the files imply that every numbered verse is also a chanted verse. A more charitable interpretation of those files is that they make no representation about the boundaries of chanted verses, so they cannot be in error about such a boundary question.

Your repeated failure (or is it refusal?) to distinguish the notions of CHANTED verse and NUMBERED verse makes this conversation unlikely to be productive. Yet, I am still trying to explain.

Verses are from Sinai; only chapter numbers are Christian addition.

I think you mean to say that CHANTED verses are from Sinai.

You're going to have a lot of fun (by which I mean a lot of confusion) attempting to reconcile these two notions of a verse when it comes to the Decalogues.

TorahBibleCodes commented 4 months ago

You should be more respectful in how you speak.

bdenckla commented 4 months ago

You should be more respectful in how you speak.

I apologize for my tone; I guess I have reached the end of my patience and thus should not write any more.

If you have any technical questions about how to use any edition of MAM, please open up a new GitHub issue.

Other than that, I think we should conclude this discussion.