bdenckla / MAM-for-Sefaria

Miqra According to the Masorah for Sefaria

Integration of MAM Codex into TorahBibleCodes Bible Research Software (issue: PARSING) #2

Closed TorahBibleCodes closed 3 months ago

TorahBibleCodes commented 3 months ago

With God's help.

Shalom, Hi.

I have successfully integrated your MAM texts into our TorahBibleCodes Bible Research Software - not yet shared on GitHub.

I have encountered a few issues in parsing the NIKKUD, and have solved them so far, but am encountering an issue now, and would like to share still-private development files with you.

Can I have your e-mail address to send you some development files for you to look at?

I may need your help in finding a solution for parsing the NIKKUD in these texts.

Thanks.

TorahBibleCodes commented 3 months ago

If you would like to send me your private e-mail address, please send e-mail to info@torahbiblecodes.com.

I am closing this issue because I solved the issue by changing the order of the parse and normalization.

For whatever reason, the following function failed to strip the PASEQ when it ran later in the program, after the Hebrew had already been normalized.

The solution was just parsing this one character out first, and then doing the normalization for the rest of the NIKKUD afterwards.

import re

def fn_RemovePaseq(text):
    """Replace every paseq (U+05C0) in text with a single space."""

    # Raw-string pattern; the re module itself interprets the \u05C0 escape.
    pattern = r'\u05C0'

    # Substitute a space (not an empty string) so the neighboring words stay separated.
    clean_text = re.sub(pattern, ' ', text)

    return clean_text

# Example usage:
x = 'וֶהֱשִֽׁיבְךָ֨ יְהֹוָ֥ה׀׀מִצְרַ֘יִם֮ בׇּאֳנִיּוֹת֒ בַּדֶּ֙רֶךְ֙ אֲשֶׁ֣ר אָמַ֣רְתִּֽי לְךָ֔ לֹא־תֹסִ֥יף ע֖וֹד לִרְאֹתָ֑הּ וְהִתְמַכַּרְתֶּ֨ם שָׁ֧ם לְאֹיְבֶ֛יךָ לַעֲבָדִ֥ים וְלִשְׁפָח֖וֹת וְאֵ֥ין קֹנֶֽה׃ {ס}'

cleaned_text = fn_RemovePaseq(x)
print(cleaned_text)  # In the output, each Paseq character has been replaced by a space
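
The ordering described above (strip the paseq first, normalize afterwards) can be sketched as follows. This is a minimal illustration, not the actual TorahBibleCodes code: the thread does not show the normalization step, so `strip_marks` is an assumed stand-in that removes combining marks, and both `strip_marks` and `clean_verse` are hypothetical names.

```python
import unicodedata

PASEQ = "\u05C0"  # HEBREW PUNCTUATION PASEQ


def strip_marks(text):
    """Remove combining marks (nikkud and cantillation).

    Assumed stand-in for the normalization step, which the thread does not show.
    """
    # NFD splits precomposed presentation forms so their marks become separate code points.
    decomposed = unicodedata.normalize("NFD", text)
    # Unicode category "Mn" (nonspacing mark) covers Hebrew points and accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def clean_verse(text):
    """Replace paseq with a space first, then strip marks and collapse whitespace."""
    without_paseq = text.replace(PASEQ, " ")
    return " ".join(strip_marks(without_paseq).split())
```

Paseq is a punctuation character rather than a combining mark, so a filter that removes only marks leaves it in the text. Whether that explains the behavior reported here is not confirmed by the thread.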
TorahBibleCodes commented 3 months ago

Closed.

bdenckla commented 3 months ago

Weird that paseq was causing your problems, but I'm glad you were able to figure this out.

TorahBibleCodes commented 3 months ago

Hi.

After parsing that text, and counting the Hebrew text of Torah, either 4 letters were parsed out by mistake OR this MAM Codex is 304,801 letters vs. Koren (Hebrew Tradition) 304,805 letters of Torah.

YH = 304801; XW = 1

It would be VERY interesting if this MAM Codex really does differ from the Koren by 4 letters in total length. The number of individual letter differences may well be larger than 4, though, because additions and omissions partly cancel out in the total. The same holds between Leningrad's 304,850 and Koren's 304,805 letters: the net difference is 45, but the texts differ in more places than that, so the total of 45 does not reflect all of the differences.

This is VERY interesting.

Would you be so kind as to confirm that I parsed the text correctly and have the right number, by stripping the NIKKUD yourself and counting the letters of the Torah in this MAM codex?
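
A letter count of this kind can be checked independently with a few lines of code. The sketch below assumes that "letters" means code points in the basic Hebrew letter range alef through tav (U+05D0 to U+05EA, final forms included), and that points, accents, punctuation, and section markers are simply skipped. The function name `count_hebrew_letters` is hypothetical.

```python
def count_hebrew_letters(text):
    """Count code points in the Hebrew letter range alef..tav (U+05D0..U+05EA)."""
    return sum(1 for ch in text if "\u05D0" <= ch <= "\u05EA")


# The first word of Genesis with its points and accent, written with escapes.
first_word = "\u05D1\u05B0\u05BC\u05E8\u05B5\u05D0\u05E9\u05B4\u05C1\u0596\u05D9\u05EA"
print(count_hebrew_letters(first_word))  # 6
```

Summing this count over every verse of the five books gives a total to compare against the 304,801 figure above.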

TorahBibleCodes commented 3 months ago

There are definite parsing issues, e.g. EXODUS 20:13-14.

Nonetheless, between the verses without parsing issues, we can see that there are definite scribal differences between verses in the MAM vs. KOREN codices.

Please see SCREENSHOT:

image

TorahBibleCodes commented 3 months ago

Wow! I stand corrected: there is no error in parsing: The MAM text is different than the Koren!!

Please see SCREENSHOT of your CSV text files that confirms that my analysis and data are correct!!

image

TorahBibleCodes commented 3 months ago

Very interesting! The only book without any differences (letter-by-letter) in verses between MAM and KOREN is Leviticus.

As you can see in the SCREENSHOT, the book #3 Leviticus is missing between book #2 Exodus and book #4 Numbers.

SCREENSHOT:

image

TorahBibleCodes commented 3 months ago

There are definite data-consistency issues with these text files.

On the face of it, the CSV data appears identical in format across files, but a number of files fail to parse and integrate into the program.

Genesis through Deuteronomy are ok.

There is a problem with Joshua, Judges, Daniel - I haven't yet tested them all.

I am getting a different error for each of the ones that don't parse correctly into the program.

The Book of Esther works and parses ok and integrates into the TorahBibleCodes Bible Research Program.

Very interesting that the MAM version of Esther is 12110 letters vs. Leningrad's 12112 letters.

Even though the total difference in text length is only two letters, the following SCREENSHOT shows every verse where there is at least a one-letter difference between the two texts.

I count a total of 14 verses between MAM and Leningrad's Book of Esther that have differences.

SCREENSHOT: image
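
The gap between a small net length difference and a larger number of differing verses can be illustrated with a short sketch. It assumes each edition has already been loaded into a dict mapping a `(chapter, verse)` tuple to the verse text; that layout and the names `letters_only`, `differing_verses`, and `net_letter_difference` are hypothetical, not taken from either project.

```python
def letters_only(text):
    """Keep only Hebrew letters (U+05D0..U+05EA), dropping points, accents, and punctuation."""
    return "".join(ch for ch in text if "\u05D0" <= ch <= "\u05EA")


def differing_verses(edition_a, edition_b):
    """Return the sorted references, present in both editions, whose letters differ."""
    shared = edition_a.keys() & edition_b.keys()
    return sorted(
        ref for ref in shared
        if letters_only(edition_a[ref]) != letters_only(edition_b[ref])
    )


def net_letter_difference(edition_a, edition_b):
    """Return total letters in edition_a minus total letters in edition_b."""
    total_a = sum(len(letters_only(v)) for v in edition_a.values())
    total_b = sum(len(letters_only(v)) for v in edition_b.values())
    return total_a - total_b


# Toy data: one edition has an extra vav in the first verse, the other in the second.
toy_a = {(1, 1): "\u05D0\u05D1", (1, 2): "\u05D2\u05D5\u05D3"}
toy_b = {(1, 1): "\u05D0\u05D5\u05D1", (1, 2): "\u05D2\u05D3"}
print(net_letter_difference(toy_a, toy_b))  # 0
print(differing_verses(toy_a, toy_b))       # [(1, 1), (1, 2)]
```

In the toy data the totals match exactly while both verses differ, which is the same effect as a two-letter net difference spread over 14 verses.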

bdenckla commented 3 months ago

There are well-known differences between the various "scroll traditions" (Ashkenaz, Sefarad, Teiman). You seem to be discovering these for yourself through comparison of editions (like comparing MAM-for-Sefaria with Koren). But I would imagine these differences pre-date the age of printing by centuries if not a couple of millennia. Many or all of these differences between scroll traditions of Torah and Esther are noted by MAM in its "scroll difference notes." You can see these in various editions of MAM, spread out through the text, but you can also see them all in one place, listed as the "na-note" feature of interest here: https://bdenckla.github.io/MAM-with-doc/foi/foi-rare-tmpls.html. For instance, the E28:26 and N1:17 differences you "discovered" above are noted by MAM.

If you are having fun discovering this stuff yourself, that is great. But just be aware that people have carefully studied this stuff for centuries, in fact for a few millennia. So, if you want to accelerate your research, you could seek out sources on the topic.

TorahBibleCodes commented 3 months ago

The Leningrad and the Koren have the most reliable number of verses. They are identical - 5846 if I remember correctly.

These additions of further divisions in the MAM reveal that they are later additions (tosafot). "Do not add to it and do not take away from it."

These may be later traditional additions by rabbis, but a simple look at the Ten Commandments shows spaces rather than sof pasuq and samekh markers... this is corruption. These are not the right verse divisions; the verse division has never changed since Sinai.

The Jewish tradition counts 12111 letters for Esther if I remember correctly - not 12110.

Truly Koren is the most faithful and only source codex I know of that is the 304805 length. It is nearly identical in its entirety to the Leningrad codex, and that itself is the only complete copy that is extant in its entirety as a complete book... it was the standard before Koren first published in the 1950s... after the Aleppo codex had vanished.

I still theorize/wonder that it may be a copy of the now/currently-lost Aleppo codex.

I need to publish a paper on this.

TorahBibleCodes commented 3 months ago

Since this MAM codex has 304801 letters, it would not be acceptable kosher torah to Orthodox Judaism as the actual Torah according to the masorah.

If I may request: Where is the source for this digitized, digital data?

bdenckla commented 3 months ago

> If I may request: Where is the source for this digitized, digital data?

MAM's sources and methods appear in documentation accompanying its Wikisource edition.

TorahBibleCodes commented 3 months ago

Is there a direct link to the info?

bdenckla commented 3 months ago

> Is there a direct link to the info?

The development process for MAM is described here.

TorahBibleCodes commented 3 months ago

Thanks. Would you be so kind to provide a direct link to the info page of what-and-from-where are the source codex/codices for these MAM texts?

TorahBibleCodes commented 3 months ago

Here is a link to the differences between the Leningrad and Koren texts, in Michigan-Claremont transliteration. I am sharing it because the Koren publishers do not agree to share their digitized codex of the Torah, the Tanach, or any other text for scientific research.

https://github.com/TorahBibleCodes/ResearchData_KorenCodexVsLeningradCodex

bdenckla commented 3 months ago

> Thanks. Would you be so kind to provide a direct link to the info page of what-and-from-where are the source codex/codices for these MAM texts?

I don't quite know what you're asking for, but probably all you want is on the page I already gave a link to. That link I gave was to a location somewhat far down on the page, so here's a link to the top of the page, in case that helps.

TorahBibleCodes commented 3 months ago

We are looking for source codices of the Miqra - not tikunim/"fixes".

We would like the source Aleppo codex as a dream, and wonder if the Koren is a resurfacing of a copy of the Aleppo codex.

Not sure what the MAM codex is.

Not sure what the Koren codex source is - they refuse to cooperate.

TorahBibleCodes commented 3 months ago

I see references to Aleppo as well as some various versions of rabbis, and also to Leningrad.

If I remember correctly, it is written that the blank space for the Miqra from the lost Torah from Aleppo was filled in with info from the Leningrad.

Then, in the 1950s, the Koren became the first edition printed in the Land of Israel since biblical times...

I wonduh... did the Aleppo codex survive and copied and printed in the 1950s...?

bdenckla commented 3 months ago

> I wonduh... did the Aleppo codex survive and copied and printed in the 1950s...?

You have floated that wild theory here before. I will not engage with you on that topic again.

If, in the future, you have a specific technical question about how to use MAM, please create a new issue.

Otherwise, I think we are done here.

I will close this issue for now. I'm not sure what the original problem was that caused you to create this issue, but it seems that you overcame whatever original problem you had by adapting your code to the data contained in "MAM for Sefaria" (or whatever edition of MAM you are using).

TorahBibleCodes commented 3 months ago

Until we can confirm the codex for each part of your MAM, we can’t deploy and release the MAM integration.

We need to know which parts of the MAM are Aleppo, Leningrad, and various edits of various rabbis.

Cooperation and verifiable confirmation means more exposure for your MAM project.

But we can’t release with MAM integration until this can be confirmed.

On Wed, 22 May 2024 at 22:17 Ben Denckla @.***> wrote:

> I wonduh... did the Aleppo codex survive and copied and printed in the 1950s...?
>
> You have floated that wild theory here before. I will not engage with you on that topic again.
>
> If, in the future, you have a specific technical question about how to use MAM, please create a new issue.
>
> Otherwise, I think we are done here.
>
> I will close this issue for now. I'm not sure what the original problem was that caused you to create this issue, but it seems that you overcame whatever original problem you had by adapting your code to the data contained in "MAM for Sefaria" (or whatever edition of MAM you are using).
