Open russellmorley opened 2 years ago
Hi @themikejr, please provide me with a sample of lowfat trees so I can validate which IDs to use in the mapping between lowfat and verse trees to display the full Hebrew sentence text.
Note: the motivation for using lowfat trees for display is that the lowfat XML format is going to be Macula's "living branch", where new items are added and changes are made. Using the lowfat trees for display (including the surface text and other metadata) connects us to the active branch of Macula data. I have asked the Macula team to provide versioned releases (using SemVer) to make it easier for us to stay integrated with their ongoing work.
@russellmorley
Public repositories: hebrew and greek.
Direct links: hebrew OSHB lowfat and greek Nestle1904 lowfat.
Please note: a stable version of the trees has not been released, so there is no guarantee that this won't change.
To map words (Greek) or word parts (Hebrew), I believe you would want to use the xml:id attribute on the <w> element.
I believe you will find two variants of the xml:id format in Hebrew:
xml:id="o010010010051"
xml:id="o010010050031ה"
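For anyone wiring this up, here is a minimal sketch of reading the xml:id values off the <w> elements with Python's standard library. The fragment in `SAMPLE` is a trimmed-down, hypothetical stand-in for a real lowfat file, and `word_ids` is just an illustrative helper name, not part of any Macula tooling.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved xml: prefix under this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, trimmed-down fragment; real <w> elements carry many more attributes.
SAMPLE = """<wg>
  <w ref="GEN 1:1!1" xml:id="o010010010011">בְּ</w>
  <w ref="GEN 1:1!1" xml:id="o010010010012">רֵאשִׁ֖ית</w>
</wg>"""


def word_ids(xml_text):
    """Return (xml:id, surface text) pairs for every <w> element, in document order."""
    root = ET.fromstring(xml_text)
    return [(w.get(XML_ID), w.text) for w in root.iter("w")]


pairs = word_ids(SAMPLE)
```

Note that both word parts share the same ref value but have distinct xml:id values, which is why xml:id is the attribute to map on.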
Here are the semantics of the field as I understand them: the ID starts with the character o. I think the same thing is planned for the NT using the n character.
<sentence>
<p>
<milestone unit="verse" id="GEN 1:1">GEN 1:1</milestone> בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃</p>
<wg class="cl" head="false" rule="pp-v-s-o">
<wg role="pp" class="pp" head="false" rule="prepnp">
<w ref="GEN 1:1!1"
xml:id="o010010010011"
english="in"
morph="R"
pos="preposition"
transliteration="bə"
stronglemma="b"
greek="ἐν"
greekstrong="1722"
strongnumberx="0871a"
class="prep"
unicode="בְּ"
lang="H"
lemma="b">בְּ</w>
<w ref="GEN 1:1!1"
xml:id="o010010010012"
mandarin="起初"
english="beginning"
morph="Ncfsa"
pos="noun"
after=" "
type="common"
transliteration="rēʾšiyṯ"
sdbh="006652001001000"
stronglemma="7225"
lexdomain="002003003004"
sensenumber="1"
greek="ἀρξῇ"
greekstrong="746"
strongnumberx="7225"
class="noun"
unicode="רֵאשִׁ֖ית"
lang="H"
lemma="7225"
gender="feminine"
number="singular"
state="absolute">רֵאשִׁ֖ית</w>
</wg>
<wg role="v" class="vp" head="true" rule="v2vp">
<w ref="GEN 1:1!2"
xml:id="o010010010021"
mandarin="创造"
english="created"
morph="Vqp3ms"
pos="verb"
after=" "
type="qatal"
gloss="he.created"
transliteration="bārāʾ"
sdbh="001156001002000"
stronglemma="1254 a"
lexdomain="002002002005"
contextualdomain="173"
coredomain="026 050"
sensenumber="1"
frame="A0:010010010031; A1:010010010052;010010010072;"
greek="ἐποίησεν"
greekstrong="4160"
strongnumberx="1254"
class="verb"
unicode="בָּרָ֣א"
lang="H"
lemma="1254 a"
gender="masculine"
number="singular"
stem="qal"
person="third">בָּרָ֣א</w>
</wg>
<wg role="s" class="np" head="true" rule="n2np">
<w ref="GEN 1:1!3"
xml:id="o010010010031"
mandarin="神/上帝"
english="God"
morph="Ncmpa"
pos="noun"
after=" "
type="common"
gloss="God"
transliteration="ʾĕlōhiym"
sdbh="000397001003000"
stronglemma="430"
lexdomain="001001001"
coredomain="050"
sensenumber="1"
greek="θεὸς"
greekstrong="2316"
strongnumberx="0430"
class="noun"
unicode="אֱלֹהִ֑ים"
lang="H"
lemma="430"
gender="masculine"
number="plural"
state="absolute">אֱלֹהִ֑ים</w>
</wg>
<wg role="o" class="np" head="true" rule="npanp">
<wg class="np" head="false" rule="ompnp">
<w ref="GEN 1:1!4"
xml:id="o010010010041"
morph="To"
pos="particle"
after=" "
type="direct object marker"
gloss="(et)"
transliteration="ʾēṯ"
sdbh="000792003001000"
stronglemma="853"
lexdomain="004003"
strongnumberx="0853"
class="om"
unicode="אֵ֥ת"
lang="H"
lemma="853">אֵ֥ת</w>
<wg class="np" head="false" rule="detnp">
<w ref="GEN 1:1!5"
xml:id="o010010010051"
english="the"
morph="Td"
pos="particle"
type="definite article"
transliteration="ha"
stronglemma="d"
greek="τὸν"
greekstrong="3588"
strongnumberx="1886a"
class="art"
unicode="הַ"
lang="H"
lemma="d">הַ</w>
<w ref="GEN 1:1!5"
xml:id="o010010010052"
mandarin="诸天"
english="heavens"
morph="Ncmpa"
pos="noun"
after=" "
type="common"
transliteration="ššāmayim"
sdbh="007458001002000"
stronglemma="8064"
lexdomain="001005006"
contextualdomain="026"
sensenumber="1"
greek="οὐρανὸν"
greekstrong="3772"
strongnumberx="8064"
class="noun"
unicode="שָּׁמַ֖יִם"
lang="H"
lemma="8064"
gender="masculine"
number="plural"
state="absolute">שָּׁמַ֖יִם</w>
</wg>
</wg>
<w ref="GEN 1:1!6"
xml:id="o010010010061"
mandarin="和"
english="and"
morph="C"
pos="conjunction"
transliteration="wə"
stronglemma="c"
greek="καὶ"
greekstrong="2532"
strongnumberx="2050b"
class="cj"
unicode="וְ"
lang="H"
lemma="c">וְ</w>
<wg class="np" head="false" rule="ompnp">
<w ref="GEN 1:1!6"
xml:id="o010010010062"
morph="To"
pos="particle"
after=" "
type="direct object marker"
transliteration="ʾēṯ"
sdbh="000792003001000"
stronglemma="853"
lexdomain="004003"
strongnumberx="0853"
class="om"
unicode="אֵ֥ת"
lang="H"
lemma="853">אֵ֥ת</w>
<wg class="np" head="false" rule="detnp">
<w ref="GEN 1:1!7"
xml:id="o010010010071"
english="the"
morph="Td"
pos="particle"
type="definite article"
transliteration="hā"
stronglemma="d"
greek="τὴν"
greekstrong="3588"
strongnumberx="1886a"
class="art"
unicode="הָ"
lang="H"
lemma="d">הָ</w>
<w ref="GEN 1:1!7"
xml:id="o010010010072"
mandarin="大地"
english="earth"
morph="Ncbsa"
pos="noun"
after="׃"
type="common"
transliteration="ʾāreṣ:"
sdbh="000715001001000"
stronglemma="776"
lexdomain="001005002"
coredomain="006 090 173"
sensenumber="2"
greek="γῆν"
greekstrong="1093"
strongnumberx="0776"
class="noun"
unicode="אָֽרֶץ"
lang="H"
lemma="776"
gender="both"
number="singular"
state="absolute">אָֽרֶץ</w>
</wg>
</wg>
</wg>
</wg>
</sentence>
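Since the goal is to display the full Hebrew sentence text, here is a rough sketch of rebuilding the surface text from a tree like the one above. It assumes, based only on this sample, that the `after` attribute holds whatever follows a word part (a space, sof pasuq, or nothing) and that document order of the <w> elements matches reading order. `surface_text` and the trimmed `SAMPLE` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down version of the GEN 1:1 sample above.
SAMPLE = """<sentence>
  <wg>
    <w ref="GEN 1:1!1">בְּ</w>
    <w ref="GEN 1:1!1" after=" ">רֵאשִׁ֖ית</w>
    <w ref="GEN 1:1!2" after=" ">בָּרָ֣א</w>
    <w ref="GEN 1:1!7" after="׃">אָֽרֶץ</w>
  </wg>
</sentence>"""


def surface_text(xml_text):
    """Concatenate each word part with its trailing 'after' text.

    Assumes document order equals reading order, which holds for this
    sample but is not guaranteed by anything stated in this thread.
    """
    root = ET.fromstring(xml_text)
    return "".join((w.text or "") + w.get("after", "") for w in root.iter("w"))


text = surface_text(SAMPLE)
```

Word parts without an `after` attribute (such as the preposition) are joined directly to the next part, so prefixes stay attached to their word.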
Hi @themikejr, can the following be ignored: "In hebrew there is an optional trailing character which I think will always be a hebrew consonant."? Or, said another way, can I just focus on the "BBCCCVVVWWWM" part of the xml:id on the "w" element? Is this what Jonathan is describing in the 'dashboard requirements for trees'?
@russellmorley From my viewpoint, it's preferable not to ignore any portion of the IDs and instead to treat the IDs in the tree as a "black box" string that is used only for equality checks. This way we don't have to maintain a secondary map from our version of the tree IDs to the ones in the publicly released trees. Treating the IDs this way will also make us less fragile to upstream data changes. Put another way, let's let unique identifiers be only unique identifiers as much as possible. If we need to relate an element in the tree to its context (book, chapter, verse, etc.), the ref attribute might be more helpful and stable.
Ignoring portions of the IDs would also make it more complicated to use the ID mapping that the Macula team plans on giving us and may even break data integrity.
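To make the "black box" suggestion concrete, here is a small sketch: IDs are only ever compared whole, and any book/chapter/verse context comes from the ref attribute instead. The `REF_PATTERN` regex is inferred from sample values like "GEN 1:1!5" in this thread and is not an official grammar; `Ref`, `parse_ref`, and `same_token` are hypothetical names.

```python
import re
from typing import NamedTuple


class Ref(NamedTuple):
    book: str
    chapter: int
    verse: int
    word: int


# Inferred from sample values such as "GEN 1:1!5"; an assumption, not a spec.
REF_PATTERN = re.compile(r"^(\S+) (\d+):(\d+)!(\d+)$")


def parse_ref(ref: str) -> Ref:
    """Split a ref like 'GEN 1:1!5' into book, chapter, verse, and word position."""
    match = REF_PATTERN.match(ref)
    if match is None:
        raise ValueError(f"unrecognized ref: {ref!r}")
    book, chapter, verse, word = match.groups()
    return Ref(book, int(chapter), int(verse), int(word))


def same_token(id_a: str, id_b: str) -> bool:
    """Compare IDs as opaque strings: no slicing, no stripping of trailing characters."""
    return id_a == id_b
```

With this approach the two Hebrew ID variants need no special handling: "o010010050031ה" and "o010010050031" are simply different strings.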
Is this what jonathan is describing in the 'dashboard requirements for trees' ?
There is too much in that document for me to take a stab at answering that off the cuff 😄
@romanpoz, I believe this is already resolved, as evidenced by the absence of Hebrew-related issues reported by users.
Implement the design decided at the GR Macula meeting, which extends verse trees with lowfat trees, linking them by a map:
Lowfat trees (id = ? at the same granularity as the verse trees' morph ID): [Mike] (1) [likes] (2) [chêêse.] (3)
Verse trees (ID = morphId of leaves): Mike (a) likes (b) cheese (c)
Map: (1, a), (2, b), (3, c)
The Dashboard uses lowfat trees for text display, e.g. “Mike likes chêêse.” It uses verse tree text tokens, e.g. “cheese”, to perform alignment, then converts the resulting alignments from verse tree IDs to lowfat tree IDs and stores them as lowfat tree IDs. It identifies tokens within the text display by relating the alignments (in lowfat tree IDs) to the lowfat tree IDs identifiable within the text display.
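The conversion step in this design could be sketched roughly as below. The IDs are the toy ones from the example (a, b, c and 1, 2, 3), the map is written in the verse-to-lowfat direction because that is the direction the conversion runs, and the alignment shape (a pair of source and target ID lists) is an assumption made for illustration, not the Dashboard's actual data model.

```python
from typing import Dict, List, Tuple

# Toy IDs from the example: verse tree leaves a, b, c pair with lowfat 1, 2, 3.
VERSE_TO_LOWFAT: Dict[str, str] = {"a": "1", "b": "2", "c": "3"}

# Hypothetical alignment shape: (source token IDs, target token IDs).
Alignment = Tuple[List[str], List[str]]


def to_lowfat(alignments: List[Alignment], id_map: Dict[str, str]) -> List[Alignment]:
    """Rewrite the source side of each alignment from verse tree IDs to lowfat IDs.

    A missing ID raises KeyError so that gaps in the map surface immediately
    rather than silently dropping tokens.
    """
    converted = []
    for source_ids, target_ids in alignments:
        converted.append(([id_map[i] for i in source_ids], list(target_ids)))
    return converted


# "cheese" (c) aligned to a hypothetical target token t1 is stored against lowfat ID 3.
stored = to_lowfat([(["c"], ["t1"])], VERSE_TO_LOWFAT)
```

Because the stored alignments carry lowfat IDs, the display layer can match them directly against the xml:id values in the lowfat trees without consulting the map again.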