Clear-Bible / ClearDashboard

The ClearDashboard project

Resolve issue with verse trees not containing enough information to produce full and complete hebrew sentence surface text. #152

Open russellmorley opened 2 years ago

russellmorley commented 2 years ago

Implement the design decided at the GR Macula meeting, which extends verse trees with lowfat trees and links the two by a map:

LowfatTrees (id = ?, at the same granularity as the verse trees' morph id):
  Mike   likes   chêêse.
  [(1)]  [(2)]   [(3)]

VerseTrees (id = morphId of leaves):
  Mike (a)  likes (b)  cheese (c)

Map:
  1 -> a
  2 -> b
  3 -> c

Dashboard uses lowfat trees for text display, e.g. “Mike likes chêêse.” It uses verse tree text tokens, e.g. “cheese”, to perform alignment, then converts the resulting alignments from verse tree ids to lowfat tree ids and stores them as lowfat tree ids. It identifies tokens within the text display by relating alignments (in lowfat tree ids) to the lowfat tree ids identifiable within the text display.
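The flow described above could be sketched roughly as follows. This is only an illustration of the id mapping, not Dashboard's actual code: the ids ("1".."3", "a".."c"), the target id "t9", and the helper names are all hypothetical.

```python
# Hypothetical tokens: (id, text). Lowfat trees carry the display text,
# verse trees carry the tokens used for alignment.
lowfat_tokens = [("1", "Mike"), ("2", "likes"), ("3", "chêêse.")]
verse_tokens = [("a", "Mike"), ("b", "likes"), ("c", "cheese")]

# The map from the design: lowfat id -> verse tree id, plus its inverse.
lowfat_to_verse = {"1": "a", "2": "b", "3": "c"}
verse_to_lowfat = {v: k for k, v in lowfat_to_verse.items()}


def display_text(tokens):
    """Build the text shown to the user from lowfat tokens."""
    return " ".join(text for _, text in tokens)


def to_lowfat_alignments(alignments):
    """Convert (verse_tree_id, target_id) pairs to (lowfat_id, target_id)."""
    return [(verse_to_lowfat[src], tgt) for src, tgt in alignments]


# Alignment is computed on verse tree tokens, then stored by lowfat id.
verse_alignments = [("c", "t9")]  # "cheese" aligned to a made-up target id
stored = to_lowfat_alignments(verse_alignments)
```

Because the stored alignments use lowfat ids, a token in the display text can be looked up directly by the id it was rendered from.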

russellmorley commented 2 years ago

Hi @themikejr, please provide me with a sample of lowfat trees so I can validate which ids to use in the mapping between lowfat and verse trees to display the full Hebrew sentence text.

themikejr commented 2 years ago

Note: the motivation for using lowfat trees for display is that the lowfat XML format is going to be Macula's "living branch", where new items are added and changes are made. Using the lowfat trees for display (including the surface text and other metadata) connects us to the active branch of Macula data. I have asked the Macula team to provide versioned releases (using SemVer) to make it easier for us to stay integrated with their ongoing work.

themikejr commented 2 years ago

@russellmorley

Public repositories: hebrew and greek.

Direct links: hebrew OSHB lowfat and greek Nestle1904 lowfat.

Please note: a stable version of the trees has not been released, so there is no guarantee that this won't change.

What I know about the ID format

To map words (Greek) or word parts (Hebrew), I believe you would want to use the xml:id attribute on the <w> element. I believe you will find two variants of the xml:id format in Hebrew:

  1. xml:id="o010010010051"
  2. xml:id="o010010050031ה"

Here are the semantics of the field as I understand them.

Hebrew sample

  <sentence>
      <p>
         <milestone unit="verse" id="GEN 1:1">GEN 1:1</milestone> בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃</p>
      <wg class="cl" head="false" rule="pp-v-s-o">
         <wg role="pp" class="pp" head="false" rule="prepnp">
            <w ref="GEN 1:1!1"
               xml:id="o010010010011"
               english="in"
               morph="R"
               pos="preposition"
               transliteration="bə"
               stronglemma="b"
               greek="ἐν"
               greekstrong="1722"
               strongnumberx="0871a"
               class="prep"
               unicode="בְּ"
               lang="H"
               lemma="b">בְּ</w>
            <w ref="GEN 1:1!1"
               xml:id="o010010010012"
               mandarin="起初"
               english="beginning"
               morph="Ncfsa"
               pos="noun"
               after=" "
               type="common"
               transliteration="rēʾšiyṯ"
               sdbh="006652001001000"
               stronglemma="7225"
               lexdomain="002003003004"
               sensenumber="1"
               greek="ἀρξῇ"
               greekstrong="746"
               strongnumberx="7225"
               class="noun"
               unicode="רֵאשִׁ֖ית"
               lang="H"
               lemma="7225"
               gender="feminine"
               number="singular"
               state="absolute">רֵאשִׁ֖ית</w>
         </wg>
         <wg role="v" class="vp" head="true" rule="v2vp">
            <w ref="GEN 1:1!2"
               xml:id="o010010010021"
               mandarin="创造"
               english="created"
               morph="Vqp3ms"
               pos="verb"
               after=" "
               type="qatal"
               gloss="he.created"
               transliteration="bārāʾ"
               sdbh="001156001002000"
               stronglemma="1254 a"
               lexdomain="002002002005"
               contextualdomain="173"
               coredomain="026 050"
               sensenumber="1"
               frame="A0:010010010031; A1:010010010052;010010010072;"
               greek="ἐποίησεν"
               greekstrong="4160"
               strongnumberx="1254"
               class="verb"
               unicode="בָּרָ֣א"
               lang="H"
               lemma="1254 a"
               gender="masculine"
               number="singular"
               stem="qal"
               person="third">בָּרָ֣א</w>
         </wg>
         <wg role="s" class="np" head="true" rule="n2np">
            <w ref="GEN 1:1!3"
               xml:id="o010010010031"
               mandarin="神/上帝"
               english="God"
               morph="Ncmpa"
               pos="noun"
               after=" "
               type="common"
               gloss="God"
               transliteration="ʾĕlōhiym"
               sdbh="000397001003000"
               stronglemma="430"
               lexdomain="001001001"
               coredomain="050"
               sensenumber="1"
               greek="θεὸς"
               greekstrong="2316"
               strongnumberx="0430"
               class="noun"
               unicode="אֱלֹהִ֑ים"
               lang="H"
               lemma="430"
               gender="masculine"
               number="plural"
               state="absolute">אֱלֹהִ֑ים</w>
         </wg>
         <wg role="o" class="np" head="true" rule="npanp">
            <wg class="np" head="false" rule="ompnp">
               <w ref="GEN 1:1!4"
                  xml:id="o010010010041"
                  morph="To"
                  pos="particle"
                  after=" "
                  type="direct object marker"
                  gloss="(et)"
                  transliteration="ʾēṯ"
                  sdbh="000792003001000"
                  stronglemma="853"
                  lexdomain="004003"
                  strongnumberx="0853"
                  class="om"
                  unicode="אֵ֥ת"
                  lang="H"
                  lemma="853">אֵ֥ת</w>
               <wg class="np" head="false" rule="detnp">
                  <w ref="GEN 1:1!5"
                     xml:id="o010010010051"
                     english="the"
                     morph="Td"
                     pos="particle"
                     type="definite article"
                     transliteration="ha"
                     stronglemma="d"
                     greek="τὸν"
                     greekstrong="3588"
                     strongnumberx="1886a"
                     class="art"
                     unicode="הַ"
                     lang="H"
                     lemma="d">הַ</w>
                  <w ref="GEN 1:1!5"
                     xml:id="o010010010052"
                     mandarin="诸天"
                     english="heavens"
                     morph="Ncmpa"
                     pos="noun"
                     after=" "
                     type="common"
                     transliteration="ššāmayim"
                     sdbh="007458001002000"
                     stronglemma="8064"
                     lexdomain="001005006"
                     contextualdomain="026"
                     sensenumber="1"
                     greek="οὐρανὸν"
                     greekstrong="3772"
                     strongnumberx="8064"
                     class="noun"
                     unicode="שָּׁמַ֖יִם"
                     lang="H"
                     lemma="8064"
                     gender="masculine"
                     number="plural"
                     state="absolute">שָּׁמַ֖יִם</w>
               </wg>
            </wg>
            <w ref="GEN 1:1!6"
               xml:id="o010010010061"
               mandarin="和"
               english="and"
               morph="C"
               pos="conjunction"
               transliteration="wə"
               stronglemma="c"
               greek="καὶ"
               greekstrong="2532"
               strongnumberx="2050b"
               class="cj"
               unicode="וְ"
               lang="H"
               lemma="c">וְ</w>
            <wg class="np" head="false" rule="ompnp">
               <w ref="GEN 1:1!6"
                  xml:id="o010010010062"
                  morph="To"
                  pos="particle"
                  after=" "
                  type="direct object marker"
                  transliteration="ʾēṯ"
                  sdbh="000792003001000"
                  stronglemma="853"
                  lexdomain="004003"
                  strongnumberx="0853"
                  class="om"
                  unicode="אֵ֥ת"
                  lang="H"
                  lemma="853">אֵ֥ת</w>
               <wg class="np" head="false" rule="detnp">
                  <w ref="GEN 1:1!7"
                     xml:id="o010010010071"
                     english="the"
                     morph="Td"
                     pos="particle"
                     type="definite article"
                     transliteration="hā"
                     stronglemma="d"
                     greek="τὴν"
                     greekstrong="3588"
                     strongnumberx="1886a"
                     class="art"
                     unicode="הָ"
                     lang="H"
                     lemma="d">הָ</w>
                  <w ref="GEN 1:1!7"
                     xml:id="o010010010072"
                     mandarin="大地"
                     english="earth"
                     morph="Ncbsa"
                     pos="noun"
                     after="׃"
                     type="common"
                     transliteration="ʾāreṣ:"
                     sdbh="000715001001000"
                     stronglemma="776"
                     lexdomain="001005002"
                     coredomain="006 090 173"
                     sensenumber="2"
                     greek="γῆν"
                     greekstrong="1093"
                     strongnumberx="0776"
                     class="noun"
                     unicode="אָֽרֶץ"
                     lang="H"
                     lemma="776"
                     gender="both"
                     number="singular"
                     state="absolute">אָֽרֶץ</w>
               </wg>
            </wg>
         </wg>
      </wg>
   </sentence>
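
As a rough sketch of how the sample above could be consumed, the snippet below reads the xml:id, text, and `after` attribute of each <w> element and joins them into surface text. It uses a trimmed copy of the sample and assumes that document order matches reading order and that a missing `after` attribute means no separator; neither assumption is confirmed by the Macula team. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved xml: prefix under this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Trimmed copy of the GEN 1:1 sample (most attributes omitted).
SAMPLE = """<sentence>
  <wg class="cl">
    <w ref="GEN 1:1!1" xml:id="o010010010011">בְּ</w>
    <w ref="GEN 1:1!1" xml:id="o010010010012" after=" ">רֵאשִׁ֖ית</w>
    <w ref="GEN 1:1!2" xml:id="o010010010021" after=" ">בָּרָ֣א</w>
  </wg>
</sentence>"""


def read_tokens(xml_text):
    """Return (xml:id, text, after) for every <w> element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        (w.get(XML_ID), w.text or "", w.get("after", ""))
        for w in root.iter("w")
    ]


def surface_text(tokens):
    """Join word parts, using each token's `after` value as its separator."""
    return "".join(text + after for _, text, after in tokens).strip()


tokens = read_tokens(SAMPLE)
print(surface_text(tokens))
```

Word parts that share a `ref` (such as the preposition and noun in `GEN 1:1!1`) come out joined because the first part has no `after` value.
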
russellmorley commented 2 years ago

Hi @themikejr, can the following be ignored? "In hebrew there is an optional trailing character which I think will always be a hebrew consonant." Or, said another way, can I just focus on the "BBCCCVVVWWWM" part of the "w" element's id? Is this what Jonathan is describing in the 'dashboard requirements for trees'?

themikejr commented 2 years ago

@russellmorley From my viewpoint, it's preferable not to ignore any portion of the IDs and instead to treat the IDs in the tree as "black box" strings that can be used for equality checks. This way we don't have to maintain a secondary map from our version of the tree IDs to the ones in the publicly released trees. Treating the IDs this way will also make us less fragile to upstream data changes. Put another way, let's let unique identifiers be only unique identifiers as much as possible. If we need to relate a context to an element in the tree (book, chapter, verse, etc.), the ref attribute might be more helpful and stable.

Ignoring portions of the IDs would also make it more complicated to use the ID mapping that the Macula team plans on giving us, and may even break data integrity.
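A minimal sketch of that approach: ids are compared only as whole strings, and context comes from the ref attribute instead. The mapping values are made-up placeholders, and reading the number after `!` in a ref such as `GEN 1:1!5` as a word position is an assumption drawn from the sample, not a documented format.

```python
# The two id variants mentioned in this thread, kept as opaque strings.
PLAIN = "o010010010051"
WITH_SUFFIX = "o010010050031ה"

# Hypothetical lowfat id -> verse tree id map (placeholder values).
id_map = {PLAIN: "verse-tree-id-1", WITH_SUFFIX: "verse-tree-id-2"}


def lookup(lowfat_id):
    """Exact-match lookup; no part of the id is stripped or parsed."""
    return id_map.get(lowfat_id)


def parse_ref(ref):
    """Split a ref like 'GEN 1:1!5' into (book, chapter, verse, word).

    Assumes the 'BOOK C:V!W' shape seen in the sample.
    """
    location, _, word = ref.partition("!")
    book, _, chapter_verse = location.partition(" ")
    chapter, _, verse = chapter_verse.partition(":")
    return book, int(chapter), int(verse), int(word)
```

Dropping the trailing consonant from the second id would make `lookup` miss, which is the data-integrity risk described above.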

themikejr commented 2 years ago

> Is this what jonathan is describing in the 'dashboard requirements for trees' ?

There is too much in that document for me to take a stab at answering that off the cuff 😄

russellmorley commented 9 months ago

@romanpoz, I believe this is already resolved, since users have not reported any issues with Hebrew.