Clear-Bible / macula-hebrew

Syntax trees, morphology, and linguistic annotations for the Hebrew Bible

Additional Data to Port over from Original Trees #3

Open rkjtan opened 2 years ago

rkjtan commented 2 years ago

Participant referent data: SubjRef appears only on verbs with implied subjects; format SubjRef="{010010310021}". Ref usually appears only on nouns, pronouns, or adjectives; format Ref="{010010120082}".

LXX mapping: GreekStrong="1722" Greek="e)poi/hsen"

Semantic Roles: Frame="{A0:010010310021; A1:010010310041;}"

Word Sense data: SenseNumber="2" (SenseNumber="0" means that it is a function word that we didn't do word sense disambiguation on)

Glosses: English and Chinese glosses in the full trees cannot be used--Mike has Andi's automatically calculated glosses for English and Chinese mapped for YTB & ClearSuite that we should be able to use

Object Complements: There are actually two types of nodes that are currently labeled as O2 (second object). Some have the attribute Label="OC" in the original trees, meaning that the node is an object complement rather than strictly a second object. This Label="OC" data needs to be ported over & used to convert the relevant O2s into OCs.

Vocatives: Attribute Vocative="True" (ignore all Vocative="False")

For comparison & checking purposes at some point later in the process: Compare the Strong Number and StrongNumberX work Clear did to OSHB's values (in the vast majority of cases they would be identical, so there's no need to add these usually redundant values, but a comparison may show some places to double-check OSHB).
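The Ref, SubjRef, and Frame formats listed above could be read with something like the following. This is only a minimal sketch: the helper names `parse_ref` and `parse_frame` are hypothetical, and it assumes that every Id is a 12-digit string and that Frame entries are `role:Id` pairs separated by semicolons, as in the examples in this comment.

```python
import re

# Assumption: morph Ids are always 12 digits, as in the examples above.
ID_RE = re.compile(r"\d{12}")


def parse_ref(value):
    """Return the Ids found in a Ref/SubjRef value such as "{010010310021}"."""
    return ID_RE.findall(value)


def parse_frame(value):
    """Parse a Frame value into a dict mapping role label to a list of Ids."""
    roles = {}
    for part in value.strip("{}").split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece after the trailing semicolon
        role, _, ident = part.partition(":")
        roles.setdefault(role.strip(), []).append(ident.strip())
    return roles


print(parse_ref("{010010310021}"))
# ['010010310021']
print(parse_frame("{A0:010010310021; A1:010010310041;}"))
# {'A0': ['010010310021'], 'A1': ['010010310041']}
```

An empty value such as "{}" yields an empty list or dict, which makes it easy to skip unannotated words.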

jonathanrobie commented 2 years ago

Participant referent data: SubjRef appears only on verbs with implied subjects; format SubjRef="{010010310021}". Ref usually appears only on nouns, pronouns, or adjectives; format Ref="{010010120082}".

@rkjtan - If I use SubjRef and Ref whenever the value is not "{}", will that give the right result? Are there instances where these have values but are not desired?

jonathanrobie commented 2 years ago

Glosses: English and Chinese glosses in the full trees cannot be used--Mike has Andi's automatically calculated glosses for English and Chinese mapped for YTB & ClearSuite that we should be able to use

@themikejr - could you please tell me where to find these?

jonathanrobie commented 2 years ago

Semantic Roles: Frame="{A0:010010310021; A1:010010310041;}"

@rkjtan I assume we also want FrameGloss?

rkjtan commented 2 years ago

For SubjRef & Ref, there are only two known typos right now.

In Dan 4:29, the "SubjRef" value 270040290241 on the two verbs with MorphId 270040290281 (יִצְבֵּא‎) and 270040290291 (יִתְּנִנּ‎) is a typo: it should be 270040290231 ("the Most High"), not 270040290241.

In Est 4:4, the "SubjRef" value 170040040092 (on the verb תִּשְׁלַח‎ with MorphId 170040040102) is a typo: it should be 170040040082, not 170040040092.
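The two corrections above could be applied mechanically when porting the data. The following is a rough sketch, not a tested migration script: the function name `fix_subjrefs` is hypothetical, and it assumes that each verb is a `Node` element carrying both `morphId` and `SubjRef` attributes.

```python
import xml.etree.ElementTree as ET

# verb morphId -> (wrong SubjRef Id, corrected Id), taken from the two typos above
SUBJREF_FIXES = {
    "270040290281": ("270040290241", "270040290231"),
    "270040290291": ("270040290241", "270040290231"),
    "170040040102": ("170040040092", "170040040082"),
}


def fix_subjrefs(root):
    """Correct the known SubjRef typos in place; return how many were changed."""
    changed = 0
    for node in root.iter("Node"):
        fix = SUBJREF_FIXES.get(node.get("morphId"))
        subj = node.get("SubjRef")
        if fix and subj and fix[0] in subj:
            node.set("SubjRef", subj.replace(fix[0], fix[1]))
            changed += 1
    return changed


# Tiny made-up tree, only to show the call.
demo = ET.fromstring(
    '<Tree><Node morphId="170040040102" SubjRef="{170040040092}"/></Tree>'
)
print(fix_subjrefs(demo), demo[0].get("SubjRef"))
# 1 {170040040082}
```

Keying the fixes on the verb's own morphId keeps the replacement from touching other words that legitimately refer to the same Id.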

rkjtan commented 2 years ago

On the glosses, I had sent you an email with a link to Andi’s tsv file on Dropbox & also to Ulrik's application of these to Andi's version of the OSHB trees. Mike may have another source.

jonathanrobie commented 2 years ago

Except for the typos mentioned above and the glosses from Mike, this should be close:

annotations.xml.zip

Here is the query I used to generate this:

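(: Copy only the annotation attributes that carry real data from a terminal node. :)
declare function local:annotations($n)
{
  <node>{
      $n ! (
      @morphId,
      @StrongNumberX,
      @Vocative[.="True"],        (: ignore Vocative="False" :)
      @SenseNumber[. ne "0"],     (: "0" marks function words without sense disambiguation :)

      @Frame[. ne "{}"],          (: skip empty values :)
      @FrameGloss,

      @Ref[. ne "{}"],
      @SubjRef[. ne "{}"],

      @Greek[. ne ""],
      @GreekStrong[. ne "0"]
      )
  }</node>
};

(: One <node> per terminal Node (no child elements), ordered by morphId. :)
<annotations>{
  for $n in //Node[empty(*)]
  order by $n/@morphId
  return local:annotations($n)
}</annotations>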
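
For anyone without an XQuery processor at hand, roughly the same selection can be sketched in Python with the standard library. This is an approximation rather than the query actually used: the function name `annotations` and the `EMPTY` table are illustrative, and it assumes, as the query does, that terminal `Node` elements have no child elements.

```python
import xml.etree.ElementTree as ET

# attribute -> value meaning "nothing to report" (None = keep whenever present)
EMPTY = {
    "morphId": None, "StrongNumberX": None, "FrameGloss": None,
    "SenseNumber": "0", "Frame": "{}", "Ref": "{}", "SubjRef": "{}",
    "Greek": "", "GreekStrong": "0",
}


def annotations(root):
    """Build an <annotations> element with one <node> per terminal Node."""
    out = ET.Element("annotations")
    leaves = [n for n in root.iter("Node") if len(n) == 0]  # no child elements
    for n in sorted(leaves, key=lambda n: n.get("morphId", "")):
        kept = {k: v for k, v in n.attrib.items()
                if k in EMPTY and v != EMPTY[k]}
        if n.get("Vocative") == "True":  # Vocative="False" is ignored
            kept["Vocative"] = "True"
        ET.SubElement(out, "node", kept)
    return out
```

Calling `ET.tostring(annotations(ET.parse(path).getroot()))`, with `path` pointing at one of the original tree files, would serialize the result.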

rkjtan commented 2 years ago

If we are not doing glosses for the Ref or SubjRef, perhaps we shouldn't do glosses for the Frame in the XML either? One way we could go is to use the Ids strictly as an index to find the right words, including any glosses you have for them. This would show up nicely in a UI. However, we might be clogging up the XML file with lots of glosses if we add glosses for every annotation. What do you think?

jonathanrobie commented 2 years ago

Inline glosses are very useful for queries and also to help people who do not read Hebrew or Greek have some idea what is in the tree. I would really prefer glosses on the words themselves.

For Ref, SubjRef, and Frame, using an external dictionary may be the right way to go.

rkjtan commented 2 years ago

One important caveat for Ref, SubjRef, & Frame should be stated: At the time I was doing the work, I used Id because I knew that nodeIds could change & lead to errors more easily. However, when a referent is actually a phrase rather than a single word, the Id doesn't cover the whole phrase. In other words, whenever the word that I refer to in Ref, SubjRef, or Frame is part of a larger phrase, it is really the phrase node that is the referent & not just the single word identified by the Id. For example, if the referent is the phrase "Jesus Christ" rather than just "Jesus," I was only able to refer to the head of the phrase, "Jesus," with the Id. By going up to the full noun phrase with "Jesus" as head, you find the full referent.

The most extreme example in the OT is Ps 49:14-15. There, "Ref", "SubjRef", and "Frame" on a number of morphemes in these two verses (190490140022, 190490140071, 190490150031, 190490150052, 190490150072, 190490150113, 190490150152) refer to MorphId 190490140052, which is a "prep." This is because it belongs to a pp that actually functions as a noun, "those after them/their followers." So the whole pp that functions as an np is the referent.

So, in the UI, we need to make sure that the Ids for Ref, SubjRef, & Frame by default result in the selection of the whole phrase node of which the word with the Id is the head. I am not sure what the best way to do this is for the GitHub release of the data itself (perhaps switch over to nodeId at a later point, according to the rule I express here, once nodeIds are stable).

jonathanrobie commented 2 years ago

Regarding phrase references (https://github.com/Clear-Bible/macula-hebrew/issues/3#issuecomment-1080844802): I think we need a good way to reference subtrees. One way might be to use the first and last morphemes that a subtree spans as a reference to their least common ancestor. Would that be sufficient?

For instance, lca(010010010011, 010010010012) could refer to the least common ancestor of these two nodes in the following subtree:

<Node Cat="pp" Start="0" End="1" Rule="PrepNp" Head="1" nodeId="010010010010060" Length="6">
  <Node Cat="pp" Start="0" End="0" Rule="P2PP" Head="0" nodeId="010010010010011" Length="1">
    <Node n="010010010011" Cat="prep" Start="0" End="0" Length="1" morphId="010010010011" Unicode="בְּ" nodeId="010010010010010">
      <m n="010010010011" morph="R" lang="H" lemma="b" pos="preposition">בְּ</m>
    </Node>
  </Node>
  <Node Cat="np" Start="1" End="1" Rule="N2NP" Head="0" nodeId="010010010020051" Length="5">
    <Node n="010010010012" Cat="noun" Start="1" End="1" Length="5" morphId="010010010012" Unicode="רֵאשִׁית" nodeId="010010010020050">
      <m n="010010010012" morph="Ncfsa" lang="H" lemma="7225" after=" " pos="noun" type="common" gender="feminine" number="singular" state="absolute">רֵאשִׁית</m>
    </Node>
  </Node>
</Node>
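
A possible reading of `lca` for the subtree above, as a minimal Python sketch. The function and variable names are illustrative, and the sample is abridged from the XML above (the `m` children and most attributes are dropped) to keep it short.

```python
import xml.etree.ElementTree as ET

# Abridged from the subtree above: only the attributes needed here.
SAMPLE = """
<Node Cat="pp" Rule="PrepNp" Head="1" nodeId="010010010010060">
  <Node Cat="pp" Rule="P2PP" Head="0" nodeId="010010010010011">
    <Node Cat="prep" morphId="010010010011" nodeId="010010010010010"/>
  </Node>
  <Node Cat="np" Rule="N2NP" Head="0" nodeId="010010010020051">
    <Node Cat="noun" morphId="010010010012" nodeId="010010010020050"/>
  </Node>
</Node>
""".strip()


def lca(root, first_id, last_id):
    """Return the lowest node that dominates both morphemes, or None."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}
    by_id = {n.get("morphId"): n for n in root.iter("Node") if n.get("morphId")}

    def chain(node):
        """The node followed by all of its ancestors, innermost first."""
        path = [node]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        return path

    ancestors_of_last = chain(by_id[last_id])
    for node in chain(by_id[first_id]):
        if node in ancestors_of_last:
            return node
    return None


root = ET.fromstring(SAMPLE)
print(lca(root, "010010010011", "010010010012").get("nodeId"))
# 010010010010060
```

When both Ids are the same, this returns the terminal node itself, so single-word references would keep working unchanged.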

rkjtan commented 2 years ago

That might work. Note, however, that a prepositional phrase (with the one exception in Ps49:14-15 in the OT & an unknown small number of cases in the NT) is usually not going to be a Ref or SubjRef (Frame might be different). A good example is:

            <Node Cat="S" Start="2" End="3" Rule="Np2S" Head="0" nodeId="010020070060091" Length="9">
              <Node Cat="np" Start="2" End="3" Rule="Np-Appos" Head="0" nodeId="010020070060090" Length="9">
                <Node Cat="np" Start="2" End="2" Rule="N2NP" Head="0" nodeId="010020070060041" Length="4">
                  <Node n="010020070021" Cat="noun" Start="2" End="2" Length="4" morphId="010020070021" Unicode="יְהוָ֨ה" nodeId="010020070060040"><m n="010020070021" lang="H" after=" " lemma="3068" morph="Np" id="01pPp" pos="noun" type="proper">יְהוָ֨ה</m></Node>
                </Node>
                <Node Cat="np" Start="3" End="3" Rule="N2NP" Head="0" nodeId="010020070100051" Length="5">
                  <Node n="010020070031" Cat="noun" Start="3" End="3" Length="5" morphId="010020070031" Unicode="אֱלֹהִ֜ים" nodeId="010020070100050"><m n="010020070031" lang="H" after=" " lemma="430" morph="Ncmpa" id="01ieN" pos="noun" type="common" gender="masculine" number="plural" state="absolute">אֱלֹהִ֜ים</m></Node>
                </Node>
              </Node>
            </Node> 

"Yahweh God" is the implied subject of the verb "he breathed" (n="010020070082"), but the SubjRef just uses 010020070021 for Yahweh because Yahweh is the head of the noun phrase. Head, Start, & End are zero-based. So, Start="2" End="3" tells us that words 3 & 4 in the verse are in the phrase. Head="0" tells us that the first word inside the phrase, "Yahweh," is the head of the noun phrase. We can safely take the whole noun phrase as the referent. You could theoretically use the nodeId. However, nodeId="010020070060090" is currently consonant based. If it were word based, it would be 010020070030020. For the Ref & SubjRef, theoretically prepositional phrases in the node with Cat="pp" Rule="PrepNp" also still have the head noun as the head. So, we could end up with systematically bringing in the prepositional phrase when only the noun phrase that is the object of the preposition is the referent. We will want to make the head of the prepositional phrase consistently the preposition to avoid this problem.

jonathanrobie commented 2 years ago

Let's raise a new issue for references to things larger than single words.

https://github.com/Clear-Bible/macula-hebrew/issues/10

For now, the references are what they are.