Clear-Bible / macula-greek

Syntax trees, morphology, and linguistic annotations for the Greek Bible

Extracting domains and senses from MARBLE #21

Closed klosoter closed 2 years ago

klosoter commented 2 years ago

We plan to construct the data we use from MARBLE in the following way:

  1. We collect //LEXMeanings from the lexicon files and group them by LEXReference, the 'marbleId' so to speak.
  2. We also collect English glosses from //LEXSense[@LanguageCode="en"]//Gloss, domains from //LEXDomain, and the entry code from @EntryCode (and possibly the subdomain from //LEXSubDomain).
  3. We then look up the (sub)domain identifiers in SDBG-DOMAINS1.XML
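
A minimal sketch of steps 1 and 2, using only the Python standard library. The sample XML is hypothetical: the element names come from the XPath expressions above, but the nesting, the reference values ("ref-1", "ref-2"), the domain label, and the function name `collect_meanings` are assumptions for illustration and are not copied from the MARBLE files.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample; the real lexicon files may nest these elements differently.
SAMPLE = """
<Lexicon_Entry Lemma="παραδίδωμι">
  <LEXMeanings>
    <LEXMeaning EntryCode="33.237">
      <LEXDomains><LEXDomain>Communication</LEXDomain></LEXDomains>
      <LEXSubDomains><LEXSubDomain>Teach</LEXSubDomain></LEXSubDomains>
      <LEXSenses>
        <LEXSense LanguageCode="en">
          <Glosses><Gloss>to instruct</Gloss><Gloss>to teach</Gloss></Glosses>
        </LEXSense>
      </LEXSenses>
      <LEXReferences>
        <LEXReference>ref-1</LEXReference>
        <LEXReference>ref-2</LEXReference>
      </LEXReferences>
    </LEXMeaning>
  </LEXMeanings>
</Lexicon_Entry>
"""


def collect_meanings(xml_text):
    """Group the data of each LEXMeaning under every LEXReference it lists."""
    root = ET.fromstring(xml_text)
    by_ref = defaultdict(list)
    for meaning in root.iter("LEXMeaning"):
        record = {
            "entry_code": meaning.get("EntryCode"),
            "glosses": [
                g.text
                for g in meaning.findall(".//LEXSense[@LanguageCode='en']//Gloss")
            ],
            "domains": [d.text for d in meaning.findall(".//LEXDomain")],
            "subdomains": [s.text for s in meaning.findall(".//LEXSubDomain")],
        }
        for ref in meaning.findall(".//LEXReference"):
            by_ref[ref.text].append(record)
    return dict(by_ref)


meanings_by_ref = collect_meanings(SAMPLE)
```

Step 3, the lookup in SDBG-DOMAINS1.XML, is left out because the layout of that file is not described here.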

Assuming this is all connected, we add the following attributes to our tree words:

When there are multiple entries/senses/domains for one word/ref, we concatenate them with "|"
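
As a rough illustration of the "|" concatenation, assuming each entry/sense/domain for a word is held as a small dict with the same keys. The helper name `merge_for_word` and the keys used in the example are hypothetical.

```python
def merge_for_word(records, sep="|"):
    """Join the values of several records for one word into single attribute strings."""
    if not records:
        return {}
    # Assumes every record carries the same keys as the first one.
    return {key: sep.join(r[key] for r in records) for key in records[0]}


merged = merge_for_word(
    [
        {"domain": "010002", "gloss": "first gloss"},
        {"domain": "033003", "gloss": "second gloss"},
    ]
)
```

One thing to watch with this approach: a gloss that itself contains "|" could not be split back out reliably.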

klosoter commented 2 years ago

We need the SubDomain too. How would we add that to the domain attribute?

domain="DomainNumber:DomainName:SubDomainName" or domain="SubDomainNumber:DomainName:SubDomainName" since the subdomain number includes the domain number?

ryderwishart commented 2 years ago

> We need the SubDomain too. How would we add that to the domain attribute?
>
> domain="DomainNumber:DomainName:SubDomainName" or domain="SubDomainNumber:DomainName:SubDomainName" since the subdomain number includes the domain number?

I like the idea of putting the number first.

klosoter commented 2 years ago

In both cases, the number is first. I'm thinking about which number to use. Example: Domain: "Physical Impact" DomainNumber: 019 SubDomain: "Press" SubDomainNumber: 019005

So, I guess it makes sense to use only the SubDomainNumber since it contains (expands) the DomainNumber. And then, number first:

<w domain="019005:Physical Impact:Press" sdbg="..."/>
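
A small sketch of why the SubDomainNumber alone is enough, assuming it is always six digits with the three-digit DomainNumber as its prefix (as in the 019 / 019005 example). The function name is hypothetical.

```python
def split_subdomain_number(subdomain_number):
    """Return (domain_number, subdomain_part) for a six-digit subdomain number."""
    if len(subdomain_number) != 6 or not subdomain_number.isdigit():
        raise ValueError(f"expected six digits, got {subdomain_number!r}")
    # The first three digits repeat the domain number.
    return subdomain_number[:3], subdomain_number[3:]


domain_number, subdomain_part = split_subdomain_number("019005")
```
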
klosoter commented 2 years ago

But is there a more preferable way to add both domain names?

ryderwishart commented 2 years ago

Ah, that makes more sense. I would think it would be best to simply use the subdomain number alone. Unless we want to have everything there, then I would think you could do domain#:subdomain#, or just do domain="domain#:domainlabel" subdomain="subdomain#:subdomainlabel".

klosoter commented 2 years ago

After checking the MARBLE domains for the OT, I noticed that in the NT we currently use only the subdomain label.

I suggest expanding the domain attribute to include the domain label.

The format then changes

from this: domain="subdomainID:subdomainLabel"

domain="001014:Population Centers"

to this: domain="subdomainID:domainLabel:subdomainLabel"

domain="001014:Geographical Objects and Features:Population Centers"
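
A sketch of that format change, assuming a lookup from three-digit domain number to domain label is available. The `DOMAIN_LABELS` dict and the function name are hypothetical stand-ins for whatever the domains file provides.

```python
# Hypothetical lookup; in practice this would be built from the domains file.
DOMAIN_LABELS = {"001": "Geographical Objects and Features"}


def expand_domain_attr(value, domain_labels):
    """Turn 'subdomainID:subdomainLabel' into 'subdomainID:domainLabel:subdomainLabel'."""
    # Split only on the first colon so the label is kept whole.
    subdomain_id, subdomain_label = value.split(":", 1)
    domain_label = domain_labels[subdomain_id[:3]]
    return f"{subdomain_id}:{domain_label}:{subdomain_label}"


expanded = expand_domain_attr("001014:Population Centers", DOMAIN_LABELS)
```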
jonathanrobie commented 2 years ago

Yes, please do.

jonathanrobie commented 2 years ago

On second thought ... how about:

domain="subdomainID:domainLabel:subdomainLabel"
domain="001014:Population Centers"
jonathanrobie commented 2 years ago

Looking at SDBG, I currently see this:

                     <w role="v"
                        ref="LUK 1:2!2"
                        class="verb"
                        xml:id="n42001002002"
                        lemma="παραδίδωμι"
                        normalized="παρέδοσαν"
                        strong="3860"
                        number="plural"
                        person="third"
                        tense="aorist"
                        voice="active"
                        mood="indicative"
                        head="true"
                        domain="033017;Teach"
                        sdbg="παραδίδωμι;33.237;to instruct, to teach">παρέδοσαν</w>

There are two hierarchies in this. 033017 corresponds to "Q Teach (33.224-33.250)" in this index (Q is the 17th letter of the alphabet):

https://www.laparola.net/greco/louwnida.php#1

Thus, 033017 contains 33.237. We effectively have 3 levels of domain: 33, 17, 237, where the numbering systems for levels 2 and 3 are independent and overlapping. The "33" part of "33017" and "33.237" has the same meaning.
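
To make the three levels concrete, here is a sketch that pulls them out of the two values. It assumes a six-digit domain and a purely numeric ln of the form "33.237". The letter mapping only covers subdomains 1 to 26, and the function name is hypothetical.

```python
import string


def domain_levels(domain, ln):
    """Return (top_level, subdomain_index, subdomain_letter, entry_number)."""
    if len(domain) != 6 or not domain.isdigit():
        raise ValueError(f"expected a six-digit domain, got {domain!r}")
    top, sub = int(domain[:3]), int(domain[3:])
    ln_top, entry = (int(part) for part in ln.split("."))
    if top != ln_top:
        # The leading number must mean the same thing in both values.
        raise ValueError(f"{domain!r} and {ln!r} disagree on the top-level domain")
    # Letters are only defined here for subdomains 1..26 (A..Z).
    letter = string.ascii_uppercase[sub - 1] if 1 <= sub <= 26 else None
    return top, sub, letter, entry


levels = domain_levels("033017", "33.237")
```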

For today, at least, let's make this:

domain="033017"
ln="33.237"
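
A sketch of that change on a single word element, assuming the current values keep the layout shown above ("number;label" in domain, "lemma;number;gloss" in sdbg). Whether sdbg is kept or dropped is not settled here, so this sketch leaves it in place. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def simplify_attributes(w):
    """Reduce domain to its number and add ln from the second field of sdbg."""
    domain = w.get("domain")
    if domain:
        w.set("domain", domain.split(";", 1)[0])
    sdbg = w.get("sdbg")
    if sdbg:
        parts = sdbg.split(";")
        if len(parts) >= 2:
            w.set("ln", parts[1])
    return w


word = ET.fromstring(
    '<w domain="033017;Teach" '
    'sdbg="παραδίδωμι;33.237;to instruct, to teach">παρέδοσαν</w>'
)
simplify_attributes(word)
```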
ryderwishart commented 2 years ago

@klosoter, want me to do this or do you want to take care of it?

klosoter commented 2 years ago

If you have the time, please!

klosoter commented 2 years ago

It seems that some domain and ln data got lost:

Multiple entries got joined into one:

domain="010002033003" ln="10.24"

should be

domain="010002 033003" ln="10.24 33.19"

And domains with only 3 digits have been given the value "" (possibly because they were treated as redundant, since ln also has these digits).
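
A sketch of a check for both problems, assuming domain values are six digits each (or a bare three-digit domain) and that domain and ln should list the same number of space-separated entries. The function names are hypothetical. Note that a missing ln value such as 33.19 cannot be recovered from the joined domain string; it can only be flagged.

```python
def split_domains(value, width=6):
    """Re-split a run of joined six-digit domains, e.g. '010002033003'."""
    if len(value) == 3:
        # Keep bare three-digit domains rather than blanking them.
        return value
    if not value or len(value) % width:
        raise ValueError(f"cannot split {value!r} into {width}-digit domains")
    return " ".join(value[i:i + width] for i in range(0, len(value), width))


def counts_match(domain, ln):
    """True when domain and ln carry the same number of entries."""
    return len(domain.split()) == len(ln.split())


fixed = split_domains("010002033003")
```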

klosoter commented 2 years ago

See the new data file here to see how the MARBLE data we use is connected to our nodes.

klosoter commented 2 years ago

See new pull request