biblicalhumanities / treedown

Markdown for syntax trees - see http://jonathanrobie.biblicalhumanities.org/blog/2017/05/12/lowfat-treebanks-visualizing/
Apache License 2.0

label(s) for underspecified roles #22

Open jtauber opened 7 years ago

jtauber commented 7 years ago

As part of an initial incremental analysis or maybe even as a learning exercise, we may want to:

Or put briefly: add a way to just say "this is a constituent" or "this is an argument".

rkjtan commented 7 years ago

Below is a general outline of how I think things might work. It is definitely not complete and may not work in the form I propose; I lay it out for brainstorming purposes. I'm thinking aloud about how best to build a system that allows seamless integration between different levels of semi-automated analysis & complete manual analysis, depending on what data one has to work with (especially whether morphological analysis is available):

I. Semi-automated analysis followed by manual correction/supplementation

If one starts with an initial pass using automated parsing with morphology, then the steps are as follows (what I list below are the broad requirements; more details would need to be specified to make the grammar for the parser work):

Step 1: Analyze into separate predications
- Every verb is the core of a minimal predication
- Proposed convention:
  - Automatically labeled, but hidden = P
  - Automatically labeled, but revealed = V
  - Verb type also automatically labeled (if morphological parsing available) = Indicative, Imperative, Subjunctive, Optative, Participle, Infinitive
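A minimal sketch of Step 1, assuming tokens already carry a coarse part-of-speech tag and, where morphological parsing is available, a mood. The `Token` class, its field names, and the record keys are hypothetical and not part of treedown.

```python
from dataclasses import dataclass
from typing import Optional

# Verb types listed in Step 1.
VERB_TYPES = {"Indicative", "Imperative", "Subjunctive",
              "Optative", "Participle", "Infinitive"}

@dataclass
class Token:
    form: str
    pos: str                    # hypothetical coarse tag, e.g. "verb", "noun"
    mood: Optional[str] = None  # one of VERB_TYPES when parsing is available

def label_predications(tokens):
    """Return one record per verb: hidden P, revealed V, optional verb type."""
    records = []
    for i, tok in enumerate(tokens):
        if tok.pos != "verb":
            continue
        records.append({
            "index": i,
            "hidden": "P",    # labeled automatically, not shown
            "revealed": "V",  # labeled automatically, shown
            "verb_type": tok.mood if tok.mood in VERB_TYPES else None,
        })
    return records
```

Without morphological parsing, `verb_type` stays `None` and only the P/V labels are produced.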

Step 2: Determine components belonging to each core predication
- A conjunction between two verbs is used to divide the components between them (conjunctions between words or phrases, by contrast, typically conjoin words of the same word class & same case, most often nouns with nouns & adjectives with adjectives)
- Any verb forms a new core predication, whether preceded by a conjunction or not
- Components before the conjunction belong to the verb before the conjunction & components after the conjunction belong to the verb after the conjunction (if there is no conjunction, by default put components before a second verb with the first of the two verbs & components after the second verb with the second of the two verbs)
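The default assignment in Step 2 could be sketched roughly as follows. The tag names (`"verb"`, `"conj"`) and the function name are hypothetical, and two simplifications are assumed: tokens before the first verb go with that verb, and word-level conjunctions (noun with noun, adjective with adjective) are not distinguished from clause-level ones.

```python
def assign_to_predications(tags):
    """Map each token index to the index of the verb whose predication it joins.

    `tags` is a list of coarse tags such as "verb", "conj", "noun".
    """
    verbs = [i for i, t in enumerate(tags) if t == "verb"]
    owner = {}
    for i, tag in enumerate(tags):
        if tag == "verb":
            owner[i] = i  # every verb is the core of its own predication
            continue
        prev_v = max((v for v in verbs if v < i), default=None)
        next_v = min((v for v in verbs if v > i), default=None)
        if prev_v is None or next_v is None:
            # Only one neighbouring verb (or none): take whichever exists.
            owner[i] = next_v if prev_v is None else prev_v
        elif "conj" in tags[prev_v + 1 : i + 1]:
            # A conjunction has been passed: belongs to the following verb.
            owner[i] = next_v
        else:
            # No conjunction yet: stays with the preceding verb.
            owner[i] = prev_v
    return owner
```

The conjunction itself is grouped with the following verb here; that is a design choice of this sketch, not something the outline specifies.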

Step 3: Determine typical phrase level structure (many exceptions, but try to capture as many as possible automatically)
- Adjectives adjectivally modify immediately adjacent nouns that match in case, gender, number
- Genitive nouns restrict non-genitive nouns that immediately precede
- Articles modify immediately following nouns (or noun phrases, if adjectives & genitives already attached to noun) or adjectives that match in case, gender, number
- Nouns (or noun phrases, if adjectives & genitives already attached to noun) are in apposition to immediately following nouns (or noun phrases, if adjectives & genitives already attached to noun) that match in case, gender, number
- Non-nominative nouns (or noun phrases, if adjectives & genitives already attached to noun) are objects of prepositions that immediately precede
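As an illustration only, three of the phrase-level rules above (adjective, genitive, article) can be written as adjacency checks over single words. The `Word` class, the relation names, and the `(dependent, relation, head)` tuple format are hypothetical; apposition, prepositional objects, and attachment to whole noun phrases are left out of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Word:
    form: str
    pos: str          # hypothetical tags: "noun", "adj", "article", ...
    case: str = ""
    gender: str = ""
    number: str = ""

def agrees(a, b):
    """True when two words match in case, gender, and number."""
    return (a.case, a.gender, a.number) == (b.case, b.gender, b.number)

def phrase_links(words):
    """Return (dependent_index, relation, head_index) tuples."""
    links = []
    for i, w in enumerate(words):
        prev = words[i - 1] if i > 0 else None
        nxt = words[i + 1] if i + 1 < len(words) else None
        if w.pos == "adj":
            # Adjective modifies an immediately adjacent agreeing noun.
            for j, other in ((i - 1, prev), (i + 1, nxt)):
                if other is not None and other.pos == "noun" and agrees(w, other):
                    links.append((i, "adjectival", j))
                    break
        elif w.pos == "noun" and w.case == "genitive":
            # Genitive noun restricts an immediately preceding non-genitive noun.
            if prev is not None and prev.pos == "noun" and prev.case != "genitive":
                links.append((i, "genitive", i - 1))
        elif w.pos == "article":
            # Article modifies an immediately following agreeing noun or adjective.
            if nxt is not None and nxt.pos in ("noun", "adj") and agrees(w, nxt):
                links.append((i, "article", i + 1))
    return links
```

Because the checks work on single words, a genitive noun with its own article does not attach to the preceding noun here; handling that would require building noun phrases first, as the parenthetical remarks in the rules suggest.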

Step 4: Determine the presence of subjects (if any)
- Nominative nouns or pronouns that belong in the same predication as a verb & match the verb in person & number get automatically labeled S
- Special case 1: If two nominative nouns (or noun phrases, if adjectives & genitives already attached to noun) form the two arguments of the verb & the verb is a "to be" verb, one is S & one is the predicate complement
- Special case 2: If two nominative nouns (or noun phrases, if adjectives & genitives already attached to noun) are apparently separated into a predication without a verb to form the core predication, one is S & one is the predicate complement
- Special case 3: If an accusative noun is immediately adjacent to an infinitive, it may be the subject of the infinitival clause

Step 5: Analyze into arguments & adjuncts
- Prepositional phrases automatically labeled P (hidden) adjuncts
- Most accusative nouns (or noun phrases, if adjectives & genitives already attached to noun) automatically labeled P (hidden) complement (hidden) patient/direct object
- Most dative nouns (or noun phrases, if adjectives & genitives already attached to noun) automatically labeled P (hidden) complement (hidden) recipient/indirect object
  - Ask whether animate or inanimate: if inanimate, the dative noun automatically switches to P (hidden) adjuncts instead
  - Ask whether location or time: if location or time, the dative or accusative noun automatically switches to P (hidden) adjuncts locative/temporal (according to whether location or time is indicated) instead
- If a predication has either a complement (hidden) patient/direct object or a complement (hidden) recipient/indirect object, the verb in the predication gets the additional label transitive; all other verbs automatically get labeled intransitive
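The argument/adjunct defaults above amount to a small decision table. A rough sketch, assuming the answers to the "animate?" and "location or time?" questions are passed in as parameters; the function names, the `kind` values, and the tuple representation of label paths are hypothetical.

```python
def label_constituent(kind, animate=True, sense=None):
    """Return the default label path for a constituent.

    kind:    "pp", "accusative", or "dative"
    animate: answer to the animate/inanimate question (used for datives)
    sense:   None, "location", or "time"
    """
    if kind == "pp":
        return ("P", "adjunct")
    if kind not in ("accusative", "dative"):
        raise ValueError(f"no default rule for {kind!r}")
    # Location or time overrides the complement reading for both cases.
    if sense == "location":
        return ("P", "adjunct", "locative")
    if sense == "time":
        return ("P", "adjunct", "temporal")
    # An inanimate dative switches to a plain adjunct.
    if kind == "dative" and not animate:
        return ("P", "adjunct")
    role = ("patient/direct object" if kind == "accusative"
            else "recipient/indirect object")
    return ("P", "complement", role)

def transitivity(label_paths):
    """A verb is transitive if its predication contains any complement."""
    has_complement = any(len(p) > 1 and p[1] == "complement" for p in label_paths)
    return "transitive" if has_complement else "intransitive"
```

For example, `label_constituent("dative", animate=False)` gives `("P", "adjunct")`, and a predication containing only such adjuncts leaves its verb labeled intransitive.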

II. Manual analysis with semi-automated assistance

If one is proceeding manually, then the higher labels like S, V, Patient/Direct Object, Recipient/Indirect Object would imply the lower hidden labels, which would be added in automatically. Maybe as a check on manual analysis, if a user is using a more sophisticated editor tool, it would ask questions like "animate?" or "accusative?" when the user tries to label something as Patient/Direct Object. If the answer is no, the user can indicate it is an exception & maybe even add a notation on what/why. (Likewise with Recipient/Indirect Object on whether animate or dative.) If a verb has a Patient/Direct Object or Recipient/Indirect Object, it is automatically labeled transitive. Users can also choose to change verbs automatically labeled intransitive to transitive & to indicate an elided Patient/Direct Object &/or Recipient/Indirect Object.
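One way to picture the manual direction is a lookup from each higher label to the hidden labels it implies, plus a table of expected features that an editor could turn into questions. Everything here is hypothetical: the table contents, the feature names, and which question goes with which label are assumptions drawn from the automated defaults. S is omitted because its hidden labels are not spelled out in this outline.

```python
# Hidden labels implied by each manually applied label (assumed mapping).
HIDDEN = {
    "V": ("P",),
    "Patient/Direct Object": ("P", "complement"),
    "Recipient/Indirect Object": ("P", "complement"),
}

# Features an editor might ask about before accepting a label (assumed).
EXPECTED = {
    "Patient/Direct Object": {"case": "accusative"},
    "Recipient/Indirect Object": {"case": "dative", "animate": True},
}

def expand(label):
    """Prefix a manual label with the hidden labels it implies."""
    return HIDDEN.get(label, ()) + (label,)

def failed_checks(label, features):
    """Return the names of expected features that the constituent lacks.

    A non-empty result is where the editor would ask the user to confirm
    an exception and optionally record a note.
    """
    expected = EXPECTED.get(label, {})
    return [name for name, want in expected.items()
            if features.get(name) != want]
```

The same two functions could run either inside an editor while annotating or in a post-processing script, provided the core labels are applied consistently.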

Users can use any tool to build their files & the semi-automated assistance could come during the process of annotation (if using a more sophisticated editor tool) or post-processing (provided the core labels are consistently applied, the additional data can be automatically added in by a post-processing script).

jonathanrobie commented 7 years ago

I opened a new issue (#23) for the editing environment Randall describes here.