CUNY-CL / latin_scansion

Apache License 2.0

Populates full protocol buffer message with intermediate representations #70

Closed kylebgorman closed 3 years ago

kylebgorman commented 3 years ago

This PR closes #64.

We apply the (functional, deterministic) normalization grammar and the pronunciation grammar to obtain the norm and raw_pron fields of the Verse message.
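As a rough illustration of what a functional normalization step does, here is a plain-Python sketch. It is inferred only from the text and norm fields of the Aen. 1.1 example in this description (case-folding and dropping punctuation); the actual normalization is a grammar, and the function name `normalize` is hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercases, drops punctuation, and collapses runs of whitespace.

    Assumption: this covers only what the Aen. 1.1 example shows; the real
    normalization grammar may do more (or less) than this.
    """
    lowered = text.lower()
    # Unicode general categories starting with "P" are punctuation.
    kept = "".join(
        ch for ch in lowered if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(kept.split())


print(normalize("Arma virumque canō, Trojae quī prīmus ab ōris"))
```

Because each input string maps to exactly one output string, the step is functional in the sense used here.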

We then apply, in order, the variable rule, the syllable rule, the weight rule, and the hexameter rule. (The general foot rule and the hexameter rule, a filter on the foot rule which requires the foot sequence to be a hexameter, have been merged to minimize bookkeeping.) Before each application we project onto the output, so that the intermediate representations are kept around in a lattice. Once we reach the end state, we check for failure (as in defective lines). If there has been no failure, we compute the shortest path and work backwards via composition. Applying shortest path at each stage is trivial, since the paths all have the same labeling: it is just a cheap way of obtaining a string transducer. We then "chunk" the intermediate transducer lattices to obtain the alignments, and convert these into message form.
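The control flow of the cascade described above can be sketched in plain Python, with lists of strings standing in for lattices. This is an analogy only, not the transducer code in this PR: `apply_cascade`, `expand_unknown`, and `hexameter_filter` are hypothetical names, the weight strings ("L" for long, "B" for short, "?" for undecided) are a toy encoding, and the hexameter pattern is a simplification.

```python
import itertools
import re
from typing import Callable, List, Optional, Tuple

Rule = Callable[[str], List[str]]  # one input string, zero or more outputs
Path = Tuple[str, ...]             # the input plus one string per rule applied

# Simplified pattern: five feet, each dactyl (LBB) or spondee (LL), then a
# final foot whose last syllable may be long or short.
HEXAMETER = re.compile(r"(LBB|LL){5}L[LB]")


def expand_unknown(weights: str) -> List[str]:
    """Toy 'variable' rule: resolves every '?' as either L or B."""
    slots = [("L", "B") if w == "?" else (w,) for w in weights]
    return ["".join(choice) for choice in itertools.product(*slots)]


def hexameter_filter(weights: str) -> List[str]:
    """Toy filter: passes the string through only if it scans."""
    return [weights] if HEXAMETER.fullmatch(weights) else []


def apply_cascade(text: str, rules: List[Rule]) -> Optional[Path]:
    """Applies the rules in order, keeping every intermediate string."""
    paths: List[Path] = [(text,)]
    for rule in rules:
        # Extend each surviving path with every output the rule allows.
        paths = [path + (out,) for path in paths for out in rule(path[-1])]
    # Failure is checked once, at the end: no surviving path means the
    # line is treated as defective.
    if not paths:
        return None
    # Stand-in for shortest path: pick one surviving analysis.
    return paths[0]


print(apply_cascade("LBBLBBLLLL?BBLL", [expand_unknown, hexameter_filter]))
print(apply_cascade("LBL", [expand_unknown, hexameter_filter]))
```

The first call keeps only the resolution that scans and returns it together with the strings from each earlier stage; the second returns `None` because nothing survives the filter.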

The resulting Verse proto looks like the following (for Aen. 1.1):

verse {
  verse_number: 1
  text: "Arma virumque canō, Trojae quī prīmus ab ōris"
  norm: "arma virumque canō trojae quī prīmus ab ōris"
  raw_pron: "arma wirũːkwe kanoː trojjaj kwiː priːmus ab oːris"
  var_pron: "arma wirũːkwe kanoː trojjaj kwiː priːmu sa boːris"
  foot {
    syllable {
      nucleus: "a"
      coda: "r"
      weight: LONGUS
    }
    syllable {
      onset: "m"
      nucleus: "a"
      weight: BREVIS
    }
    syllable {
      onset: "w"
      nucleus: "i"
      weight: BREVIS
    }
    type: DACTYL
  }
  foot {
    syllable {
      onset: "r"
      nucleus: "ũː"
      weight: LONGUS
    }
    syllable {
      onset: "kw"
      nucleus: "e"
      weight: BREVIS
    }
    syllable {
      onset: "k"
      nucleus: "a"
      weight: BREVIS
    }
    type: DACTYL
  }
  foot {
    syllable {
      onset: "n"
      nucleus: "oː"
      weight: LONGUS
    }
    syllable {
      onset: "tr"
      nucleus: "o"
      coda: "j"
      weight: LONGUS
    }
    type: SPONDEE
  }
  foot {
    syllable {
      onset: "j"
      nucleus: "a"
      coda: "j"
      weight: LONGUS
    }
    syllable {
      onset: "kw"
      nucleus: "iː"
      weight: LONGUS
    }
    type: SPONDEE
  }
  foot {
    syllable {
      onset: "pr"
      nucleus: "iː"
      weight: LONGUS
    }
    syllable {
      onset: "m"
      nucleus: "u"
      weight: BREVIS
    }
    syllable {
      onset: "s"
      nucleus: "a"
      weight: BREVIS
    }
    type: DACTYL
  }
  foot {
    syllable {
      onset: "b"
      nucleus: "oː"
      weight: LONGUS
    }
    syllable {
      onset: "r"
      nucleus: "i"
      coda: "s"
      weight: LONGUS
    }
    type: SPONDEE
  }
}
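
The message above can be sanity-checked with a few lines of plain Python. The sketch below copies the syllables by hand into a hypothetical `Syllable` dataclass (it does not use the generated protobuf classes) and checks three things: the syllable segments concatenate to var_pron without its spaces, the weights agree with a simplified weight test, and the weight sequences give the listed foot types. The weight test (closed syllable or long-vowel nucleus) is an assumption that ignores diphthongs; it is not the weight rule from this PR.

```python
import unicodedata
from dataclasses import dataclass
from typing import List

LONGUS, BREVIS = "LONGUS", "BREVIS"


@dataclass
class Syllable:
    onset: str
    nucleus: str
    coda: str
    weight: str

    def segments(self) -> str:
        return self.onset + self.nucleus + self.coda

    def looks_heavy(self) -> bool:
        # Assumption: heavy if closed or if the nucleus is marked long.
        return bool(self.coda) or "ː" in self.nucleus


def syl(onset: str, nucleus: str, coda: str = "") -> Syllable:
    """Builds a syllable, copying the weight pattern seen in the message."""
    heavy = bool(coda) or "ː" in nucleus
    return Syllable(onset, nucleus, coda, LONGUS if heavy else BREVIS)


# Syllables of Aen. 1.1, grouped into feet as in the message.
feet: List[List[Syllable]] = [
    [syl("", "a", "r"), syl("m", "a"), syl("w", "i")],
    [syl("r", "ũː"), syl("kw", "e"), syl("k", "a")],
    [syl("n", "oː"), syl("tr", "o", "j")],
    [syl("j", "a", "j"), syl("kw", "iː")],
    [syl("pr", "iː"), syl("m", "u"), syl("s", "a")],
    [syl("b", "oː"), syl("r", "i", "s")],
]

# Only the two foot types that occur in this verse are listed.
FOOT_TYPES = {
    (LONGUS, BREVIS, BREVIS): "DACTYL",
    (LONGUS, LONGUS): "SPONDEE",
}


def foot_type(foot: List[Syllable]) -> str:
    return FOOT_TYPES.get(tuple(s.weight for s in foot), "UNKNOWN")


def nfc(text: str) -> str:
    # Normalize so composed and decomposed diacritics compare equal.
    return unicodedata.normalize("NFC", text)


var_pron = "arma wirũːkwe kanoː trojjaj kwiː priːmu sa boːris"
flat = "".join(s.segments() for foot in feet for s in foot)

assert nfc(flat) == nfc(var_pron.replace(" ", ""))
assert all(s.looks_heavy() == (s.weight == LONGUS) for f in feet for s in f)
print([foot_type(f) for f in feet])
```

The first assertion reflects the resyllabification visible in var_pron ("priːmu sa boːris"): syllable boundaries follow the sounds, not the word boundaries of the original text.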

Processing each Aeneid book takes just under 2s on my cheap laptop.

Known limitations of this PR: