CambridgeSemiticsLab / nena_corpus

The NENA corpus in plain-text markup
Creative Commons Attribution 4.0 International

Character Table Needed #2

Closed codykingham closed 4 years ago

codykingham commented 4 years ago

We need to store character tables in an explicit .py file that can be referenced throughout the pipeline. This matters most when defining what counts as a "letter" in our corpus, a concept used both when parsing the .html source files and in the .nena files themselves. The NENA corpus contains a diverse set of characters that must be tracked carefully and exactly, and we should be able to modify or add codes easily. Essentially, we need a dict on hand that provides all "letter" characters for building regex strings. This might be done with a join:

import re
letter = re.compile('|'.join(re.escape(c) for c in letter_set))
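
As a rough sketch of what such a module could look like: the dict name `CHAR_TABLE`, the helper `build_letter_regex`, and the entries below are all hypothetical placeholders, not the actual NENA inventory. The one design choice worth noting is sorting the alternatives longest-first, on the assumption that some letters are stored as a base character plus a combining diacritic and should match as a unit.

```python
import re

# Hypothetical character table: letter -> description.
# These entries are placeholders, not the actual NENA inventory.
CHAR_TABLE = {
    "a": "plain vowel",
    "b": "plain consonant",
    "š": "s with caron (precomposed)",
    "t\u0323": "t with combining dot below",
}


def build_letter_regex(letters):
    """Compile an alternation that matches any one letter in `letters`."""
    # Longest first, so a base letter plus combining mark is tried
    # before any shorter alternative that shares its first character.
    ordered = sorted(letters, key=len, reverse=True)
    # re.escape guards against entries that are regex metacharacters.
    return re.compile("|".join(re.escape(c) for c in ordered))


letter = build_letter_regex(CHAR_TABLE)

# Split a string into the letters the table knows about.
print(letter.findall("bat\u0323a"))
```

Any script in the pipeline could then import `letter` (or the table itself) from the one module, so adding a character means editing a single dict.
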

codykingham commented 4 years ago

We might also store char tables in a simple .tsv file.
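
A minimal sketch of the .tsv route, assuming a two-column layout (character, description) with a header row. The column layout, the function name `load_char_table`, and the sample rows are assumptions for illustration only.

```python
import csv
import io


def load_char_table(handle):
    """Read a two-column TSV (character, description) into a dict."""
    # QUOTE_NONE keeps quote characters literal, in case a quote mark
    # is itself an entry in the table.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row
    return {row[0]: row[1] for row in reader if row}


# In-memory stand-in for a hypothetical chars.tsv; with a real file, use
# open(path, encoding="utf-8", newline="") instead.
sample = io.StringIO("char\tdescription\na\tplain vowel\nš\ts with caron\n")
table = load_char_table(sample)
print(table)
```

A .tsv is easier for non-programmers to edit, while the resulting dict can still feed the same regex-building code as a table defined directly in Python.
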

codykingham commented 4 years ago

@jamespstrachan This old issue is relevant to the "ground truth" problem we're discussing with regard to validating letters. Perhaps I can establish a more authoritative list based on the scripts we use to make conversions.

codykingham commented 4 years ago

Done in a36484c6d229520622664d797a000c80e6ef27f8.

codykingham commented 4 years ago

Regex also handled in 18387fd42fe1e301a4013d6b6bbdbb556e6cfb3b.