CopticScriptorium / corpora

Public repository for Coptic SCRIPTORIUM Corpora Releases

Commas in non-splittable translation metadata #37

Open amir-zeldes opened 4 years ago

amir-zeldes commented 4 years ago

NT translation field contains commas, which are reserved for splittable translators in the CTS repo:

The Septuagint Version of the Old Testament, L.C.L. Brenton, 1851, available at <a href='https://ebible.org/eng-Brenton/'>ebible.org</a>

@lgessler has applied a patch to the CTS repo to prevent splitting if any segment is longer than 50 characters. Ultimately this metadatum should be fixed and possibly shortened, with faceted search in the repo in mind (we don't want to display a long value for users to search by). Once the commas are removed, the patch should possibly be disabled.
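For illustration, the length check described above could look roughly like the sketch below. This is not the actual patch in the CTS repo: the function name `split_translators` and the constant `MAX_SEGMENT_LENGTH` are hypothetical, and only the 50-character threshold comes from this thread.

```python
from typing import List

# Threshold taken from the comment above; the name is hypothetical.
MAX_SEGMENT_LENGTH = 50


def split_translators(value: str, max_len: int = MAX_SEGMENT_LENGTH) -> List[str]:
    """Split a comma-separated metadata value into individual names.

    If any comma-delimited segment is longer than max_len, the value is
    assumed to be free text rather than a list of names, and it is
    returned unsplit as a single item.
    """
    segments = [segment.strip() for segment in value.split(",")]
    if any(len(segment) > max_len for segment in segments):
        return [value.strip()]
    # Drop empty segments produced by stray or trailing commas.
    return [segment for segment in segments if segment]


brenton = (
    "The Septuagint Version of the Old Testament, L.C.L. Brenton, 1851, "
    "available at <a href='https://ebible.org/eng-Brenton/'>ebible.org</a>"
)
unsplit = split_translators(brenton)           # long hyperlink segment blocks the split
names = split_translators("Jane Doe, John Roe")  # short segments are split as usual
```

Note that the value here is only protected because its hyperlink segment exceeds the threshold; a comma-containing title made up entirely of short segments would still be split, which is why fixing the metadatum itself remains the real solution.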

lgessler commented 4 years ago

Adding to what @amir-zeldes wrote, the other Bible corpora (1 Corinthians, Mark, NT) have the translation value World English Bible (WEB). Something similar to that, preferably without a hyperlink (they cause all kinds of trouble that's best avoided), would be consistent and, I think, preferable unless there's a reason we want all this information in the translation field. Even L.C.L. Brenton might be the right value here. I think the rest of the data is probably better left either unexpressed under translation or moved into other metadata fields.

ctschroeder commented 4 years ago

Oh yes those commas would be a problem! I'm fine with L.C.L. Brenton 1851 in "translation" and moving more info+link to "source" or "source_info". (Source is usually the names of people; source_info might be better? But really it doesn't matter to me.)
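A rough sketch of what that change could look like on a single metadata record follows. It assumes the metadata is available as a plain dictionary; the helper `normalize_translation` is hypothetical and is not taken from the metagenerator scripts, and the choice of `source_info` as the destination field simply follows the suggestion above.

```python
def normalize_translation(meta: dict, short_value: str,
                          overflow_field: str = "source_info") -> dict:
    """Return a copy of meta with a short, comma-free translation value.

    The previous (long) translation value is preserved by moving it into
    overflow_field, appended after any text already stored there.
    """
    if "," in short_value:
        raise ValueError("short translation value must not contain commas")
    updated = dict(meta)  # leave the caller's record untouched
    long_value = updated.get("translation", "")
    updated["translation"] = short_value
    if long_value and long_value != short_value:
        existing = updated.get(overflow_field, "")
        updated[overflow_field] = f"{existing}; {long_value}" if existing else long_value
    return updated


record = {
    "translation": (
        "The Septuagint Version of the Old Testament, L.C.L. Brenton, 1851, "
        "available at <a href='https://ebible.org/eng-Brenton/'>ebible.org</a>"
    )
}
fixed = normalize_translation(record, "L.C.L. Brenton 1851")
```

Rejecting commas in the short value up front keeps the field safe for the comma-based splitting regardless of whether the length heuristic stays enabled.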

amir-zeldes commented 4 years ago

Alright, I've added it to the OT metagenerator script, so whenever we rerun that corpus it should get fixed. I'm not retroactively fixing it in ANNIS/GH, since that would constitute a new version, but we can aim to re-do the Bible corpora with the newest NLP in an upcoming release.