CambridgeSemiticsLab / nena_corpus

The NENA corpus in plain-text markup
Creative Commons Attribution 4.0 International

Simplify and Agree on .nena formatting guidelines #7

Closed: codykingham closed this issue 4 years ago

codykingham commented 4 years ago

The .nena formatting guidelines are now a bit old and have moved past the draft stage. As @jamespstrachan builds the text input tool, we should think more carefully about what absolutely needs to go into the .nena format and what we should leave out as an unnecessary complication.

An example of a feature in the draft documentation that is probably superfluous is:

line breaks – marked with / and //. These were intended to preserve poetic indentations from .docx sources. But those features seem less relevant in relation to the new audio files.

A questionable example is comments – which are surrounded in brackets and marked with a speaker: (GK: text of interjection?). Do we want to keep this kind of data in the .nena format? @GeoffreyKhan is this kind of thing something you need to be able to do?
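If both features end up being dropped, existing texts would need those markers removed. Below is a minimal sketch of what that cleanup could look like. It assumes comments always take the form (INITIALS: text) with no nested parentheses, and that / and // never occur in a text for any other reason. The function name `strip_optional_markup` is hypothetical and is not part of any existing tool.

```python
import re

# Assumption: a comment is "(XX: some text)" with uppercase speaker initials
# and no nested parentheses inside it.
COMMENT_RE = re.compile(r"\([A-Z]+:[^()]*\)")

# Assumption: "/" and "//" only ever appear as line-break markers.
LINEBREAK_RE = re.compile(r"\s*/{1,2}\s*")


def strip_optional_markup(text: str) -> str:
    """Remove speaker comments and line-break markers, then tidy whitespace."""
    text = COMMENT_RE.sub(" ", text)
    text = LINEBREAK_RE.sub(" ", text)
    # Collapse any runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "first line / second line // (GK: text of interjection?) third"
    print(strip_optional_markup(sample))
```

Each feature has its own pattern, so one can be kept and the other dropped if that is what we agree on.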

Some things that should absolutely be kept include language markers. The suggested markup currently is, e.g., <E>Hello<E>. So maybe this should currently be done in the same way while inputting text? In .docx these values are normally indicated via superscript letters. @GeoffreyKhan would you be comfortable placing <> tags around such letters when you do your copying/pasting?
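To make the language-marker convention concrete, here is a rough sketch of how a reader of the format might split a line into language-tagged spans. It assumes the marker looks exactly like the example above, with the same tag opening and closing the span (<E>Hello<E>), and that markers do not nest. The function name `split_language_spans` and the default label "NENA" are hypothetical choices made for illustration only.

```python
import re
from typing import List, Tuple

# Assumption: a span is opened and closed by the identical tag, e.g. <E>Hello<E>.
LANG_RE = re.compile(r"<([A-Za-z]+)>(.*?)<\1>")


def split_language_spans(text: str, default: str = "NENA") -> List[Tuple[str, str]]:
    """Split text into (language, content) pairs; untagged text gets `default`."""
    spans: List[Tuple[str, str]] = []
    pos = 0
    for match in LANG_RE.finditer(text):
        if match.start() > pos:
            # Untagged text between the previous marker and this one.
            spans.append((default, text[pos:match.start()]))
        spans.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(text):
        # Untagged text after the last marker.
        spans.append((default, text[pos:]))
    return spans


if __name__ == "__main__":
    for lang, content in split_language_spans("abc <E>Hello<E> xyz"):
        print(lang, repr(content))
```

Reusing the same tag to close the span keeps typing to a minimum during copy/paste, but it means a missing tag silently changes where the span ends, so the input tool may want to check that tags come in pairs.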

GeoffreyKhan commented 4 years ago

A questionable example is comments – which are surrounded in brackets and marked with a speaker: (GK: text of interjection?). Do we want to keep this kind of data in the .nena format? @GeoffreyKhan is this kind of thing something you need to be able to do?

GK: These are not necessary for the database.

Some things that should absolutely be kept include language markers. The suggested markup currently is, e.g., <E>Hello<E>. So maybe this should currently be done in the same way while inputting text? In .docx these values are normally indicated via superscript letters. @GeoffreyKhan would you be comfortable placing <> tags around such letters when you do your copying/pasting?

GK: That would be fine.

thanks

Geoffrey

-- Geoffrey Khan Regius Professor of Hebrew University of Cambridge

Faculty of Asian and Middle Eastern Studies Sidgwick Avenue Cambridge CB3 9DA UK

codykingham commented 4 years ago

This has been implemented in https://github.com/CambridgeSemiticsLab/nena_corpus/blob/master/docs/nena_format.md

Also see /standards.

And it is now incorporated into a new parser.