CambridgeSemiticsLab / nena_corpus

The NENA corpus in plain-text markup
Creative Commons Attribution 4.0 International

Standardize comments in standard text markup #4

Closed hvlaardingerbroek closed 4 years ago

hvlaardingerbroek commented 5 years ago

The NENA texts contain some comments in round or square brackets. Round brackets are also used to indicate line numbers.

The following comments are attested:

- General remark: (interruption) in Urmi_C A3 'Axiqar', line 28
- Introducing another speaker: (GK: ... ) 32 times in Urmi_C, with some variations: brackets can be round or square (mostly square), and once the colon is missing.

Should we decide on one standard way to encode such comments?

I suggest we use square brackets for all comments, with a special colon notation to introduce a different speaker (typically the interviewer). The restriction is that no spaces may occur between the opening bracket and the colon, and the colon is followed by a space. Special emphasis markers are not needed, since the syntax is already unambiguous. For example:

[interruption]
[GK: ...]
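
To make the proposed syntax concrete, here is a minimal Python sketch of a checker for it. The names `SPEAKER_COMMENT`, `GENERAL_REMARK` and `parse_comment` are hypothetical and not part of the repository, and the rule that a general remark contains no colon is my own assumption, added only to keep the two forms apart.

```python
import re

# Speaker comment: "[", initials without spaces, ": ", text, "]".
SPEAKER_COMMENT = re.compile(r"\[(?P<speaker>[^\s:\[\]]+): (?P<text>[^\[\]]+)\]")

# General remark: "[", text, "]". Assumption: the text contains no colon.
GENERAL_REMARK = re.compile(r"\[(?P<text>[^:\[\]]+)\]")


def parse_comment(token):
    """Return (speaker, text) for a well-formed comment, else None.

    speaker is None for a general remark.
    """
    match = SPEAKER_COMMENT.fullmatch(token)
    if match:
        return match.group("speaker"), match.group("text")
    match = GENERAL_REMARK.fullmatch(token)
    if match:
        return None, match.group("text")
    return None
```

Under these assumptions `[interruption]` and `[GK: ...]` are accepted, while `(GK: ...)`, `[GK:...]` and `[G K: ...]` are rejected.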

codykingham commented 4 years ago

Solved by changing the string replacement to re.sub and using capture groups to modify parentheses to square brackets.

I think we should consider this a conversion issue rather than a source text issue, since it is not really a mistake in the original text, but only a variation. When we convert to .nena format, we work to funnel variations into single slots. This is a different situation compared to, e.g., missing line numbers in a file.

One exception I made to this logic is adding the missing colon to the comment in Urmi C, Village Life (6):

(45) (*GK* vàrdə?)

It might be argued that this case is indeed a mistake. However, for the sake of simplicity, I decided to treat it as another variation.

In standardizing the comments, I have also removed the unnecessary emphasis on the initials. So the example above now reads:

(45) [GK: vàrdə?]
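
For illustration, the approach described above (re.sub with capture groups) could look roughly like the sketch below. This is not the actual conversion code from the repository. The pattern names and `standardize_comments` are hypothetical, and the sketch assumes that speaker initials are two or more capital letters and that a round-bracket group containing only digits is a line number to be left alone.

```python
import re

# Speaker comment in round or square brackets, with optional *emphasis*
# around the initials and an optional colon after them.
# Assumption: initials are two or more capital letters.
SPEAKER_VARIANT = re.compile(
    r"[(\[]\*?([A-Z]{2,})\*?:?\s+([^()\[\]]*?)\s*[)\]]"
)

# Any other round-bracket comment, skipping line numbers such as "(45)".
ROUND_REMARK = re.compile(r"\((?!\d+\))([^()\[\]]+)\)")


def standardize_comments(line):
    """Funnel the attested comment variants into the square-bracket form."""
    # Capture groups keep the initials and the text; brackets, emphasis
    # markers and the colon are rewritten around them.
    line = SPEAKER_VARIANT.sub(r"[\1: \2]", line)
    line = ROUND_REMARK.sub(r"[\1]", line)
    return line


print(standardize_comments("(45) (*GK* vàrdə?)"))
```

With these patterns the line number `(45)` is left untouched, and the speaker comment is rewritten with square brackets, a colon, and no emphasis markers.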