TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org

Clarify how to encode redacted/censored text #2421

Open gjvnq opened 1 year ago

gjvnq commented 1 year ago

I get the feeling that <gap> is the best element for redacted documents like this one. However, the documentation isn't very clear on this, and <del> also feels like a good contender.

I feel that the documentation for <gap> should be updated to clarify that it's the recommended way to encode redactions/censorship, along with a proposed value for @reason, perhaps blackout, redaction, or censorship. By clarifying the documentation I mean adding something like the text in bold below:

(gap) indicates a point where material has been omitted in a transcription, whether for editorial reasons described in the TEI header, as part of sampling practice, or because the material is illegible, **redacted, censored,** invisible, or inaudible.

Alternatively, I guess a new <redacted> or <censor> element could be added. The advantage of such an approach is that once the full document is released, the text behind the blackouts can simply be included inside the <censor> tags, which I don't think fits well with the <gap> tag.

hcayless commented 1 year ago

I think this is a good idea. <gap> is probably a simpler choice than <del>, especially since you'd want to use it inside the deletion anyway if the text is illegible.

joeytakeda commented 1 year ago

Thanks @gjvnq — I definitely agree that this should be clarified. On the GL page for gap, it notes:

The gap tag simply signals the editors decision to omit or inability to transcribe a span of text. Other information, such as the interpretation that text was deliberately erased or covered, should be indicated using the relevant tags, such as del in the case of deliberate deletion.

So I'd be inclined to encode the example like so:

<del><gap reason="blackout" extent="multiple sentences"/></del>

or, if you wanted to flag the different types of deletion:

<del type="redaction"><gap reason="blackout" extent="multiple sentences"/></del>

But having some good examples of these kinds of redactions (and their relationship to ellipsis) in the GL and clarifying practices seems like a good idea to me.
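As a rough illustration of how the del > gap encoding above could be consumed downstream, here is a minimal Python sketch that lists the redactions in a fragment. The sample paragraph and the function name list_redactions are made up for this example, and the TEI namespace is left out to match the snippets in this thread; a real TEI file would need namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text. The TEI namespace is omitted to match the
# snippets quoted in this thread; real TEI documents declare it.
SAMPLE = (
    '<p>The witness stated that '
    '<del type="redaction">'
    '<gap reason="blackout" extent="multiple sentences"/>'
    '</del>'
    ' and then left.</p>'
)


def list_redactions(xml_text):
    """Return (reason, extent) for every <gap> that is a child of a <del>."""
    root = ET.fromstring(xml_text)
    found = []
    for deletion in root.iter("del"):
        # Only direct children are checked, mirroring the del > gap pattern.
        for gap in deletion.findall("gap"):
            found.append((gap.get("reason"), gap.get("extent")))
    return found


print(list_redactions(SAMPLE))
```

Because the interpretation (a deliberate deletion) lives on <del> and the omission itself lives on <gap>, a script like this can tell redactions apart from ordinary illegible gaps just by checking the parent element.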

gjvnq commented 1 year ago

This del > gap structure does seem like the best way forward without having to introduce a new tag.

Another question: should the <gap> element include text content made up of Unicode block elements (e.g. U+2588 FULL BLOCK)?

This has the advantage of making file conversions and text extraction easier but it might go against the current TEI guidelines.

Example:

<p>In this case, the defendant John Doe (real name: <del type="redaction"><gap>████████████████████</gap></del>) claimed that ....</p>

We could also use U+2592 for rendering illegible gaps.
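One possible alternative, if putting text inside <gap> turns out to conflict with the Guidelines, is to leave <gap> empty and generate the block characters at conversion time. The Python sketch below is only an illustration under stated assumptions: the function name extract_text, the default width, and the convention of reading the number of characters from @quantity are choices made for this example, and the TEI namespace is again omitted.

```python
import xml.etree.ElementTree as ET

FULL_BLOCK = "\u2588"    # proposed above for redactions
MEDIUM_SHADE = "\u2592"  # proposed above for illegible gaps

# Hypothetical sample; the namespace is omitted to match this thread.
SAMPLE = (
    '<p>Real name: <del type="redaction">'
    '<gap reason="blackout" quantity="5" unit="chars"/></del>'
    ', then <gap reason="illegible"/>.</p>'
)


def extract_text(elem, width=8, inside_del=False):
    """Flatten an element to plain text, drawing each <gap> as blocks.

    Assumption for this sketch: a <gap> inside a <del> is a redaction and
    any other <gap> is an illegible passage. @quantity, when present, is
    read as a character count; otherwise `width` is used.
    """
    if elem.tag == "gap":
        char = FULL_BLOCK if inside_del else MEDIUM_SHADE
        out = char * int(elem.get("quantity", width))
    else:
        out = elem.text or ""
        for child in elem:
            out += extract_text(child, width, inside_del or elem.tag == "del")
    # The tail is the text that follows this element inside its parent.
    return out + (elem.tail or "")


print(extract_text(ET.fromstring(SAMPLE), width=3))
```

This keeps the encoded document free of placeholder characters while still giving plain-text exports a visible mark for each omission.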