HTR-United / schema


Transcription guidelines field: have a controlled vocabulary of values #5

Open PonteIneptique opened 2 years ago

PonteIneptique commented 2 years ago

This one has been in my head for quite a long time.

Right now, we have free text, which means it is not machine-actionable. I'd like the ability to populate a list of acceptable values, such as Resolved Abbreviation, Unresolved Abbreviation, Corrected Spelling, Original Spelling, Special Character used (e.g. MUFI), things like that. It would go alongside the transcription guidelines but would make the whole thing a little more machine-actionable.
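
For illustration only, a minimal sketch of how such values could sit next to the free-text field in a catalog entry; the field names below are assumptions, not part of the current schema:

```yaml
# Hypothetical sketch, not the actual HTR-United schema.
transcription-guidelines: "https://example.org/guidelines.pdf"  # existing free-text/link field
transcription-practices:        # hypothetical controlled-vocabulary companion
  - Resolved Abbreviation
  - Corrected Spelling
  - Special Character used (e.g. MUFI)
```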

gabays commented 2 years ago

True. This is complicated though. What people understand as "original" varies a lot: some people consider the dissimilation of i/j as faithful to the original, some don't… "corrected" is somewhat ambiguous though, because you can:

  • correct a mistake in the document while transcribing
  • "correct" the spelling, which is more a "linguistic normalisation" to me

Abbreviations and special characters make more sense to me because they are more factual.

PonteIneptique commented 2 years ago

I think we can have many values, with descriptions, and Dissimilation of i/j, for example, is one. We "just" need to make a list; this is a first step.

gabays commented 2 years ago

It might be a rabbit hole… i/j, but also u/v, but then what do you do with s/ſ, etc.?

PonteIneptique commented 2 years ago

I'd actually cover those. u/v and i/j generally go together; s/ſ less so. So I'd have an entry for these specifically.

PonteIneptique commented 2 years ago

The idea is to make the catalog as machine-actionable as possible. If it means somewhat fine-grained values, then we go for it.

PonteIneptique commented 2 years ago

So, I'd keep s/ſ as a single category, because it's something a "lot" of people are talking about, and it's a known problem in modern OCR systems, unlike m/n. You would then have:

I'd actually add the option Ligatures kept, which spans multiple scripts AFAIK.
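
Collecting the candidates mentioned so far, the vocabulary could be sketched as a simple enumeration (labels are provisional and only reflect this thread):

```yaml
# Provisional enum of the values discussed so far in this thread.
enum:
  - Resolved Abbreviation
  - Unresolved Abbreviation
  - Distinction of u/v & i/j kept
  - Distinction of s/ſ kept
  - Ligatures kept
  - Special Character used (e.g. MUFI)
```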

gabays commented 2 years ago

Corrected spelling / Original spelling / Normalized spelling? No: the original spelling or the normalised spelling can both be corrected. I would go for:

PonteIneptique commented 2 years ago

I would prefer a long answer with a more descriptive thingie (like the parenthesis I wrote up there, you can give examples :D )

gabays commented 2 years ago

I don't get what it would look like at the end with the parenthesis.

PonteIneptique commented 2 years ago

So, could you elaborate on this one, maybe with an example? I know I am tiresome :)

original/modified (i.e. as diplomatic as possible or editorial interventions)

gabays commented 2 years ago

original/modified: for instance:

  1. il arriuoit dans l'vniuers Parisien -> original
  2. il arrivoit dans l'univers Parisien -> semi-diplo=modified
  3. il arrivait dans l'univers parisien -> full normalisation=modified
gabays commented 2 years ago

corrected/not corrected (imagining that the text gives the reading arriuoiz for arriuoit, which is an obvious mistake)

  1. il arriuoiz dans l'vniuers Parisien -> not corrected (and original)
  2. il arrivoiz dans l'univers Parisien -> not corrected (and modified)
  3. il arrivait dans l'univers parisien -> corrected (and modified)
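
Treated as two independent axes, the examples above could be encoded like this (a sketch with illustrative names only, not an agreed format):

```yaml
# Hypothetical encoding of example 2 above:
# "il arrivoiz dans l'univers Parisien"
spelling:
  state: modified        # u/v and i/j have been dissimilated
  corrected: false       # the faulty reading arriuoiz is kept
```
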
gabays commented 2 years ago

With original/modified I try to bypass the whole complicated terminology about transcription and simplify the problem to: "what did you try to do?" Be more faithful, or more interpretative? The answer would not have to be script-specific.

PonteIneptique commented 2 years ago

That's clearer; now I need other people to weigh in :)

PonteIneptique commented 2 years ago

@alix-tz What do you think ?

alix-tz commented 2 years ago

Since we will probably not get it right on the first attempt, I think it could be interesting to create an escape option, so that people who did something different from plainly copying the text can signal it if the options we propose don't include what they did. I've always been told that when you write an annotation guide, you should have a joker token so as not to lose trace of unclear situations.


Otherwise, would you consider adding something about punctuation specifically? Or would you consider that it goes under the normalized vs. original categories?


I'm also thinking of projects which include printed and manuscript texts: sometimes a single character is used to signal "printed" when the transcriber only wanted to focus on the handwritten sections. Should we have an entry to designate that?

EDIT: let me clarify here:

Say you have: [image: a snippet mixing a printed line and a handwritten line]

Case A would transcribe

- out about
- found those listed

Case B would transcribe (replace x by anything):

- xxx xxxx
- found those listed

or

- out about
- x x x

Case C could be

- (not even segmented)
- foud out about those listed

What about rare glyphs, like glyphs which would be specific to a writer? Situations could be:

EDIT: let me clarify with an example here too:

Say you have (sorry for the example): [image: a manuscript line containing a writer-specific glyph]

Case A: "M. Machin est au centre de cette société."

Case B: "M. Machin est au X de cette X." (X every time there's a glyph, be it this one or another one)

Case C: "M. Machin est au Ꙩ de cette σé" (an attempt to find a character for every glyph)

alix-tz commented 2 years ago

For the distinction "original / modified" and "corrected / not corrected", indeed, it's probably best to keep it simple, so I agree with the last proposition.

It might already be complicated in some cases to find where the frontier between not corrected and corrected lies.

PonteIneptique commented 2 years ago

Hi @alix-tz,

Just to be clear, regarding

I think it could be interesting to create an escapable option in order for people who did something different than plainly copying the text to be able to signal it,

I think this list of values would be optional, not enforced. The original free-text transcription guidelines value would remain the main field. This is just to allow, down the road, a more machine-actionable way to treat transcription guidelines.
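
In schema terms, "optional and non-enforced" could simply mean the new field is absent from the required list; a minimal sketch (written as YAML, with assumed names, not the actual schema):

```yaml
# Sketch only: field names and structure are assumptions.
properties:
  transcription-guidelines:
    type: string            # remains the main, free-text field
  transcription-practices:  # optional companion, deliberately not in "required"
    type: array
    items:
      type: string
      enum: ["Resolved Abbreviation", "Distinction of u/v & i/j kept", "other"]
required:
  - transcription-guidelines
```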

alix-tz commented 2 years ago

I understood it as optional indeed. Maybe I'm making it more complicated than necessary, so we can drop this aspect of my remark and go back to it later if relevant, once we have settled our list. But to be clear, I'm thinking of possible cases where the transcription differs from the original but where the producer would consider that none of the options corresponds to a certain aspect of what they did. Then, when you parse this information, you might mistakenly classify a dataset as "close to the original" (or something like this) because it didn't check any boxes suggesting otherwise.

gabays commented 2 years ago

I agree with Alix: punctuation and spelling are two different things -- especially for medievalists.

dstoekl commented 2 years ago

We use, e.g., Leyden plus–style annotation for inline additions and deletions, and an aleph-lamed ligature instead of a single aleph + lamed if it is ligatured in the original. Now, with annotations in eScriptorium, this might get extended to other letter combinations via tags. I think it would also be useful to give the character set and its distribution.
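
If the character set and its distribution were recorded too, a catalog entry might carry something like this (field names and structure are my assumption, not an existing feature):

```yaml
# Hypothetical: character inventory with frequencies, as suggested above.
characters:
  normalization: NFD       # assumed: Unicode normalization used for counting
  inventory:
    - char: "ﭏ"            # aleph-lamed ligature, kept when ligatured
      count: 312
    - char: "א"
      count: 10874
```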

PonteIneptique commented 1 year ago

We still need some values for these cases. The thing is, we can add as many as we want, as long as it is helpful. Distinction of u/v & i/j, for example, can be a value.


Boenig commented 11 months ago

Hello,

Thibault pointed out your discussion. I have not yet fully worked through it, but I would like to contribute some thoughts and point out solutions. From my point of view, the problem presents itself on several levels. Level 1: a lot of GT datasets from the same source can differ, which is a concern for me as well as for our project. Level 2: in what way can existing digital texts, editions… also be converted into GT? To what extent does the digital text differ from the original?

As I have taken from the discussion, two things matter:

  1. the characters, glyphs…
  2. the structures, which I have not yet found covered: e.g. with this template [image of a page layout], is it DefaultLine, or DropCapitalLine + DefaultLine?

On the OCR-D side we have defined a level system, which is specified both in general and in detail in the OCR-D GT Guidelines: https://ocr-d.de/en/gt-guidelines/trans/trLevels.html.

Regarding the characters: corresponding level-related rule sets are generated based on the guidelines, see: https://github.com/tboenig/gt-guidelines/tree/gh-pages/rules
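
Pointing a dataset at one of those externally defined OCR-D levels could itself be a single controlled value, instead of (or alongside) individual flags; a sketch with assumed field names:

```yaml
# Sketch: referencing an external level system rather than listing every practice.
transcription-level:
  scheme: OCR-D GT Guidelines    # https://ocr-d.de/en/gt-guidelines/trans/trLevels.html
  level: 2                       # one of the levels defined there
```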