HTR-United / schema


Transcription guidelines field: have a controlled vocabulary of values #5

Open PonteIneptique opened 2 years ago

PonteIneptique commented 2 years ago

This one has been in my head for quite a long time.

Right now, we have free text, which means it is not machine-actionable. I'd like the ability to populate a list of acceptable values, such as Resolved Abbreviation, Unresolved Abbreviation, Corrected Spelling, Original Spelling, Special Character used (e.g. MUFI), things like that. It would go alongside the transcription guidelines but would make the whole thing a little more machine-actionable.
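
For illustration only, a minimal sketch of how such values could sit next to the free-text field in a catalog entry; the field names below are assumptions, not part of the current schema:

```yaml
# Hypothetical sketch, not the actual HTR-United schema.
transcription-guidelines: "https://example.org/guidelines.pdf"  # existing free-text/link field
transcription-practices:        # hypothetical controlled-vocabulary companion
  - Resolved Abbreviation
  - Corrected Spelling
  - Special Character used (e.g. MUFI)
```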

gabays commented 2 years ago

True. This is complicated though. What people understand as "original" varies a lot: some people consider the dissimilation of i/j as faithful to the original, some don't… "corrected" is somewhat ambiguous though, because you can:

  • correct a mistake in the document while transcribing
  • "correct" the spelling, which is more a "linguistic normalisation" to me

Abbreviations and special characters make more sense to me because they are more factual.

PonteIneptique commented 2 years ago

I think we can have many values, with descriptions, and Dissimilation of i/j, for example, is one. We "just" need to make a list; this is a first step.

gabays commented 2 years ago

It might be a rabbit hole… i/j, but also u/v, but then what do you do with s/ſ, etc.?

PonteIneptique commented 2 years ago

I'd actually cover those. u/v and i/j generally go together; s/ſ less so. So I'd have an entry for these specifically.

PonteIneptique commented 2 years ago

The idea is to make the catalog as machine-actionable as possible. If it means somewhat fine-grained values, then we go for it.

PonteIneptique commented 2 years ago

So, I'd keep s/ſ as a single category, because it's something a "lot" of people are talking about, and it's a known problem in modern OCR systems, unlike m/n. You would then have:

I'd actually add the option Ligatures kept, which spans multiple scripts AFAIK.
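
Collecting the candidates mentioned so far, the vocabulary could be sketched as a simple enumeration (labels are provisional and only reflect this thread):

```yaml
# Provisional enum of the values discussed so far in this thread.
enum:
  - Resolved Abbreviation
  - Unresolved Abbreviation
  - Distinction of u/v & i/j kept
  - Distinction of s/ſ kept
  - Ligatures kept
  - Special Character used (e.g. MUFI)
```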

gabays commented 2 years ago

Corrected spelling / Original spelling / Normalized spelling? No: the original spelling or the normalised spelling can both be corrected. I would go for:

PonteIneptique commented 2 years ago

I would prefer a long answer with a more descriptive thingie (like the parenthesis I wrote up there, you can give examples :D )

gabays commented 2 years ago

I don't get what it would look like at the end with the parenthesis.

PonteIneptique commented 2 years ago

So, could you elaborate on this one, maybe with an example? I know I am tiresome :)

original/modified (i.e. as diplomatic as possible or editorial interventions)

gabays commented 2 years ago

original/modified: for instance:

  1. il arriuoit dans l'vniuers Parisien -> original
  2. il arrivoit dans l'univers Parisien -> semi-diplo=modified
  3. il arrivait dans l'univers parisien -> full normalisation=modified
gabays commented 2 years ago

corrected/not corrected (imagining that the text gives the reading arriuoiz for arriuoit, which is an obvious mistake)

  1. il arriuoiz dans l'vniuers Parisien -> not corrected (and original)
  2. il arrivoiz dans l'univers Parisien -> not corrected (and modified)
  3. il arrivait dans l'univers parisien -> corrected (and modified)
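
Treated as two independent axes, the examples above could be encoded like this (a sketch with illustrative names only, not an agreed format):

```yaml
# Hypothetical encoding of example 2 above:
# "il arrivoiz dans l'univers Parisien"
spelling:
  state: modified        # u/v and i/j have been dissimilated
  corrected: false       # the faulty reading arriuoiz is kept
```
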
gabays commented 2 years ago

With original/modified I try to bypass the whole complicated terminology about transcription and simplify the problem to: "what did you try to do?" Be more faithful, or more interpretative? The answer would not have to be script-specific.

PonteIneptique commented 2 years ago

That's clearer; now I need other people to weigh in :)

PonteIneptique commented 2 years ago

@alix-tz What do you think ?

alix-tz commented 2 years ago

Since we will probably not get it right on the first attempt, I think it could be interesting to create an escape option, so that people who did something different from plainly copying the text can signal it if the options we propose don't include what they did. I've always been told that when you write an annotation guide, you should have a joker token so as not to lose trace of unclear situations.


Otherwise, would you consider adding something about punctuation specifically? Or would you consider that it goes under the normalized vs. original categories?


I'm also thinking of projects which include printed and manuscript texts: sometimes a single character is used to signal "printed" when the transcriber only wanted to focus on the handwritten sections. Should we have an entry to designate that?

EDIT: let me clarify here:

Say you have: [image: a snippet mixing a printed line and a handwritten line]

Case A would transcribe

- out about
- found those listed

Case B would transcribe (replace x by anything):

- xxx xxxx
- found those listed

or

- out about
- x x x

Case C could be

- (not even segmented)
- foud out about those listed

What about rare glyphs, like glyphs which would be specific to a writer? Situations could be:

EDIT: let me clarify with an example here too:

Say you have (sorry for the example): [image: a manuscript line containing a writer-specific glyph]

Case A: "M. Machin est au centre de cette société."

Case B: "M. Machin est au X de cette X." (X every time there's a glyph, be it this one or another one)

Case C: "M. Machin est au Ꙩ de cette σé" (an attempt to find a character for every glyph)

alix-tz commented 2 years ago

For the distinction "original / modified" and "corrected / not corrected", indeed, it's probably best to keep it simple, so I agree with the last proposition.

It might already be complicated in some cases to find where the frontier between not corrected and corrected lies.

PonteIneptique commented 2 years ago

Hi @alix-tz,

Just to be clear, regarding

I think it could be interesting to create an escapable option in order for people who did something different than plainly copying the text to be able to signal it,

I think this list of values would be optional, not enforced. The original free-text transcription guidelines value would remain the main field. This is just to allow, down the road, a more machine-actionable way to treat transcription guidelines.
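
In schema terms, "optional and non-enforced" could simply mean the new field is absent from the required list; a minimal sketch (written as YAML, with assumed names, not the actual schema):

```yaml
# Sketch only: field names and structure are assumptions.
properties:
  transcription-guidelines:
    type: string            # remains the main, free-text field
  transcription-practices:  # optional companion, deliberately not in "required"
    type: array
    items:
      type: string
      enum: ["Resolved Abbreviation", "Distinction of u/v & i/j kept", "other"]
required:
  - transcription-guidelines
```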

alix-tz commented 2 years ago

I understood it as optional indeed. Maybe I'm making it more complicated than necessary, so we can drop this aspect of my remark and go back to it later if relevant, once we have settled our list. But to be clear, I'm thinking of possible cases where the transcription differs from the original but where the producer would consider that none of the options corresponds to a certain aspect of what they did. Then, when you parse this information, you might mistakenly classify a dataset as "close to the original" (or something like this) because it didn't check any boxes suggesting otherwise.

gabays commented 2 years ago

I agree with Alix: punctuation and spelling are two different things -- especially for medievalists.

dstoekl commented 2 years ago

We use, e.g., Leyden plus–style annotation for inline additions and deletions, and an aleph-lamed ligature instead of a single aleph + lamed if it is ligatured in the original. Now, with annotations in eScriptorium, this might get extended to other letter combinations via tags. I think it would also be useful to give the character set and its distribution.
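
If the character set and its distribution were recorded too, a catalog entry might carry something like this (field names and structure are my assumption, not an existing feature):

```yaml
# Hypothetical: character inventory with frequencies, as suggested above.
characters:
  normalization: NFD       # assumed: Unicode normalization used for counting
  inventory:
    - char: "ﭏ"            # aleph-lamed ligature, kept when ligatured
      count: 312
    - char: "א"
      count: 10874
```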

PonteIneptique commented 1 year ago

We still need some values for these cases. The thing is, we can add as many as we want, as long as it is helpful. Distinction of u/v & i/j, for example, can be a value.


Boenig commented 11 months ago

Hello,

Thibault pointed out your discussion. I have not yet fully worked through it, but I would like to contribute some thoughts and point out solutions. From my point of view, the problem presents itself on several levels. Level 1: a lot of GT datasets from the same source can differ, which is a concern for me as well as for our project. Level 2: in what way can existing digital texts, editions… also be converted into GT? To what extent does the digital text differ from the original?

As I have taken from the discussion, two things matter:

  1. the characters, glyphs…
  2. the structures, which I have not yet found covered: e.g. with this template [image of a page layout], is it DefaultLine, or DropCapitalLine + DefaultLine?

On the OCR-D side we have defined a level system, which is specified both in general and in detail in the OCR-D GT Guidelines: https://ocr-d.de/en/gt-guidelines/trans/trLevels.html.

Regarding the characters: corresponding level-related rule sets are generated based on the guidelines, see: https://github.com/tboenig/gt-guidelines/tree/gh-pages/rules
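
Pointing a dataset at one of those externally defined OCR-D levels could itself be a single controlled value, instead of (or alongside) individual flags; a sketch with assumed field names:

```yaml
# Sketch: referencing an external level system rather than listing every practice.
transcription-level:
  scheme: OCR-D GT Guidelines    # https://ocr-d.de/en/gt-guidelines/trans/trLevels.html
  level: 2                       # one of the levels defined there
```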