PonteIneptique opened this issue 2 years ago
True. This is complicated, though. What people understand as "original" varies a lot: some people consider the dissimilation of i/j as faithful to the original, some don't… Corrected is somewhat ambiguous though, because you can:
- correct a mistake in the document while transcribing
- "correct" the spelling, which is more of a "linguistic normalisation" to me

Abbreviations and special characters make more sense to me because they are more factual.
I think we can have many values, with descriptions, and Dissimilation of i/j is one example. We "just" need to make a list; this is a first step.
It might be a rabbit hole… i/j, but also u/v, but then what do you do with s/ſ, etc.?
I'd actually cover those. u/v and i/j generally go together, s/ſ less so. So I'd have an entry for these specifically.
The idea is to make the catalog as machine actionable as possible. If it means somewhat fine-grained values, then we go for it.
So, I'd keep s/ſ as a single category, because it's something a "lot" of people are talking about, and it's a known problem in modern OCR systems, unlike m/n. So you would have:
I'd actually add the option Ligatures kept, which spans multiple scripts AFAIK.
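To illustrate why fine-grained values like these matter for machine actionability, here is a minimal Python sketch (the function name and character mappings are my own illustration, nothing defined by the schema) of how a tool could collapse exactly these distinctions when comparing transcriptions made under different guidelines:

```python
import unicodedata

# Illustrative mapping only: undo the i/j and u/v dissimilation discussed
# above. Long s (ſ) is included for clarity, though NFKC already folds it.
GRAPHEMIC_MAP = str.maketrans({
    "ſ": "s",   # long s -> round s
    "v": "u",   # undo u/v dissimilation
    "j": "i",   # undo i/j dissimilation
})

def collapse_distinctions(text: str) -> str:
    """Reduce a transcription to a least-common-denominator form."""
    # NFKC decomposes compatibility ligatures, e.g. the single glyph 'ﬁ' -> 'fi'
    text = unicodedata.normalize("NFKC", text)
    return text.translate(GRAPHEMIC_MAP)

print(collapse_distinctions("vn ﬁls"))  # un fils
```

Only lowercase letters are handled here; a real comparison tool would need to decide case-by-case which categories from the controlled list it can safely neutralize.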
Corrected spelling / Original spelling / Normalized spelling? No: the original spelling and the normalised spelling can both be corrected. I would go for:
I would prefer a long answer with a more descriptive thingie (like the parenthesis I wrote up there, you can give examples :D )
I don't get what it would look like in the end with the parenthesis.
So, could you elaborate on this one, maybe with an example? I know I am tiresome :)
original/modified (i.e. as diplomatic as possible or editorial interventions)
original/modified: for instance:
corrected/not corrected (imagining that the text gives the reading arriuoiz for arriuoit, which is an obvious mistake)
With original/modified I try to bypass the whole complicated terminology about transcription and simplify the problem to: "what did you try to do?" Be more faithful, or more interpretative? The answer would not have to be super strict.
That's clearer, now I need other people to weigh in :)
@alix-tz What do you think ?
Since we will probably not get it right on the first attempt, I think it could be interesting to create an escapable option, so that people who did something different from plainly copying the text are able to signal it if the options we propose don't include what they did. I've always been told that when you write an annotation guide, you should have a joker token in order not to lose trace of unclear situations.
Otherwise, would you consider adding something about punctuation specifically? Or would you consider that it goes into the normalized vs. original categories?
I'm also thinking of projects which include printed and manuscript texts: sometimes a single character is used to signal "printed" when the transcriber only wanted to focus on the handwritten sections. Should we have an entry to designate that?
EDIT: let me clarify here:
Say you have :
Case A would transcribe
- out about
- found those listed
Case B would transcribe (replace x by anything):
- xxx xxxx
- found those listed
or
- out about
- x x x
Case C could be
- (not even segmented)
- foud out about those listed
What about rare glyphs, like glyphs which would be specific to a writer? Situations could be:
EDIT: let me clarify with an example here too:
Say you have (sorry for the example):
Case A: "M. Machin est au centre de cette société."
Case B: "M. Machin est au X de cette X." (X every time there's a glyph, be it this one or another one)
Case C: "M. Machin est au Ꙩ de cette σé" (an attempt to find a character for every glyph)
For the distinction "original / modified" and "corrected / not corrected", indeed, it's probably best to keep it simple, so I agree with the last proposition.
It might already be complicated in some cases to find where the boundary between not corrected and corrected lies.
Hi @alix-tz,
Just to be clear, regarding
I think it could be interesting to create an escapable option in order for people who did something different than plainly copying the text to be able to signal it,
I think this list of values would be optional, not enforceable. The original free-text transcription field would remain the main field. This is just to allow, down the road, a more machine-actionable way to treat transcription guidelines.
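A minimal sketch of how that could work, assuming hypothetical field names and values (not the actual schema): the controlled list is checked leniently, so unknown values are reported rather than rejected, in the spirit of the "joker token" idea:

```python
# Hypothetical controlled vocabulary; identifiers are illustrative only.
KNOWN_VALUES = {
    "resolved-abbreviation", "unresolved-abbreviation",
    "corrected-spelling", "original-spelling",
    "special-characters", "ligatures-kept",
}

def check_transcription_values(record: dict) -> list:
    """Return the values not found in the controlled list.

    The free-text guidelines field stays authoritative; unknown values
    are merely reported, never rejected.
    """
    return [v for v in record.get("transcription-values", [])
            if v not in KNOWN_VALUES]

record = {
    "transcription-guidelines": "Abbreviations resolved, u/v kept as is…",
    "transcription-values": ["resolved-abbreviation", "my-custom-practice"],
}
print(check_transcription_values(record))  # ['my-custom-practice']
```

This way a parser can still flag "this dataset declares a practice we don't know about" instead of silently misclassifying it.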
I understood it as optional indeed. Maybe I'm making it more complicated than necessary, so we can drop this aspect of my remark and eventually come back to it if relevant once we have settled our list. But to be clear, I'm thinking of possible cases where the transcription differs from the original, but where the producer would consider that none of the options corresponds to a certain aspect of what they did. Then, when you parse this information, you might mistakenly classify a dataset as "close to the original" (or something like this) because it didn't check the boxes suggesting otherwise.
I agree with Alix: punctuation and spelling are two different things -- especially for medievalists.
We use, e.g., Leiden+ and similar annotation for inline additions and deletions, and an aleph-lamed ligature instead of a separate aleph and lamed if it is ligatured in the original. Now, with annotations in eScriptorium, this might get extended to other letter combinations via tags. I think it would also be useful to give the character set and its distribution.
We still need some values for these cases. The thing is, we can add as many as we want, as long as it is helpful. The distinction of u/v & i/j, for example, can be a value.
Hello,
Thibault pointed out your discussion to me. I have not yet fully worked through it, but I would like to contribute some thoughts and point out solutions. From my point of view, the problem presents itself on several levels. Level 1: the problem that many GT datasets from the same source can differ is a concern for me as well as for our project. Level 2: in what way can existing digital texts, editions, etc. also be converted into GT? To what extent is the digital text different from the original?
As I have gathered from the discussion:
DefaultLine, or DropCapitalLine, DefaultLine?
From the OCR-D side, we have defined a level system, which is defined both in general and in particular in the OCR-D GT Guidelines: https://ocr-d.de/en/gt-guidelines/trans/trLevels.html.
Regarding the characters: corresponding level-related rule sets are generated based on the guidelines; see https://github.com/tboenig/gt-guidelines/tree/gh-pages/rules
This one has been in my head for quite a long time.
Right now, we have free text, which means it is not machine actionable. I'd like to have the ability to populate a list of acceptable values, such as Resolved Abbreviation, Unresolved Abbreviation, Corrected Spelling, Original Spelling, Special Character Used (e.g. MUFI), things like that. It would go alongside the transcription guidelines but would make the whole thing a little more machine actionable.
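As a sketch of what such a closed list could look like as a value-to-description mapping, following the "many values, with descriptions" idea (the identifiers and wording are illustrative, not a finalized vocabulary):

```python
# Illustrative controlled vocabulary with human-readable descriptions.
TRANSCRIPTION_VALUES = {
    "resolved-abbreviation": "Abbreviations are silently expanded.",
    "unresolved-abbreviation": "Abbreviation marks are kept as written.",
    "corrected-spelling": "Obvious mistakes in the source are fixed.",
    "original-spelling": "The spelling of the source is kept.",
    "special-characters": "Special characters (e.g. MUFI) are used.",
    "ligatures-kept": "Ligatures are transcribed as single glyphs.",
}

def describe(value: str) -> str:
    """Human-readable description for a controlled value, if known."""
    return TRANSCRIPTION_VALUES.get(value, "unknown value")

print(describe("resolved-abbreviation"))
```

Keeping descriptions next to the machine-readable identifiers would let both the catalog UI and downstream tools use the same single list.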