BetaMasaheft / Documentation

Die Schriftkultur des christlichen Äthiopiens: Eine multimediale Forschungsumgebung (The Written Culture of Christian Ethiopia: A Multimedia Research Environment)

Transkribus transcriptions #1472

Closed PietroLiuzzo closed 3 years ago

PietroLiuzzo commented 4 years ago

@abausi @DenisNosnitsin1970 @thea-m @DariaElagina I would like to discuss the workflow for integrating transcriptions done in Transkribus into our data. From a preliminary conversation with @eu-genia, we think that the best and easiest approach would be the following; it applies, of course, only to transcriptions which can be made available, not to the material provided for the training. The training of the model continues in parallel: once more pages have been transcribed and corrected, the model will be retrained, which should yield better results. The current version performs well in most cases.

The list of manuscripts for transcription now includes:

After these, the Laurenziana, EMML1763 and BNF manuscripts are in the pipeline.

These will be aligned, the segmentation will be fixed, and they will be automatically transcribed, and corrected where a correct transcription has been provided. The TEI exported from Transkribus will then be transformed with an XSLT so that it is valid for our schema, and pasted into the manuscript file. This transformation will bring into the TEI file a large number of elements, including pb, cb and lb for each part of the transcription, plus a large section of surfaces with explicit links from each line, column and page to the corresponding regions of the transcribed images. That is quite a lot of useful information which we will be able to reuse for many different purposes.

Publishing this directly, as is our practice with everything, will make these texts immediately available and linked to their images, hopefully facilitating both the cataloguing work (no need to type all incipits, perhaps just correct what is there) and any other text-retrieval task, in BM and Dillmann. It does mean publishing quite a lot of uncorrected material, but that is how the research environment has worked until now, attracting contributions, in this case also in the form of corrections to the existing transcriptions.
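This is not the actual XSLT, but a minimal Python sketch of the kind of milestone stream the transformation described above produces: one pb per page, a cb per column, and an lb per line, each linked via @facs to its surface or zone. The wrapper element name and the id scheme are illustrative assumptions.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

def milestones_for_page(page_id, columns):
    """Build a pb/cb/lb milestone stream for one transcribed page.

    `columns` maps a column id to a list of (line_zone_id, text) pairs.
    The pb/cb/lb names and @facs attribute follow TEI; the <ab> wrapper
    and the id scheme are assumptions for illustration only.
    """
    ab = Element("ab")
    SubElement(ab, "pb", facs=f"#{page_id}")        # page break, linked to its surface
    for col_id, lines in columns.items():
        SubElement(ab, "cb", facs=f"#{col_id}")     # column break
        for zone_id, text in lines:
            lb = SubElement(ab, "lb", facs=f"#{zone_id}")  # line break, linked to its zone
            lb.tail = text                          # the transcribed line follows the milestone
    return ab

# Example: one page, one column, two lines (ids are placeholders).
page = milestones_for_page(
    "p1", {"p1_c1": [("p1_c1_l1", "first line"), ("p1_c1_l2", "second line")]}
)
print(tostring(page, encoding="unicode"))
```

The point of the sketch is only that every piece of text stays anchored, through @facs, to the image region it was read from.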

If there is anything you would like to comment on, add, or disagree with, I would very much appreciate it.

DenisNosnitsin1970 commented 4 years ago

Dear Pietro, that is all fine. I will send some more transcribed pages of D 781 and RQG-034 as soon as they are available. Have a good weekend, regards - DN


thea-m commented 4 years ago

Thank you! It is fine for me. I assume that there is no problem using the images of the EMML manuscripts that are online in this way (EMML4398 and EMIP1859 = EMML 6459). The images of EMIP 1859 were given to me by Steve and can themselves probably not be made public, but they are available online as EMML 6459 on vHMML (as a digitised microfilm, of course), even though our transcription was made from a different set of images. I do have personal problems with the <lb>s in the files (they decrease readability), but maybe you can help me with that separately.

abausi commented 4 years ago

I also agree; this seems to be a very good proposal. Note only that the BML MSS were proposed because this is a way to implement a cooperation with a library that has a small but important historical collection. Yet, as far as I know, only fragmentary transcriptions from this collection are available at the moment.

Ralph-Lee-UK commented 4 years ago

The THEOT project now has several transcriptions that have been checked thoroughly and are ready for this step. Maybe we could set up a trial so that we can be sure that future transcriptions are done correctly? We now have, for instance, several full transcriptions of Deuteronomy, as well as of several other OT books. @PietroLiuzzo perhaps we should have a conversation about this shortly?

PietroLiuzzo commented 4 years ago

There is now a PR to the Guidelines documenting this, and the pipeline is complete.

There is no way to scale up or speed up this work.

The transformation from the Transkribus TEI to the BM TEI is available here

The app now supports the visualization of such automatic transcriptions in gray. @xml:lang has been omitted so as not to activate language-related tools, which would be confusing on incorrect text; it can easily be added once the text is corrected.

Once enough new pages of transcription have been entered from the Qemer set, the model will be trained again and tested on ESgg011 and the following manuscripts from the list above.

ESap046 has already been transformed and sent back to @DenisNosnitsin1970 for checking and, where necessary, correcting. If it is acceptable, it can already be pushed to the DB (I can do that).

PietroLiuzzo commented 4 years ago

@thea-m <lb>s are very useful! I can only suggest, for reading in oXygen, switching to Author view, so that all that structural markup does not get in the way.

I would also like, in this context, to provide support for the use of locus without @facs. The rationale is: if there is a transcription, there are images; and if the transcription can be made available, why should the images not also be available? If we have images and a transcription, then the facsimile elements from Transkribus are all we need to reach the exact regions of the images required, so we do not need to add @facs and can use the other attributes already present in locus. Since this is not yet supported, it is not part of Guidelines PR108, but ideally we will add there: "if you have added a transcription, you do not need to use @facs in <locus>". So, how do we surface this in the app? At the moment we produce a popup which contains the correct range of images. We can keep this same behaviour in cases where a transcription is available. But we can do more: for targeted loci we can extract the correct line, provided the reference is given down to the line, e.g. in that popup, for markup like

<locus from="13va9" to="13va10">13va ls.9-10</locus>

I can get those two lines only, images and text, into the popup, with additional links to the text view. Is this OK? Would you wish for something else, or a different behaviour? Please share your ideas.
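For illustration, here is a hypothetical sketch (not the actual app code; the reference grammar is an assumption inferred from the example above) of how a from/to pair such as "13va9"–"13va10" could be resolved to individual lines:

```python
import re

# Assumed grammar for references like "13va9": folio number, recto/verso
# side, optional column letter, optional line number. This pattern is an
# illustration, not the app's actual parser.
LOCUS_RE = re.compile(r"^(?P<folio>\d+)(?P<side>[rv])(?P<column>[ab])?(?P<line>\d+)?$")

def parse_locus(ref):
    """Split a locus reference into folio, side, column and line.

    Column and line may be None when the reference targets a whole
    page or side, e.g. "13r".
    """
    m = LOCUS_RE.match(ref)
    if not m:
        raise ValueError(f"unrecognized locus reference: {ref!r}")
    parts = m.groupdict()
    parts["folio"] = int(parts["folio"])
    if parts["line"] is not None:
        parts["line"] = int(parts["line"])
    return parts

def line_range(from_ref, to_ref):
    """Expand the from/to attributes of <locus> into a list of line numbers."""
    a, b = parse_locus(from_ref), parse_locus(to_ref)
    if (a["folio"], a["side"], a["column"]) != (b["folio"], b["side"], b["column"]):
        # A range crossing columns or pages would need full images instead.
        raise ValueError("range spans more than one column")
    return list(range(a["line"], b["line"] + 1))

print(parse_locus("13va9"))
print(line_range("13va9", "13va10"))
```

With the lines identified this way, the matching Transkribus zones (and their image regions) can be looked up for the popup.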

PietroLiuzzo commented 4 years ago
[Screenshot, 2020-09-18, 12:06:07]

This is more or less what a first draft will look like, with images picked up from the alignment instead of using @facs. I am not yet sure what to do with large ranges spanning several pages; probably I will just take the full images in that case as well, while for a list of targets, full pages, or lines when indicated, will be brought up. Clicking on the images will link to the viewer, opened at the right image.

PietroLiuzzo commented 4 years ago

A file with an updated description, facsimiles and transcription will become rather large, about 1 MB, and exceed the capabilities of the GitHub hook that handles it. When the load of this work becomes considerable, we may after all have to consider xi:include-ing these parts, or at least the facsimiles.
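A sketch of what such a split might look like (the file name and placement are illustrative assumptions, not an agreed convention):

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:xi="http://www.w3.org/2001/XInclude">
  <teiHeader><!-- manuscript description stays in the main file --></teiHeader>
  <!-- facsimiles (and possibly the transcription) kept in a separate,
       smaller file; the href below is a placeholder -->
  <xi:include href="ESap046-facsimile.xml"/>
  <text><!-- transcription --></text>
</TEI>
```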

PietroLiuzzo commented 3 years ago

In transkribus2BM.xsl, facsimiles should be grouped into one per set, as in DabraLibanosGG1. idno/@facs should, where possible, point at the facsimile's xml:id, which should have a corresp to the IIIF location. The ids generated for the facsimile should be copied over to surface.
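For clarity, a sketch of the intended output structure (the ids, shelfmark and manifest URL are placeholders, and the exact id scheme for surfaces is not specified in the note above):

```xml
<!-- idno/@facs points at the facsimile's xml:id -->
<msIdentifier>
  <idno facs="#facs_1">Example shelfmark</idno>
</msIdentifier>
<!-- one facsimile per set; corresp points to the IIIF location -->
<facsimile xml:id="facs_1" corresp="https://example.org/iiif/manifest.json">
  <!-- the generated facsimile id is carried over into the surface ids -->
  <surface xml:id="facs_1_s1">
    <zone ulx="0" uly="0" lrx="100" lry="40"/>
  </surface>
</facsimile>
```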