ebeshero / Amadis-in-Translation

a project to apply TEI markup to investigate early modern Spanish editions of Amadis de Gaula and their translations into English and French from the 1500s to the early nineteenth century.
http://amadis.newtfire.org
GNU Affero General Public License v3.0

OCR leads for early modern historical fonts #62

Open ebeshero opened 8 years ago

ebeshero commented 8 years ago

Let's take a look at http://emop.tamu.edu/outcomes/Franken-Plus and collect other such potential leads here. @setriplette @HelenaSabel

ebeshero commented 7 years ago

@halperta Greetings Hannah! We've just been talking at the SHARP 2017 conference about your transcription tool using Ocular! @setriplette and @HelenaSabel, Hannah Alpert-Abrams at UT Austin says she can definitely help us auto-generate Montalvo transcriptions, and it helps a lot that we have some transcriptions done by hand to get started.

Hannah, here's the part of our repo with the photofacsimiles of Montalvo: https://github.com/ebeshero/Amadis-in-Translation/tree/master/book-images/Montalvo-1547

And, here are the TEI XML files containing our hand transcriptions. The ones with full <cl> markup in them with @xml:ids are the reliable files to work with...though there are several we've had students producing (mostly with very light XML markup) that have a bunch of errors we know we need to fix. Stacey can help orient! Thanks HUGELY for offering to help, and I hope our project can be useful to you!

halperta commented 7 years ago

Hello! I'm sending you some automatically produced transcriptions of the Montalvo document.

montalvo_transcriptions.zip

I played a little with the parameters but this is a pretty basic "dirty" transcription using Ocular (https://github.com/tberg12/ocular). The attached folder contains both xml (ALTO) and plain text transcriptions; I didn't sort them, but I'm sure you can separate them to make it easier to review. Take a look and let me know what you think! And let me know if you want to discuss further.

Hannah
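For anyone who wants to separate the unsorted ALTO and plain-text files mentioned above, here is a minimal sketch. It assumes the ALTO files end in `.xml` and the plain-text files in `.txt`, which the thread does not confirm; the function name `sort_by_extension` and the subfolder names `alto` and `text` are made up for illustration and are not part of Ocular.

```python
import shutil
from pathlib import Path


def sort_by_extension(folder, mapping=None):
    """Move files into subfolders chosen by file extension.

    `mapping` pairs a lowercase extension with a subfolder name. The
    defaults ("alto" and "text") are illustrative names, not Ocular's.
    Returns a dict of subfolder name -> list of moved file names.
    """
    mapping = mapping or {".xml": "alto", ".txt": "text"}
    folder = Path(folder)
    moved = {name: [] for name in mapping.values()}
    # sorted() snapshots the listing before the loop creates subfolders.
    for path in sorted(folder.iterdir()):
        target = mapping.get(path.suffix.lower())
        if path.is_file() and target:
            (folder / target).mkdir(exist_ok=True)
            shutil.move(str(path), str(folder / target / path.name))
            moved[target].append(path.name)
    return moved
```

Calling `sort_by_extension("montalvo_transcriptions")` on the unzipped folder would leave any other file types where they are, so nothing is lost if the extension assumption turns out to be wrong.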

halperta commented 7 years ago

Update: after sending this to you, I reviewed some sample docs for myself and was surprised by how inaccurate it is. I'm not sure if the parameters were off, or what. I'm going to look into it this week and get back to you :)

ebeshero commented 7 years ago

Thanks, @halperta ! We'll wait until you have another go at it...

halperta commented 7 years ago

Okay! I have some results for you. They are... inconsistent, but feel free to take a look at the sample pages on this website:

http://www.halperta.com/amadis/

I'm showing two kinds of automatically produced transcriptions. The "automatic transcriptions" show specific kinds of errors, as described in the text on the website. The "normalized" page corrects some of those errors automatically, making a more readable text... though it has its own problems, including a lack of page breaks. Anyway, I hope you'll be patient with the "dirtiness" of the OCR, and let me know if you see any way that these transcriptions could be useful to you.

Hannah

setriplette commented 7 years ago

This is really interesting! The non-normalized one is actually closer to the text. It could help us transcribe, especially because it's appearing with an image beside and it's got some white space. Correcting that would probably be quicker than working from scratch.

setriplette commented 7 years ago

Ooh, except there's a major problem. It's reading straight across the column break instead of top to bottom down the first column and top to bottom down the second column. Is there anything to do about that?

halperta commented 7 years ago

Hello, I'm so sorry for not replying to this! I'm new to this GitHub conversation thread and finding it very confusing. If you're still interested, can we take the conversation over to email? I'm at halperta@gmail.com

Ocular can't handle columns, so we've been doing manual cropping. But don't be afraid, it's better than it sounds: we found a system that "stacks" pages so you can crop across all of the multicolumn pages at once. It's not the best, and can definitely be slow and unwieldy, but if you think it's worth trying I can show you how it's done.

Hannah
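To make the idea of one crop applied across many two-column pages concrete, here is a rough sketch of the arithmetic only. It is not the stacking system described above: it assumes every page image has the same pixel size and that the column break sits at a fixed fraction of the page width. The function name `column_boxes` and its parameters are hypothetical.

```python
def column_boxes(width, height, gutter=0.5, overlap=0):
    """Return (left, upper, right, lower) pixel boxes for two columns.

    `gutter` is the horizontal position of the column break as a
    fraction of the page width. `overlap` widens each box past the
    break by that many pixels so letters near the gutter are kept.
    """
    if not 0 < gutter < 1:
        raise ValueError("gutter must be a fraction between 0 and 1")
    split = round(width * gutter)
    left_box = (0, 0, min(width, split + overlap), height)
    right_box = (max(0, split - overlap), 0, width, height)
    return left_box, right_box


# One pair of boxes can then be reused for every page in a batch.
left, right = column_boxes(2000, 3000, gutter=0.48, overlap=10)
```

The tuples follow the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so they could be passed to an image library for the actual cropping. Feeding the left-column crops to the OCR before the right-column crops would give the top-to-bottom reading order asked about earlier in the thread.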

ebeshero commented 7 years ago

@halperta Good to know about Ocular, and yes, we are definitely interested in working with it. Now, if tagging you on the Issues board works, GitHub should send you an email message. I really need to keep all the project management discussion tidily on the Issues boards of our GitHub repo because my email is a wild jungle, but if you receive this in your inbox, perhaps we just need to remember to tag you as I did here. Please let me know if/when you receive this, Hannah, and thanks for the great input!

halperta commented 7 years ago

Okay, that should be fine. Can you confirm that replying from my inbox is working and you received this message in return?

Thrilled you want to work with Ocular! Let me try to find the cropping instructions that Taylor Berg-Kirkpatrick painstakingly wrote up, and I'll send them to you. We used them on the first folio and it worked well.


halperta commented 7 years ago

Okay, found the info!

We were using a program called ImageJ to do the cropping. It's a WYSIWYG interface for image editing that you can get here: https://imagej.nih.gov/ij/. As I recall, we followed these instructions:

https://tinyapps.org/blog/misc/201305030700_crop_images_like_briss.html

We used a program to help us sort our files. You can download the program here: manual_crop_demo.zip (https://drive.google.com/file/d/0B8eDwPGQfEV9M3M0Q3NZbWRNQ1E/view?usp=drive_web). Here is how we ran the program and cropped the files. I recommend breaking the book up into smaller subsets of pages, as I think you have already done. The program will sort into verso and recto images and create folders for left and right columns, all helpful in cropping. Then these steps walk you through the process:

Good luck! Hannah
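As an illustration of the verso/recto sorting described above, here is a small sketch. It is not the code in manual_crop_demo.zip, whose contents are not shown in this thread. It assumes that page images alternate recto and verso in sequence and that the last number in each file name gives that sequence; the function name `plan_crop_folders` and the folder names are hypothetical.

```python
import re


def plan_crop_folders(filenames, first_is_recto=True):
    """Map each image file name to the two column folders it feeds.

    Pages are assumed to alternate recto/verso in sequence order, with
    the last number in each name (extension excluded) as the sequence.
    """
    def sequence_number(name):
        stem = name.rsplit(".", 1)[0]  # drop extension so ".jp2" is ignored
        digits = re.findall(r"\d+", stem)
        return int(digits[-1]) if digits else 0

    plan = {}
    for index, name in enumerate(sorted(filenames, key=sequence_number)):
        is_recto = (index % 2 == 0) == first_is_recto
        side = "recto" if is_recto else "verso"
        plan[name] = [f"{side}/left_column", f"{side}/right_column"]
    return plan
```

The function only returns a plan rather than moving files, so the recto/verso assignment can be checked by eye against the photofacsimiles before anything is reorganized.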


ebeshero commented 7 years ago

Thanks for these instructions, @halperta ! It sounds like we should be able to fine-tune the program to improve the results, and we can set to work on that in the coming months. Also, yes indeed, I am reading you loud and clear! I saw your replies in my email as well as here on our GitHub Issues thread.