ScientificPublishing / SciPub

OCR priorities and test #2

Open artydont opened 1 year ago

artydont commented 1 year ago

This is a test to confirm that everyone is subscribed to watching this repo and got an email about this new Issue. Please email or SMS me to confirm you received this text by email.

The top OCR priority is to determine whether the Ira Gollobin book needs to be rescanned before the loaned hardcopy has to be returned.

I hope to provide an answer within a few days.

But if not, others will need to install on either Windows or Mac:

ABBYY FineReader

Do learn how to use it NOW so we can be sure to process the book before the loan expires.

Start here:

https://pdf.abbyy.com/media/1676/users_guide.pdf

Continue via resource center etc:

https://pdf.abbyy.com/resources/

I will be reading the ABBYY docs too, as they are probably better for understanding the principles of what to do with Tesseract than any docs specific to it.

I don't use Windows or Apple but did notice the Windows version is available via torrents for longer than the free trial version:

https://therarbg.com/get-posts/keywords:ABBY:category:Apps/

found magnet link here:

https://therarbg.com/post-detail/087b52/abby-fine-reader-moebius88/

You need to have a torrent client to use the magnet link provided there. You also need a working backup and restore in case a file from an unknown provider does damage.

Somebody else should find the Apple version.

Rescan

If it turns out a rescan might be useful it would be best to start with the backmatter. We won't need the frontmatter with Library of Congress catalog card info since that is available online.

  1. Index, pp. 574-607. Particularly worth scanning to get names correct so as to add them to the spelling dictionary. Remaining misspelled words will then almost all be OCR typos.
  2. Bibliography, pp. 561-573. Similar reasons, including additional words and phrases not in the spelling dictionary. Will also need to be processed separately into standard bibliographic references, hopefully fully automated, but this may require some manual markup while proofreading.
  3. Notes, pp. 507-560. Similar reasons, with more complex processing and likely more manual markup to link to the bibliographic items from item 2.
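The spelling dictionary idea in item 1 can be sketched roughly as follows. This is only a minimal illustration, assuming the index has already been OCRed and proofread into plain text and that some base word list is available. The function names and the sample data are hypothetical and are not taken from the book or from any particular tool.

```python
import re

# A word starts with a letter and may contain letters, apostrophes or hyphens.
WORD_RE = re.compile(r"[A-Za-z][A-Za-z'-]*")


def words_in(text):
    """Return the lowercased words found in a block of text."""
    return [w.lower() for w in WORD_RE.findall(text)]


def build_dictionary(base_words, index_text):
    """Combine a base word list with every word from the proofread index."""
    return {w.lower() for w in base_words} | set(words_in(index_text))


def suspect_words(page_text, dictionary):
    """Return words not in the dictionary, in page order, without repeats."""
    seen = []
    for w in words_in(page_text):
        if w not in dictionary and w not in seen:
            seen.append(w)
    return seen


# Hypothetical sample data for illustration only.
base = ["the", "was", "a", "lawyer", "in", "new", "york"]
index_text = "Gollobin, Ira, 12, 45\nKautsky, Karl, 101"
dictionary = build_dictionary(base, index_text)

page = "Ira Gollobin was a lawver in New York"
print(suspect_words(page, dictionary))  # ['lawver']
```

Once the names from the index are in the dictionary, they are no longer flagged, so what remains is mostly genuine OCR typos such as "lawver" here.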

Automated Processing

But don't worry. ONLY finding out whether the scan resolution was sufficient is urgent (and might need ABBYY FineReader software, so requires reading its docs now).
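As a rough first check on scan resolution that needs no OCR software, the effective resolution can be estimated from the pixel dimensions of a page image and the physical page size. The sketch below is only an illustration: the 300 DPI threshold is a commonly cited rule of thumb for OCR and is an assumption here, and the page size and pixel counts are made up rather than measured from the actual scan.

```python
def effective_dpi(pixels_wide, pixels_high, page_width_in, page_height_in):
    """Return (horizontal, vertical) dots per inch for a scanned page."""
    return pixels_wide / page_width_in, pixels_high / page_height_in


def resolution_ok(pixels_wide, pixels_high, page_width_in, page_height_in,
                  min_dpi=300):
    """True if both directions reach min_dpi (300 is an assumed threshold)."""
    horizontal, vertical = effective_dpi(
        pixels_wide, pixels_high, page_width_in, page_height_in)
    return min(horizontal, vertical) >= min_dpi


# Hypothetical example: a 6 x 9 inch page scanned at 1200 x 1800 pixels.
print(effective_dpi(1200, 1800, 6.0, 9.0))   # (200.0, 200.0)
print(resolution_ok(1200, 1800, 6.0, 9.0))   # False
```

The pixel dimensions can be read from any image viewer, and the page size measured with a ruler while the hardcopy is still on loan.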

Automated text processing after OCR is now "standard". I will write up some notes later. But here are some links for those curious:

https://www.digitisation.eu/knowledge/digitisation-tools/

https://guides.nyu.edu/tesseract/home

https://drops.dagstuhl.de/opus/volltexte/2023/18522/pdf/OASIcs-SLATE-2023-8.pdf

https://github.com/UB-Mannheim/ocr-fileformat

https://pdfa.org/resources/

Advanced and Translation Processing

Siemens, Ray_Schreibman, Susan_Susan Schreibman - A Companion to Digital Literary Studies-Wiley (2013).epub

http://library.lol/main/FB9723E570F187CF9FB78D231D0C80CD

https://companions.digitalhumanities.org/DLS/

https://libgen.is/book/index.php?md5=1CCF28190F6B9DEAF35AFC93F17D3DEC

Available Software

  1. Nuance Power PDF Standard 2.1 (dave)
  2. OmniPage 19 (dave)
  3. Tesseract 5.3.0, leptonica-1.83.1 with gImageReader 3.4.1 on Fedora 38 GNOME, with other items listed in the NYU Tesseract tutorial linked above (arty)

File Sharing

There are many pages in the wikipedia series on:

https://en.wikipedia.org/wiki/File_sharing

Philosophically and politically important to understand the implications of this stuff even if not able to get into technical details. We CANNOT be old-fangled.

Note especially the overlap with open science, academic publishing, wikimedia etc and:

https://en.wikipedia.org/wiki/Anna%27s_Archive

Also be aware of:

https://en.wikipedia.org/wiki/HathiTrust

https://www.hathitrust.org/

Later OCR work

Most stuff we are likely to work on for legally producing "Companion to" volumes probably won't need OCR scanning and proofreading, as the text is readily available online.

One that probably will need scanning (though it is not urgent, unlike figuring out how to determine scan resolution) is the more recent translation of the following work, for which only the earlier translation is publicly available online:

The road to power : political reflections on growing into the revolution / Karl Kautsky ; edited by John H. Kautsky ; new translation by Raymond Meyer.

https://catalog.hathitrust.org/Record/003077628

Tomism commented 1 year ago

hi

artydont commented 1 year ago

hi Tom