ZoeLeBlanc / dissertation

Repository for my dissertation, "Circulating Anti-Colonial Cairo: The United Arab Republic, News Media, and The Struggle to Decolonize The International Information Order, 1952-1978"

Finalize OCR Accuracy Metrics #33

Open ZoeLeBlanc opened 6 years ago

ZoeLeBlanc commented 6 years ago

Context

  1. What's the issue that needs to be solved?
    • Currently the quality of OCR varies depending on whether I take the time to identify and annotate paragraph separations.
    • I may also need to quantify the difference between the Google OCR engine and manually transcribing the data.
    • I need to know the error margin between ordered and unordered OCR for various publications, and between transcription, ordered OCR, and unordered OCR.
  2. How do you plan to solve/revise?
    • Develop an OCR accuracy methodology and test it before deploying it across my data.
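
A minimal sketch of what such an accuracy methodology could look like, assuming a manual transcription serves as the reference text. The function names (`char_error_rate`, `word_error_rate`, `bag_of_words_error`) are hypothetical and are not taken from the notebook; the metrics are standard edit-distance error rates plus an order-insensitive word comparison, which may help separate recognition errors from reading-order errors in unordered OCR.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return levenshtein(ref_words, hyp_words) / len(ref_words)


def bag_of_words_error(reference, hypothesis):
    """Order-insensitive error: share of reference words missing from the OCR output."""
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hypothesis.split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    matched = sum((ref_counts & hyp_counts).values())
    return 1 - matched / total


# Assumed workflow: compare each OCR variant against the same transcription.
transcription = "the cat sat on the mat"
ocr_variants = {
    "ordered": "the cat sat on the mat",
    "unordered": "on the mat the cat sat",
}
for name, text in ocr_variants.items():
    print(name,
          char_error_rate(transcription, text),
          word_error_rate(transcription, text),
          bag_of_words_error(transcription, text))
```

A gap between the order-sensitive rates and the bag-of-words error for the same page would suggest that the text was recognized correctly but read in the wrong order, rather than misrecognized.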

Associated Issues

To Do List

ZoeLeBlanc commented 6 years ago

Tested the following methods:

Keeping notes here

ZoeLeBlanc commented 6 years ago

I now have everything running in a Jupyter notebook titled ordered and unordered text. It seems like spaCy and smw alignment are a bit useless, but I want to look at Arabic before I remove them.

ZoeLeBlanc commented 6 years ago