dbmi-pitt / dbmi-annotator

Based on annotator.js, an annotation framework that supports user accounts, annotation permission management, and templated annotation plugins for the biomedical domain.
Apache License 2.0

[PDF] quote in form editor run words together #227

Open ningyifan opened 6 years ago

ningyifan commented 6 years ago

When I highlight text and bring it into the annotation, some of the words are bunched up, i.e., there are no spaces between them.

Example: in Wiley Kwan_1999_9987702, every quote saved in Elasticsearch has no spaces. By comparison, the PDF of the Aldridge 2001 article (also from Wiley) keeps the spaces within a line but does not interpret the line-break character (issue #27).

From Amy: For example, among the Wiley articles, Aldridge, Andrus, Knudsen, Odishaw, Robertson, and Simonson 2005 run words together a few times, but all annotations for Parra and Kwan have no spaces between any words. Dixon had no spaces either and highlighted very oddly (the blue highlights were very broken up, not a solid blue highlight line like the others), so I did not save it.

Analysis:

Reason: pdf.js can't handle white space in scanned PDFs, and it skips the line-break character when text is selected with the mouse.

Action: (1) OCR all scanned PDFs. This should also fix the missing line-break characters.

Kwan_1999 works well after OCR. Awni_1995 is a scanned book, and part of the article cannot be annotated.
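
To find saved quotes that are affected, a rough check like the one below might help. It is only a sketch: the function names and the 30-character threshold are assumptions made for illustration, not part of dbmi-annotator, and long chemical names could trigger false positives.

```python
def longest_unbroken_run(text: str) -> int:
    """Return the length of the longest stretch of text with no whitespace."""
    return max((len(chunk) for chunk in text.split()), default=0)


def looks_run_together(quote: str, max_run: int = 30) -> bool:
    """Heuristic: flag a quote whose longest whitespace-free run exceeds max_run.

    max_run is an assumed threshold; tune it against real quotes.
    """
    return longest_unbroken_run(quote) > max_run
```

A quote such as "Theconcentrationofthe(R)-and(S)-enantiomersofwarfarinintheserum" is flagged, while the same sentence with its spaces intact is not.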

Issues: In some cases, OCR may misread content in documents that are visually hard to read.

ex. Awni_1995
Zileuton (Ahhotr-64077) is a potent inhibitor of leukotriene biosynthesis (original)
Zileuton (Ahhotr-64077) is cl potent inhibitor of leukotriene bio.,ynthesis (OCR)

ex. Kwan_1999
The concentration of the (R)-and (S)-enantiomers of warfarin in the serum (original)
The concentration of the {R)-and (S)-enantiomers of warfarin in the serum (OCR)
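
When a trusted copy of the text is available, the words that OCR changed could be listed automatically for the person correcting them. The sketch below uses Python's standard `difflib`; the helper name `ocr_word_diffs` is hypothetical and this is not how the project currently works.

```python
import difflib
from typing import List, Tuple


def ocr_word_diffs(original: str, ocr: str) -> List[Tuple[str, str]]:
    """Return (original_words, ocr_words) pairs for every span that differs."""
    a, b = original.split(), ocr.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join the differing words so each pair reads as a short phrase.
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs
```

For the Kwan_1999 pair above this returns `[("(R)-and", "{R)-and")]`, so the reviewer only has to look at the one changed word.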

Workflow:

  1. OCR the scanned PDFs
  2. The annotator highlights claims, data, and materials in the PDF reader
  3. A second person manually corrects OCR errors in the highlighted text
  4. Add the processed PDFs to AP

(2) We need to manually review the PDF documents before delivering them to users.
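
Part of that review could be pre-screened in a script. The sketch below assumes the text of each page has already been extracted by some other tool and is passed in as a list of strings; the function names and all thresholds are assumptions for illustration only. A page with almost no text suggests an image-only scan, and a page with text but almost no whitespace matches the run-together symptom described in this issue.

```python
from typing import Sequence


def classify_page(text: str, min_chars: int = 50,
                  min_space_ratio: float = 0.05) -> str:
    """Label one page's extracted text as 'no_text', 'no_spaces', or 'ok'."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return "no_text"  # likely an image-only (scanned) page
    spaces = sum(ch.isspace() for ch in stripped)
    if spaces / len(stripped) < min_space_ratio:
        return "no_spaces"  # text layer exists but words run together
    return "ok"


def needs_ocr(pages: Sequence[str], max_bad_fraction: float = 0.2) -> bool:
    """Suggest OCR when too many pages look empty or run together."""
    if not pages:
        return True  # nothing extractable, so send it for review
    bad = sum(classify_page(page) != "ok" for page in pages)
    return bad / len(pages) > max_bad_fraction
```

This would only narrow down which documents to open first; it does not replace the manual check.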

Reference: how to detect whether a PDF is scanned: http://blogs.adobe.com/acrolaw/2010/06/how-can-i-detect-if-a-pdf-needs-to-be-ocrd/

Correcting OCR errors: http://www.onelegal.com/blog/how-to-correct-ocr-errors-using-adobe-acrobat/