Based on annotator.js: an annotation framework that adds user account and annotation permission management and a templated annotation plugin for the biomedical domain.
Apache License 2.0
[PDF] quote in form editor run words together #227
When I highlight text and bring it into the annotation, some of the words are bunched up, i.e., there are no spaces between them.
Example: in Wiley Kwan_1999_9987702, every quote saved in Elasticsearch has no spaces at all. By comparison, the Aldridge 2001 PDF (also from Wiley) keeps the spaces within a line but does not handle the line-break character (see issue #27).
From Amy:
For example, among the Wiley articles, Aldridge, Andrus, Knudsen, Odishaw, Robertson, and Simonson 2005 run words together a few times, but all annotations for Parra and Kwan have no spaces between any words. Dixon had no spaces either and highlighted very oddly (the blue highlights were very broken up, not a solid blue line like the others), so I did not save it.
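The symptom reported above could be checked automatically on saved quotes. The following is a minimal sketch, not part of the project: the function name and both thresholds are assumptions, and long chemical names in biomedical text may need a higher token limit.

```python
def looks_run_together(quote: str, min_length: int = 30, max_token_length: int = 25) -> bool:
    """Heuristic: a long quote whose longest whitespace-free token is
    implausibly long was probably saved without spaces.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    text = quote.strip()
    if len(text) < min_length:
        # Too short to judge; a single long word is not evidence of a problem.
        return False
    longest = max(len(token) for token in text.split())
    return longest > max_token_length


good = "The concentration of the (R)-and (S)-enantiomers of warfarin in the serum"
bad = good.replace(" ", "")  # simulates a quote saved with no spaces
print(looks_run_together(good), looks_run_together(bad))
```

A check like this only flags candidates for review; it cannot restore the missing spaces.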
Analysis:
Reason:
pdf.js cannot recover white space in scanned PDFs, and it skips the line-break character when text is selected with the mouse.
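To illustrate the line-break part of the problem: if the text of consecutive lines is concatenated with no separator, the last word of one line fuses with the first word of the next. The sketch below is a hypothetical helper (it is not pdf.js code and does not use any pdf.js API) that joins line fragments with a single space instead. It does not handle words hyphenated across lines.

```python
import re


def join_selected_lines(lines: list[str]) -> str:
    """Join the text of consecutive selected lines with one space each,
    so a dropped line break does not fuse two words together."""
    pieces = [line.strip() for line in lines if line.strip()]
    # Collapse any repeated whitespace left inside the fragments.
    return re.sub(r"\s+", " ", " ".join(pieces))


lines = [
    "The concentration of the",
    "(R)-and (S)-enantiomers of warfarin",
    "in the serum",
]
print("".join(lines))              # naive concatenation fuses words at line ends
print(join_selected_lines(lines))  # line breaks become single spaces
```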
Action:
(1) Running OCR on all scanned PDFs should work. The missing line-break character is fixed at the same time.
Kwan_1999 works well after OCR.
Awni_1995 is a scanned book, and part of the article still cannot be annotated.
Issues
In some cases, OCR may misread content in documents that are visually hard to read.
ex. Awni_1995
Zileuton (Ahhotr-64077) is a potent inhibitor of leukotriene biosynthesis (original)
Zileuton (Ahhotr-64077) is cl potent inhibitor of leukotriene bio.,ynthesis (OCR)
ex. Kwan_1999
The concentration of the (R)-and (S)-enantiomers of warfarin in the serum (original)
The concentration of the {R)-and (S)-enantiomers of warfarin in the serum (OCR)
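When a trusted version of the sentence is available (for example, typed by a person reading the page), OCR errors like the ones above can be listed with a token-level diff. This is a minimal sketch using Python's standard `difflib`; the function name `ocr_token_diffs` is an assumption for illustration.

```python
import difflib


def ocr_token_diffs(original: str, ocr: str) -> list[tuple[str, str]]:
    """Return (original, ocr) pairs for every token span that differs."""
    a = original.split()
    b = ocr.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


original = "The concentration of the (R)-and (S)-enantiomers of warfarin in the serum"
ocr = "The concentration of the {R)-and (S)-enantiomers of warfarin in the serum"
print(ocr_token_diffs(original, ocr))
```

For the Kwan_1999 example this reports the single pair `("(R)-and", "{R)-and")`.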
Workflow:
OCR scanned PDFs
Annotator highlights claim, data, and material in the PDF reader
A second person manually corrects OCR errors in the highlighted text
Add processed PDFs to AP
(2) We need to manually scan through the PDF documents before delivering them to users.
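The manual correction and screening steps could be partly supported by a heuristic that points the reviewer at suspicious tokens. The sketch below is an assumption, not an existing tool in this project: it flags tokens where a bracket is closed by the wrong kind of bracket, or where two or more punctuation marks sit inside a word.

```python
import re

OPENERS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(OPENERS.values())
# Two or more punctuation marks between letters, as in "bio.,ynthesis".
INNER_PUNCT = re.compile(r"[A-Za-z][.,;:]{2,}[A-Za-z]")


def has_mismatched_bracket(token: str) -> bool:
    """True when a bracket opened inside the token is closed by a different kind.
    Unclosed or unopened brackets are ignored, since parentheses often span words."""
    stack = []
    for ch in token:
        if ch in OPENERS:
            stack.append(OPENERS[ch])
        elif ch in CLOSERS and stack:
            if stack.pop() != ch:
                return True
    return False


def suspicious_tokens(text: str) -> list[str]:
    """Return the tokens a reviewer should look at first."""
    return [
        token
        for token in text.split()
        if has_mismatched_bracket(token) or INNER_PUNCT.search(token)
    ]


print(suspicious_tokens("The concentration of the {R)-and (S)-enantiomers of warfarin in the serum"))
print(suspicious_tokens("Zileuton (Ahhotr-64077) is cl potent inhibitor of leukotriene bio.,ynthesis"))
```

Note that the misread "cl" in the Awni_1995 example is not caught, because it looks like an ordinary token. A heuristic like this can only prioritize the manual review, not replace it.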
References:
Detecting whether a PDF is scanned: http://blogs.adobe.com/acrolaw/2010/06/how-can-i-detect-if-a-pdf-needs-to-be-ocrd/
Correcting OCR errors: http://www.onelegal.com/blog/how-to-correct-ocr-errors-using-adobe-acrobat/