atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Does not scrape 'old style numbers' #327

Closed by davidkong0987 4 years ago

davidkong0987 commented 5 years ago

Hi, this might be an issue with PyPDF2, but old-style numbers (also known as lowercase numbers) are not scraped properly. For example, for this PDF: http://wwwimages.adobe.com/content/dam/acom/en/products/type/pdfs/AdobeGaramondPro.pdf

camelot.read_pdf([filepath], flavor='stream', pages='10')

returns the following:

Adobe Garamond Pro Glyphs Adobe Garamond Pro’s large glyph complement was designed to fur- ther meet the exacting requirements of professional typographers and designers throughout the world. Its diverse international character set encompasses most Latin-based languages. Most of these glyphs can be easily accessed and applied with OpenType savvy applications such as InDesign.   (cid:4) ese glyphs are designed with ascenders and descenders and have features and proportions compatible with the lowercase characters of the typeface. Oldstyle fi gures, also known as hanging fi gures, typically are used for text setting because they blend in well with the lowercase. In Adobe Garamond Pro they are available in both fi tted and tabular versions.
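The `(cid:4)` in the output above looks like the placeholder that pdfminer-style text extractors emit when a glyph has no Unicode mapping in the font, which may also be why the old-style figures are missing. Assuming that reading is right, the sketch below scans extracted text for such placeholders so affected pages can be spotted quickly. `find_cid_placeholders` is a hypothetical helper name used for illustration only; it is not part of camelot or PyPDF2.

```python
import re

# Matches placeholders such as "(cid:4)" and captures the numeric glyph ID.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def find_cid_placeholders(text):
    """Return the glyph IDs of every "(cid:N)" placeholder in text, in order."""
    return [int(match.group(1)) for match in CID_PATTERN.finditer(text)]


# A fragment copied from the output reported above.
sample = "such as InDesign.   (cid:4) ese glyphs are designed with ascenders"
print(find_cid_placeholders(sample))  # [4]
```

An empty result does not prove the extraction is complete: glyphs that are dropped silently, as the old-style figures appear to be here, leave no placeholder behind.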