Yuras / pdf-toolbox

A collection of tools for processing PDF files in Haskell

spaces are randomly sprinkled through output #64

Open eflister opened 3 years ago

eflister commented 3 years ago

i'm reading data out of a 3-column table in some weekly covid report pdfs. it usually works fine, but in two out of ~30 pdfs, pdf-toolbox sprinkles spaces into a few of the numbers, apparently at random. here's an example: randomspaces.pdf

the table covers the final few pages of the pdf.  it lists zip codes in numerical order.  most come out fine, but here's some code that prints out the lines with the extra spaces. i contrast it with pdftotext, a binding to poppler. both it and the command line pdftotext that comes with poppler show the correct output without spaces.

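{-# LANGUAGE OverloadedStrings #-}

-- imports below are assumed from the identifiers used; adjust to your package versions
import Data.Maybe (fromJust)
import qualified Data.Text as T
import Pdf.Document   -- pdf-toolbox-document
import Pdftotext (Layout (Physical), openFile, pdftotext)   -- pdftotext (poppler binding)

main = do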
  let f = "randomspaces.pdf"
      check t = do
        -- skip to the table we're interested in
        let table = dropWhile (not . T.isInfixOf (T.toCaseFold "cases by ZIP")) (T.lines $ T.toCaseFold t)

        -- pdftotext sees a few more lines than pdf-toolbox, having to do with blank lines, headers/footers, etc
        putStrLn $ "\nlines: " ++ show (length table)
        -- mapM_ print table

        -- display lines with offending spaces
        putStrLn "bads:"
        mapM_ print $ filter (and . ([ T.isPrefixOf "97" . T.concat
                                     , not . T.isInfixOf "n/a" . T.concat
                                     , (/= 3) . length
                                     ] <*>) . pure) $ T.words <$> table

  -- pdf-toolbox puts random spaces in 10 different lines
  withPdfFile f $ \pdf -> check =<< extract pdf =<< catalogPageNode =<< documentCatalog =<< document pdf

  -- spaces not present in poppler bindings
  check =<< pdftotext Physical <$> fromJust <$> openFile f

extract pdf = (T.concat <$>) . (traverse ((extract' =<<) . loadPageNode pdf) =<<) . pageNodeKids
  where extract' (PageTreeLeaf tn) = putStr "." >> pageExtractText tn
        extract' (PageTreeNode tn) = do
         (putStr . show) =<< pageNodeNKids tn
         extract pdf tn

output:

lines: 385
bads:
["970","34","114","603.0"]
["97060","396","186","5.6"]
["971","33","18","450.0"]
["97210","59","5","41.9"]
["973","05","1082","2693.2"]
["97405","154","344.","9"]
["97470","73","36","5.3"]
["97520","114","4","65.9"]
["97603","200","677.","4"]
["979","13","337","6097.3"]

lines: 390
bads:

Yuras commented 3 years ago

Thank you for the bug report. There is some fuzzy logic that inserts missing spaces (and also newlines): https://github.com/Yuras/pdf-toolbox/blob/f1d20479fe6782bc293468ac05186e658216388b/document/lib/Pdf/Document/Page.hs#L293 You can get the actual glyphs using this function: https://github.com/Yuras/pdf-toolbox/blob/f1d20479fe6782bc293468ac05186e658216388b/document/lib/Pdf/Document/Page.hs#L205, but then you'll have to deal with missing spaces instead :) The reason for the bug seems to be a failure to parse a font. I'll take a closer look later this week (I hope).
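To make the failure mode concrete, here is a minimal sketch of a gap-based space heuristic. It is not the code at the links above: the `Glyph` record, `joinGlyphs`, and the sample coordinates are all hypothetical, and the real logic is likely more involved. It only illustrates how a glyph whose bounding box is recorded too narrow can make a digit run look like two words.

```haskell
import qualified Data.Text as T

-- | Hypothetical glyph record: the decoded text plus the left and right
-- edges of its bounding box, in arbitrary page units.
data Glyph = Glyph
  { glyphText  :: T.Text
  , glyphLeft  :: Double
  , glyphRight :: Double
  }

-- | Join the glyphs of one line, inserting a space wherever the horizontal
-- gap between two neighbours exceeds the given threshold.
joinGlyphs :: Double -> [Glyph] -> T.Text
joinGlyphs _ [] = T.empty
joinGlyphs threshold gs@(g : rest) =
  T.concat (glyphText g : zipWith step gs rest)
  where
    step prev cur
      | glyphLeft cur - glyphRight prev > threshold = T.cons ' ' (glyphText cur)
      | otherwise = glyphText cur

-- | Five digits, each 5 units wide and touching its neighbour.
wellMeasured :: [Glyph]
wellMeasured =
  [ Glyph (T.pack "9") 0 5
  , Glyph (T.pack "7") 5 10
  , Glyph (T.pack "0") 10 15
  , Glyph (T.pack "3") 15 20
  , Glyph (T.pack "4") 20 25
  ]

-- | Same positions, but the third glyph's width is under-reported (made-up
-- numbers), which opens an artificial 3-unit gap before the fourth glyph.
misMeasured :: [Glyph]
misMeasured =
  [ Glyph (T.pack "9") 0 5
  , Glyph (T.pack "7") 5 10
  , Glyph (T.pack "0") 10 12
  , Glyph (T.pack "3") 15 20
  , Glyph (T.pack "4") 20 25
  ]
```

With a threshold of 1, `joinGlyphs 1 wellMeasured` yields `"97034"` while `joinGlyphs 1 misMeasured` yields `"970 34"`, the same shape as the first bad line in the report.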

Yuras commented 3 years ago

So I checked the file. It uses the standard fonts, and those don't include char widths in the PDF itself, so we extract glyphs with incorrect bounding boxes. To fix it we need to include AFM files (like there) and parse them to get the widths for the standard fonts. I'll try to find time for that, but it'd be faster if someone else took care of it.
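As a rough idea of what the AFM parsing step could look like, here is a sketch that reads character-metrics lines of the form `C 48 ; WX 556 ; N zero ; B 37 -19 519 703 ;` into a glyph-name to width table. The names `parseCharMetric` and `widthTable` are hypothetical and not part of pdf-toolbox; the sketch ignores kerning, composites, and everything else in the file, and it assumes each metrics line carries a `WX` and an `N` field.

```haskell
import qualified Data.Map.Strict as Map
import Data.Maybe (listToMaybe, mapMaybe)
import qualified Data.Text as T
import qualified Data.Text.Read as TR

-- | Parse one AFM character-metrics line into (glyph name, advance width).
-- Returns Nothing for lines without both a WX and an N field.
parseCharMetric :: T.Text -> Maybe (T.Text, Int)
parseCharMetric line = do
  -- Fields are separated by ';', and each field is "KEY arg arg ...".
  let fields = map T.words (T.splitOn (T.pack ";") line)
  [w]    <- lookupField "WX" fields
  [name] <- lookupField "N" fields
  (width, rest) <- either (const Nothing) Just (TR.decimal w)
  if T.null rest then Just (name, width) else Nothing
  where
    lookupField :: String -> [[T.Text]] -> Maybe [T.Text]
    lookupField key fs = listToMaybe [args | (k : args) <- fs, k == T.pack key]

-- | Build a width table from the full text of an AFM file, silently
-- skipping headers, kerning data, and any other non-metrics lines.
widthTable :: T.Text -> Map.Map T.Text Int
widthTable = Map.fromList . mapMaybe parseCharMetric . T.lines
```

A table like this would only supply the missing widths; mapping character codes to glyph names through the font's encoding is a separate step not shown here.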

eflister commented 3 years ago

thanks for looking into it! it's not urgent for me, i have other solutions, but great to know the cause :)