craigburke / document-builder

Groovy Document Builder

Apostrophe in string turns into mangled text #29

Open jmlittle opened 8 years ago

jmlittle commented 8 years ago
import com.craigburke.document.builder.PdfDocumentBuilder

OutputStream out = new ByteArrayOutputStream()

// Note the curly apostrophe (right single quote, U+2019) in "don’t"
def testText = "I don’t have very much experience with hardware or hacking (aside from high school robotics), but I am most willing to try.\r"
def builder = new PdfDocumentBuilder(out)
builder.create {
    document(font: [family: 'Helvetica', size: 12.pt], margin: [top: 0.75.inches]) {
        paragraph {
            font.size = 18.pt
            text "${testText}", font: [bold: true]
        }
    }
}
return out.toByteArray()

Using the above code, the output is:

"þÿ I d o n t h a v e v e r y m u c h e x p e r i e n c e w i t h hacking (aside from high school robotics), but I am most willing to try".

Any user-supplied apostrophe seems to turn into an empty character, and the rest of the text comes out with mangled lines and fonts.

jmlittle commented 8 years ago

So, the issue is somewhat resolved. I am normalizing the text and also explicitly converting the Unicode right single quote (from pasting out of Word) into an ASCII single quote. However, for users submitting Unicode I get a stray single 'd' or 'y' at the beginning of random lines, likely due to artifacts from other Unicode content. It would be ideal if there were some basic normalization routines, from PDFBox or via the services here over time, to help sanitize user-generated input before it is processed by document-builder.
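For anyone hitting the same problem, the workaround described above might look roughly like the sketch below. The helper name `sanitizeForPdf` is hypothetical (it is not part of document-builder or PDFBox), and the choice of NFKC as the normalization form is an assumption, since the comment does not say which form was used. It relies only on the JDK's `java.text.Normalizer`.

```groovy
import java.text.Normalizer

// Hypothetical helper: not part of document-builder or PDFBox.
String sanitizeForPdf(String input) {
    // Assumption: NFKC is used here; it folds compatibility characters
    // (ligatures, full-width letters, etc.) into their plain equivalents.
    String normalized = Normalizer.normalize(input, Normalizer.Form.NFKC)

    // Normalization leaves curly single quotes untouched, so they are
    // mapped to an ASCII apostrophe explicitly.
    return normalized.replace('\u2018', "'").replace('\u2019', "'")
}

def cleaned = sanitizeForPdf("I don\u2019t have very much experience")
println cleaned
```

The cleaned string would then be passed to `text` in place of the raw user input. This only covers the apostrophe case; it does not address the stray characters at the start of lines mentioned above.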

craigburke commented 8 years ago

@jmlittle it looks like version 2.0.0 of PDFBox will handle unicode much better. As soon as they put out a final stable release I'll definitely use it.

jmlittle commented 8 years ago

I also noticed that it causes the characters at the end of a line to be duplicated on the following line. For example, one line ended in "and" and the next line began with "d rest". It seems to be an off-by-one error in the word wrap calculation.

craigburke commented 8 years ago

@jmlittle would you be able to create a new issue with a simple example of the word wrap issue?

jmlittle commented 8 years ago

I'll look into it. I have to find a way to create a sample project that isn't using the real text that was submitted (a student record), but it appears that pasting from various international editions of Word is the cause.

jmlittle commented 8 years ago

New issue submitted with a sample Groovy script.

jmlittle commented 8 years ago

Any update on the new issue?
