Open jmlittle opened 8 years ago
So, the issue is somewhat resolved. I am normalizing the text and also explicitly converting the Unicode right single quote (from Word pasting) into an ASCII single quote. However, for users submitting Unicode I still get a stray single 'd' or 'y' at the beginning of random lines, likely due to artifacts from other Unicode content. It would be ideal if some basic normalization routines were available, either from PDFBox or via the services here over time, to help sanitize user-generated input before it is processed by document-builder.
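For anyone hitting the same problem, here is a minimal sketch of the kind of sanitizing described above. It is not part of document-builder or PDFBox. The class name `InputSanitizer`, the choice of NFKC as the normalization form, and the extra handling of the left single quote, double quotes and control characters are all assumptions for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper (not part of document-builder or PDFBox).
final class InputSanitizer {

    // Normalizes user text and maps Word-style "smart" quotes to ASCII.
    static String sanitize(String input) {
        if (input == null) {
            return "";
        }
        // Assumption: NFKC, which composes accents and folds compatibility
        // characters (ligatures, non-breaking spaces) to plain equivalents.
        String s = Normalizer.normalize(input, Normalizer.Form.NFKC);

        // Explicit quote replacements: U+2018/U+2019 and U+201C/U+201D.
        s = s.replace('\u2018', '\'')
             .replace('\u2019', '\'')
             .replace('\u201C', '"')
             .replace('\u201D', '"');

        // Drop control characters other than newline and tab.
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\t' || !Character.isISOControl(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("I don\u2019t have much experience"));
    }
}
```

Running the text through something like this before handing it to the builder keeps the quote handling in one place, though it will not help with characters the PDF font simply cannot render.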
@jmlittle it looks like version 2.0.0 of PDFBox will handle unicode much better. As soon as they put out a final stable release I'll definitely use it.
I also noticed that it causes the end characters of a line to be duplicated on the following line. For example, one line ended in "and" and the next line began with "d rest". It seems to be an off-by-one error in the word wrap calculation.
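To make the expected behaviour concrete, here is a rough sketch of a greedy word wrap in which every word is placed on exactly one line, so a trailing character can never be repeated on the next line. This is not document-builder's actual wrapping code. The class `WordWrapSketch` is hypothetical, and a plain character count stands in for real font metrics.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; real wrapping would measure glyph widths.
final class WordWrapSketch {

    // Greedily packs whole words into lines of at most maxChars characters.
    // A single word longer than maxChars is left on its own line unbroken.
    static List<String> wrap(String text, int maxChars) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // +1 accounts for the space that would join the word to the line.
            boolean overflows =
                current.length() > 0
                    && current.length() + 1 + word.length() > maxChars;
            if (overflows) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : wrap("salt and the rest", 8)) {
            System.out.println(line);
        }
    }
}
```

A useful check for any wrap routine is that joining the output lines with single spaces gives back the original words, with nothing added or lost.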
@jmlittle would you be able to create a new issue with a simple example of the word wrap issue?
I'll look into it. I need to find a way to create a sample project that doesn't use the real text that was submitted (a student record), but it appears that pasting from various international editions of Word is the cause.
New issue submitted with a sample Groovy script.
Any update on the new issue?
Using the above code, the output is:
"þÿ I d o n t h a v e v e r y m u c h e x p e r i e n c e w i t h hacking (aside from high school robotics), but I am most willing to try".
Any user-supplied 's seems to turn into an empty character and leaves the lines and fonts mangled.
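One possible reading of that output, offered as a guess rather than a confirmed diagnosis: `þÿ` is what the bytes 0xFE 0xFF (the UTF-16 big-endian byte order mark) look like when read as Latin-1, and the gap between every letter is what the zero high bytes of UTF-16 would produce. The sketch below reproduces that pattern in plain Java. The class `Utf16MisreadDemo` is hypothetical and says nothing about what PDFBox does internally.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: shows what UTF-16 bytes look like when each byte is
// treated as one Latin-1 character.
final class Utf16MisreadDemo {

    static String misread(String text) {
        // Java's "UTF-16" encoder writes big-endian bytes prefixed by the
        // byte order mark 0xFE 0xFF.
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
        // Decoding those bytes as ISO-8859-1 maps every byte to one char.
        return new String(utf16, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = misread("I don\u2019t");
        // Print code points so the invisible NUL characters are visible.
        for (int i = 0; i < garbled.length(); i++) {
            System.out.printf("%02X ", (int) garbled.charAt(i));
        }
        System.out.println();
    }
}
```

In this reading the right single quote (U+2019) becomes the two bytes 0x20 and 0x19, a space plus a control character, which would be consistent with the apostrophe vanishing from "don t". If that is what is happening, replacing the quote with an ASCII apostrophe before building the document would keep the string out of UTF-16 entirely.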