danfickle / openhtmltopdf

An HTML to PDF library for the JVM. Based on Flying Saucer and Apache PDF-BOX 2. With SVG image support. Now also with accessible PDF support (WCAG, Section 508, PDF/UA)!
https://danfickle.github.io/pdf-templates/index.html

Wrong Position of Accents for Sequences of DIN 91379 #777

Open vk-github18 opened 3 years ago

vk-github18 commented 3 years ago

Wrong position of accents for sequences defined in DIN 91379

Describe the bug

The position of the accents is incorrect for most of the character sequences defined in the following specification:

DIN SPEC 91379: Characters in Unicode for the electronic processing of names and data exchange in Europe; with digital attachment
https://www.xoev.de/downloads-2316#StringLatin
https://www.din.de/de/wdc-beuth:din21:301228458

For example, for the sequence U+0041 U+030B (LATIN CAPITAL LETTER A followed by COMBINING DOUBLE ACUTE ACCENT), the accent appears to the right of the letter A instead of above it.
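A minimal Java sketch of why this sequence depends on mark positioning at all: Unicode has no precomposed character for A with double acute, so NFC normalization leaves the two code points in place, whereas O followed by U+030B composes to the single character U+0150. The class and method names are illustrative only and are not part of OPEN HTML TO PDF or PDFBox.

```java
import java.text.Normalizer;

public class Din91379SequenceCheck {

    /** Returns the number of code points that remain after NFC normalization. */
    static int nfcLength(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        String aDoubleAcute = "A\u030B"; // U+0041 U+030B, no precomposed form exists
        String oDoubleAcute = "O\u030B"; // U+004F U+030B, composes to U+0150

        // The A sequence stays at two code points, so the renderer has to
        // place the combining mark itself (for example via the font's GPOS data).
        System.out.println("A + U+030B after NFC: " + nfcLength(aDoubleAcute) + " code points");
        System.out.println("O + U+030B after NFC: " + nfcLength(oDoubleAcute) + " code points");
    }
}
```

Normalizing the input before rendering therefore cannot work around the problem for such sequences.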

To Reproduce

Render Din91379-Letters.html and Din91379-List.html with OPEN HTML TO PDF.

Expected behavior

The correct rendering should look like the output of HarfBuzz hb-view 2.9.1 for Din91379-Sequences.txt, see Din91379-Sequences.png. HarfBuzz uses the info in the OpenType GPOS table for the positioning of combining diacritical marks.

hb-view.exe -o Din91379-Sequences.png NotoSans-Regular.ttf < Din91379-Sequences.txt

See https://github.com/harfbuzz/harfbuzz.

Screenshots

Rendering with OPEN HTML TO PDF

image

Rendering with HarfBuzz

Din91379-Sequences

System (please complete the following information):

OS: Windows 10
Used Font: NotoSans, NotoSansMath, see https://github.com/googlefonts/noto-fonts/tree/main/hinted/ttf/NotoSans, https://github.com/googlefonts/noto-fonts/tree/main/hinted/ttf/NotoSansMath

Additional context

See also:
https://issues.apache.org/jira/browse/PDFBOX-4951
https://github.com/LibrePDF/OpenPDF/issues/442
https://issues.apache.org/jira/browse/FOP-2969
googlefonts/noto-fonts#1882

Files

Letters of DIN91379

din91379_letters_all.txt din91379_list_all.txt Din91379-Sequences.txt

HTML-Files

Din91379-Letters.html Din91379-List.html

PDF-files rendered with OPEN HTML TO PDF

Din91379-Letters.html.pdf Din91379-List.html.pdf

Java program to reproduce the bug

Test1.java

syjer commented 3 years ago

I would guess it's the same issue as https://github.com/danfickle/openhtmltopdf/issues/763

vk-github18 commented 3 years ago

Yes, both issues stem from the lack of a text shaping engine such as HarfBuzz. It should be possible to implement the change I proposed in https://issues.apache.org/jira/browse/PDFBOX-4951 (comment of 28 Nov 2020) at the interface between OPEN HTML TO PDF and PDFBox, with no change to PDFBox required.

danfickle commented 3 years ago

I've started work on modernizing the advanced shaping PR for pdfbox at danfickle/pdfbox.

The files are under:

It is at a very early stage, but as a proof of concept this is what I'm producing: image

vk-github18 commented 2 years ago

@danfickle Glad to hear that you are working on support for advanced glyph layout. You chose the hard way, implementing all the needed functionality yourself, while I proposed using the glyph layout provided by the Java platform.
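For readers unfamiliar with the platform approach, here is a rough sketch of asking Java2D for glyph positions via Font.layoutGlyphVector. It is not the code from the PDFBox proposal. The logical SansSerif font is only a stand-in (an assumption; a real test would load NotoSans-Regular with Font.createFont), the class and helper names are made up for illustration, and whether the combining mark actually ends up above the base letter depends on the JDK version and the font.

```java
import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;

public class PlatformGlyphLayoutSketch {

    /** Lays out the text with Java2D; returns x,y pairs for each glyph plus the end position. */
    static float[] layoutPositions(Font font, String text) {
        FontRenderContext frc = new FontRenderContext(null, true, true);
        char[] chars = text.toCharArray();
        GlyphVector gv = font.layoutGlyphVector(
                frc, chars, 0, chars.length, Font.LAYOUT_LEFT_TO_RIGHT);
        // numGlyphs + 1 entries: the last pair is the position after the final glyph.
        return gv.getGlyphPositions(0, gv.getNumGlyphs() + 1, null);
    }

    /** Horizontal advance of each glyph, derived from consecutive x positions. */
    static float[] xAdvances(float[] positions) {
        int glyphs = positions.length / 2 - 1;
        float[] advances = new float[Math.max(glyphs, 0)];
        for (int i = 0; i < advances.length; i++) {
            advances[i] = positions[2 * (i + 1)] - positions[2 * i];
        }
        return advances;
    }

    public static void main(String[] args) {
        // Assumption: logical font as a placeholder for NotoSans-Regular.
        Font font = new Font(Font.SANS_SERIF, Font.PLAIN, 20);
        float[] positions = layoutPositions(font, "A\u030B");
        for (int i = 0; i + 1 < positions.length; i += 2) {
            System.out.printf("x=%.2f y=%.2f%n", positions[i], positions[i + 1]);
        }
    }
}
```

How such positions would be passed on to PDFBox when writing the text is not covered by this sketch.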

I tried to lay out the sequences of DIN91379 with the code in the AdvancedTextLayout example but failed, because the font NotoSans-Regular could not be loaded. IMHO this font currently has the best support of DIN91379 among the freely available fonts.

The error occurs when calling OpenTypeFont otFont = fontParser.parse(fontFile); for Noto Sans Regular:

java.lang.UnsupportedOperationException: coverage set class table not yet supported
    at org.apache.fontbox.ttf.advanced.GlyphClassTable$CoverageSetClassTable.<init>(GlyphClassTable.java:262)
    at org.apache.fontbox.ttf.advanced.GlyphClassTable.createClassTable(GlyphClassTable.java:95)
    at org.apache.fontbox.ttf.advanced.AdvancedTypographicTableReader.readGDEFMarkGlyphsTableFormat1(AdvancedTypographicTableReader.java:3371)
    at org.apache.fontbox.ttf.advanced.AdvancedTypographicTableReader.readGDEFMarkGlyphsTable(AdvancedTypographicTableReader.java:3384)
    at org.apache.fontbox.ttf.advanced.AdvancedTypographicTableReader.readGDEF(AdvancedTypographicTableReader.java:3447)
    at org.apache.fontbox.ttf.advanced.AdvancedTypographicTableReader.read(AdvancedTypographicTableReader.java:136)
    at org.apache.fontbox.ttf.advanced.GlyphDefinitionTable.read(GlyphDefinitionTable.java:105)
    at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:399)
    at org.apache.fontbox.ttf.TrueTypeFont.getTable(TrueTypeFont.java:183)
    at org.apache.fontbox.ttf.OpenTypeFont.getGDEF(OpenTypeFont.java:123)
    at org.apache.fontbox.ttf.advanced.AdvancedTypographicTableReader.initializeGPOS(AdvancedTypographicTableReader.java:3551)
    at org.apache.fontbox.ttf.advanced.AdvancedTypographicTableReader.readGPOS(AdvancedTypographicTableReader.java:3501)
    at org.apache.fontbox.ttf.advanced.AdvancedTypographicTableReader.read(AdvancedTypographicTableReader.java:140)
    at org.apache.fontbox.ttf.advanced.GlyphPositioningTable.read(GlyphPositioningTable.java:106)
    at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:399)
    at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:187)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:164)
    at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:91)
    at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:27)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:101)
    at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:79)
    at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:27)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:86)
    at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:73)
    at org.apache.pdfbox.examples.pdmodel.AdvancedTextLayoutSequencesDin91379.testAdvancedLayout(AdvancedTextLayoutSequencesDin91379.java:199)
    at org.apache.pdfbox.examples.pdmodel.AdvancedTextLayoutSequencesDin91379.main(AdvancedTextLayoutSequencesDin91379.java:72)

vk-github18 commented 2 years ago

Trying to load DejaVuSans with OTFParser results in:

java.lang.UnsupportedOperationException: TTF fonts do not have a CFF table
    at org.apache.fontbox.ttf.OpenTypeFont.getCFF(OpenTypeFont.java:73)
    at org.apache.fontbox.ttf.OpenTypeFont.getPath(OpenTypeFont.java:92)
    at org.apache.pdfbox.pdmodel.font.TrueTypeEmbedder.createFontDescriptor(TrueTypeEmbedder.java:251)
    at org.apache.pdfbox.pdmodel.font.TrueTypeEmbedder.<init>(TrueTypeEmbedder.java:75)
    at org.apache.pdfbox.pdmodel.font.PDCIDFontType2Embedder.<init>(PDCIDFontType2Embedder.java:76)
    at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:116)
    at org.apache.pdfbox.pdmodel.font.PDType0Font.load(PDType0Font.java:192)
    at org.apache.pdfbox.examples.pdmodel.AdvancedTextLayoutSequencesDin91379.testAdvancedLayout(AdvancedTextLayoutSequencesDin91379.java:207)
    at org.apache.pdfbox.examples.pdmodel.AdvancedTextLayoutSequencesDin91379.main(AdvancedTextLayoutSequencesDin91379.java:75)

vk-github18 commented 2 years ago

I added a small test to https://github.com/vk-github18/pdfbox (examples/src/main/java/org/apache/pdfbox/examples/pdmodel/AdvancedTextLayoutSequencesDin91379.java) that compares the computed layout vector and the glyph rendering of Java2D and AdvancedTextLayout for some fonts. The layout vectors are nearly identical (after accounting for a factor of 50). The rendering, surprisingly, differs.
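A comparison of this kind could look roughly like the sketch below. It is not the code from the linked test: the helper name, the tolerance, the sample values and the direction in which the factor is applied are all assumptions made for illustration.

```java
public class LayoutVectorCompare {

    /**
     * Returns true if every entry of a, multiplied by scale, lies within
     * tolerance of the corresponding entry of b.
     */
    static boolean nearlyEqual(double[] a, double[] b, double scale, double tolerance) {
        if (a.length != b.length) {
            return false;
        }
        for (int i = 0; i < a.length; i++) {
            if (Math.abs(a[i] * scale - b[i]) > tolerance) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample values, not measurements from either layout engine.
        double[] first = {0.0, 13.0, 13.2};
        double[] second = {0.0, 650.0, 660.5};
        System.out.println(nearlyEqual(first, second, 50.0, 1.0));
    }
}
```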

vk-github18 commented 2 years ago

The error "java.lang.UnsupportedOperationException: coverage set class table not yet supported" is solved by applying the following FOP patch: https://github.com/apache/xmlgraphics-fop/commit/551007e7e0f14b85dfd8f33d3cf8a4e1635c09cd

vk-github18 commented 2 years ago

I did some prototyping based on your branch of PDFBox, see https://github.com/vk-github18/pdfbox. The resulting positioning looks good. Only when one base letter has two combining diacritics is the second one positioned wrongly. In this case the positioning information in the layout vector is wrong. I will clean this up and prepare a pull request for danfickle/pdfbox in the next few days.

image

vk-github18 commented 2 years ago

You can find the pull request at https://github.com/danfickle/pdfbox/pull/2. I also opened a pull request for PDFBox, see https://github.com/apache/pdfbox/pull/143, and started a discussion at https://issues.apache.org/jira/projects/PDFBOX/issues/PDFBOX-4951?filter=allopenissues