PhilterPaper / Perl-PDF-Builder

Extended version of the popular PDF::API2 Perl-based PDF library for creating, reading, and modifying PDF documents
https://www.catskilltech.com/FreeSW/product/PDF%2DBuilder/title/PDF%3A%3ABuilder/freeSW_full

Text handling oddities #193

Open PhilterPaper opened 1 year ago

PhilterPaper commented 1 year ago

I just ran across something odd with TrueType fonts ($pdf->ttfont(...)). It appears that word spacing ($text->wordspace(n)) is ignored for TrueType fonts. The PDF::Builder call itself seems to work OK, leaving an "n Tw" command in the stream. However, Adobe Acrobat Reader seems to ignore the Tw command -- I need to find some other readers to test on. The character spacing command "n Tc" ($text->charspace(n)) appears to work properly for TrueType fonts.

I tested with corefonts and psfonts (Type 1 fonts) and they both work properly with both word and character spacing. I wonder if the problem is that Tw is implemented to look for an ASCII space (x20) only, to adjust its size, and misses the boat on the glyph ID hex codes used with ttfonts? Certainly, the hex code for a space glyph can vary widely!

I need to find out if this is something peculiar to Adobe, or if it's widespread. Either way, the wordspace() method's limitation will have to be documented. I haven't checked yet to see if the order of commands matters.

Add: A workaround for this, assuming it isn't a bug in PDF::Builder itself, would be to output words individually, using some multiplier on the actual space width:

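# close up a sentence with 40% width spaces, for a TTF font (in lieu of wordspace)
# ($starting_x and $y, the baseline of the line, are assumed to be set already)
my $ws = $text->advancewidth(' ') * 0.4;
my $phrase = 'The';
my $x = $starting_x;
my $w = $text->advancewidth($phrase);
$text->translate($x, $y);  # move to the start of this word
$text->text($phrase);      # outputs the word
$x += $w + $ws;
$phrase = 'New';
$w = $text->advancewidth($phrase);
$text->translate($x, $y);
$text->text($phrase);
$x += $w + $ws;
... etc. ...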

There might be more elegant ways, if I think about it for a bit. And of course, it could be a loop to split up a single run of words and spaces, or even a built-in method to do this. Something like this may need to be added to all the text output methods, including column(). I'd appreciate hearing from others if they've also seen this problem, and suggestions on what to do about it. Is there a mechanism for reporting this to Adobe? The Reader might not know which glyph corresponds to a space, but it could potentially see a character with no ink (not just x20) and apply a multiplier to it if Tw is in use.
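The loop version of that workaround could look roughly like the sketch below. It is only an illustration: word_positions() and the $width_of callback are hypothetical names, and the fixed 5-points-per-character metric stands in for a real font so that the block runs without PDF::Builder installed.

```perl
use strict;
use warnings;

# Compute where each word of $sentence should start when every ASCII space
# (x20) is replaced by a gap of $factor times the natural space width.
# $width_of is a callback returning the width of a string in points; with
# PDF::Builder it could be sub { $text->advancewidth($_[0]) }.
sub word_positions {
    my ($sentence, $start_x, $factor, $width_of) = @_;
    my $gap = $width_of->(' ') * $factor;
    my $x   = $start_x;
    my @placed;
    # the -1 limit keeps trailing empty fields, so every space yields a gap
    for my $word (split /\x20/, $sentence, -1) {
        push @placed, [ $x, $word ] if $word ne '';
        $x += $width_of->($word) + $gap;
    }
    return @placed;
}

# Stand-in metric for demonstration only: every character is 5 points wide.
my $fake_width = sub { 5 * length $_[0] };

for my $p (word_positions('The New York', 72, 0.4, $fake_width)) {
    printf "%-5s starts at x = %.1f\n", $p->[1], $p->[0];
}
```

With a real text object, each pair would be placed with $text->translate($x, $y) followed by $text->text($word), where $y is the baseline of the line.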

PhilterPaper commented 1 year ago

I learned something else today about fonts. While it's true that Linux etc. variants place their fonts in all sorts of locations, Windows isn't as pure as I thought it was. When you add a new font, say, by dragging and dropping a .ttf file into \Windows\Fonts, there's no guarantee that it will end up there! Its name also will often be changed. This knowledge is important for knowing the font path and file name for using a TrueType font.

To find out where your TTF or OTF file ended up, if you don't see an obvious entry in \Windows\Fonts, you need to look in \Users\XXXX\AppData\Local\Microsoft\Windows\Fonts, depending on what user you were signed on as when you installed the font. Even then, you may not be done, as the name may have been changed to something unrecognizable. You may need to look at Windows' mapping of font name to filename.

In the command shell (command line), or whatever equivalent you like to use, enter "regedit" to bring up the registry editor. For the top level, choose (click on) either HKEY_LOCAL_MACHINE (for global font settings, in \Windows\Fonts) or HKEY_CURRENT_USER (for fonts installed by whoever is currently signed on, in \Users\XXXX\AppData...). From there, both have the same path: SOFTWARE > Microsoft > Windows NT > CurrentVersion > Fonts. This should bring up a listing of all the installed fonts (full name, e.g. "Papyrus Regular") and their actual filename ("PAPYRUS.TTF"). For instance, I just installed a blackletter "Gothic" font English Towne Medium. It ended up in the \Users\Phil... directory as EnglishTowne.ttf.

You don't need to change anything in the registry, just look. You do have the capability to change things, including hiding/showing the font, if you care to get into those things.

Anyway, this should give you the information you need to get the proper path and file name for TTF fonts you install (and even those that come with Windows). Other font types don't seem to jump through these hoops. At some point, this should probably go into the ttfont() method documentation, and perhaps a mention in FontManager.
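If you would rather not click through regedit, the same keys can be dumped from the command line with reg query and the listing scanned by a script. The sketch below only parses such a listing. It assumes each value line has the layout "name, REG_SZ, data"; parse_font_listing() is a hypothetical helper, and the sample text is made up from the names mentioned above rather than copied from a real machine.

```perl
use strict;
use warnings;

# Parse text in the (assumed) layout printed by
#   reg query "HKCU\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Fonts"
# into a hash of registry font name => file name (or path).
sub parse_font_listing {
    my ($listing) = @_;
    my %file_for;
    for my $line (split /\r?\n/, $listing) {
        # value lines look like: <indent>Name<spaces>REG_SZ<spaces>Data
        next unless $line =~ /^\s+(.+?)\s+REG_SZ\s+(.+?)\s*$/;
        $file_for{$1} = $2;
    }
    return %file_for;
}

# Made-up sample for illustration; not real reg query output.
my $sample = <<'END';
HKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Fonts
    Papyrus Regular    REG_SZ    PAPYRUS.TTF
    English Towne Medium    REG_SZ    EnglishTowne.ttf
END

my %fonts = parse_font_listing($sample);
print "$_ => $fonts{$_}\n" for sort keys %fonts;
```

The file name found this way is what you would then hand to ttfont(), after joining it with the appropriate fonts directory when the value carries no path of its own.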

Credit: much of this information came from https://superuser.com/questions/1658678/detect-path-of-font-on-windows

mkl-public commented 1 year ago

As discussed on the Adobe Support Community site, this is a matter of the encoding the PDF creator uses for the font in question:

Word spacing shall be applied to every occurrence of the single-byte character code 32 in a string when using a simple font (including Type 3) or a composite font that defines code 32 as a single-byte code. It shall not apply to occurrences of the byte value 32 in multiple-byte codes. (ISO 32000-2:2020 section 9.3.3 Word spacing)

Thus, if you want to use the Tw instruction to manipulate the spacing between words, you have to use an encoding for your font which uses the single-byte 32 character code for the space glyph.

PhilterPaper commented 1 year ago

Regarding the Tw/wordspace issue, follow along on here. The bottom line (so far) is that Tw will never be supported when glyph IDs are used for TrueType fonts. Plus, it will only ever apply to ASCII spaces (x20), not to required blanks (xA0) or other kinds of spaces.

I will have to think about adding a hack to split up a $text->text($sentence) call into individual words, and place each one with an emulated space of adjusted width. Until then, the wordspace() method needs a warning.

  1. Should this be done for all flavors of Unicode space? PDF's Tw is hard-wired to handle only ASCII space (x20), so required blanks/non-breaking spaces and various sizes of spaces could be proportionately adjusted. There could certainly be an option to apply only to x20 and xA0. Maybe xA0 should be changed to x20 anyway?
  2. Should this be done for all font types, and not just TTF? If so, non-ASCII spaces would all be handled the same way, and PDF would never see an ASCII space character (unless wordspace is set to 0). I would have to query the font type, if not.
  3. Presumably this should be built in to all text output routines (I think they all eventually come to $text->text()), including the new column(). They would have to check if the Tw value requested is non-zero, before going through all the bother.
  4. If column() supports it, I will need a new fake-HTML tag and/or CSS to change Tw (and Tc) on the fly (as well as recognizing their being set upon entry).
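To make question 1 concrete, here is a rough sketch of how the choice of "which spaces count" might be expressed as an option. The mode names and tokenize_spaces() are hypothetical, not part of PDF::Builder; the output is a list of tokens that a text routine could then emit one by one, giving each space token its own adjusted width.

```perl
use strict;
use warnings;

# Hypothetical option values: which characters count as an adjustable space.
my %SPACE_CLASS = (
    ascii      => qr/\x20/,         # what PDF's own Tw handles
    ascii_nbsp => qr/[\x20\xA0]/,   # x20 plus the required blank
    all        => qr/\p{Zs}/,       # every Unicode space separator
);

# Split $str into [is_space, chunk] tokens. Each space character becomes
# its own token so that it can be given its own width later.
sub tokenize_spaces {
    my ($str, $mode) = @_;
    my $space = $SPACE_CLASS{$mode} or die "unknown mode '$mode'";
    my @tokens;
    for my $chunk (split /($space)/, $str) {
        next if $chunk eq '';
        push @tokens, [ ($chunk =~ /\A$space\z/ ? 1 : 0), $chunk ];
    }
    return @tokens;
}

# "a b<NBSP>c": the required blank only splits in the wider modes.
for my $mode (qw(ascii ascii_nbsp all)) {
    my @t = tokenize_spaces("a b\x{A0}c", $mode);
    printf "%-10s -> %d tokens\n", $mode, scalar @t;
}
```

A routine built on this would skip the whole exercise when the requested word spacing is zero, as noted in item 3.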

PhilterPaper commented 1 year ago

Thus, if you want to use the Tw instruction to manipulate the spacing between words, you have to use an encoding for your font which uses the single-byte 32 character code for the space glyph.

I don't know why you keep insisting (here and on the Adobe forum) that I am using a multibyte character encoding for the text. It's not. The original "space" is a single byte x20. For TTF support in PDF::Builder, the Reader is presented with a list of glyph IDs, which will vary by the particular font being used. A 'space' (x20) may end up 0003 in one font file and 00b7 in another. If the Reader is searching for an actual byte of x20, it ain't gonna find it. This is a limitation of the Reader implementation, in that it doesn't go looking for inkless glyphs (spaces) when presented with a glyph ID list rather than a text string (where a space is x20). My complaint is that I don't see this limitation documented, except in a very round-about way.

mkl-public commented 1 year ago

I don't know why you keep insisting (here and on the Adobe forum) that I am using a multibyte character encoding for the text. It's not. The original "space" is a single byte x20.

You misunderstand what the PDF specification means when it talks about multibyte character codes.

It does not talk about the character encoding you use in your application before you transform some text strings into content streams. It doesn't care what encoding you use in your application code.

What it talks about is what you eventually store in the strings (literal or hexadecimal) in the content streams. And as you use Identity-H as font encoding, you store double-byte codes there.

With this misunderstanding cleared up, the excerpt from the specification I quoted above requires a PDF viewer to operate like Adobe Acrobat does in this regard, and it does so in a clear way.
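A small sketch may make the distinction easier to see. It compares the bytes that end up in the content stream for a simple font with those for Identity-H. The glyph IDs are invented for illustration (0x0003 stands in for the space glyph), tw_adjustable_codes() is a hypothetical helper, and it only models fonts whose codes all have the same width, not CMaps that mix one-byte and two-byte codes.

```perl
use strict;
use warnings;

# Count the character codes that Tw would widen in a content-stream string.
#   $bytes          : the string bytes as stored in the content stream
#   $bytes_per_code : 1 for a simple font, 2 for Identity-H
sub tw_adjustable_codes {
    my ($bytes, $bytes_per_code) = @_;
    # a byte value of 32 inside a multi-byte code does not count
    return 0 if $bytes_per_code != 1;
    my $count = () = $bytes =~ /\x20/g;
    return $count;
}

# "The New" in a simple font: one byte per character, space is code 32.
my $simple = 'The New';

# The same text under Identity-H: each code is a two-byte glyph ID.
# These glyph IDs are made up; a real font assigns its own.
my @gids       = (0x0037, 0x004B, 0x0048, 0x0003, 0x0031, 0x0048, 0x005A);
my $identity_h = pack 'n*', @gids;

printf "simple font: %d code(s) affected by Tw\n", tw_adjustable_codes($simple, 1);
printf "Identity-H : %d code(s) affected by Tw\n", tw_adjustable_codes($identity_h, 2);
```

The application-side string contains an x20 in both cases; what differs is whether a single-byte code 32 survives into the content stream.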

PhilterPaper commented 1 year ago

I have updated PDF::Builder to honor the Tw setting when using a TrueType font. This will hit CPAN with the 3.026 release. It splits out x20 ASCII spaces and gives them their own kerning, to adjust their width. Note that $text->textHS() and $text->advancewidthHS() (both for HarfBuzz::Shaper use) do not yet (?) honor Tw. Perhaps in the future...

Add: PDF::Builder 3.026 has been out for a while now (and 3.027 soon). Note that only x20 (regular space) is handled, not xA0 Required Space (yet) or any various-width spaces (if they have their own Unicode codepoint). Also, this is only within regular PDF::Builder processing, and not the HarfBuzz routines (yet).

PhilterPaper commented 3 weeks ago

@mcitterio, I see you have ssimms/pdfapi2/issues/81 open for this problem. You might give PDF::Builder a try, as I believe I have fixed the issue here.

mcitterio commented 3 weeks ago

I gave it a try. I replaced PDF::API2 with PDF::Builder in my code and it worked with these issues:

PS: Actually, my code using PDF::API2 and some patches I made outputs a fully compliant PDF/A-3B, tested with veraPDF, with logos, TTF fonts, justification, RGB output intents, file attachments, and digital signatures.

PhilterPaper commented 3 weeks ago
mcitterio commented 3 weeks ago

Try this! Adapt it to your system and find the Liberation Sans font. The core issue is: the second time you call $page->text, paragraph seems to use the wrong line width. If you comment out the first call to $page->text, everything works fine.

somefile.pdf

    use PDF::Builder;

    my $filepath = "somefile.pdf";
    my $pdf      = PDF::Builder->new();
    my $font     = $pdf->font('LiberationSans-Regular.ttf');
    my $clrblack = '#000';
    my $page     = $pdf->page();
    $page->size('A4');  # 595 x 842 points

    my $content1 = $page->text();
    ## do something with $content1 or not, for example:
    #$content1->fill_color($clrblack);
    #$content1->textlabel(297.5, 820, $font, 10, "some text in arbitrary place", align => 'center');

    my $content2 = $page->text();
    $content2->fill_color($clrblack);
    $content2->font($font, 10);
    $content2->leading(12);
    $content2->position(30, 720);
    $content2->text("Title", align => 'left');
    $content2->crlf();
    my $textbody = "Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua. ".
                   "Ut enim ad minim veniam, quis nostrum exercitationem ullamco laboriosam, nisi ut aliquid ex ea commodi consequatur. Duis ".
                   "aute irure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat ".
                   "cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit ".
                   "amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.";
    my ($overflow, $height) = $content2->paragraph($textbody, 530, 60, align => 'justified');
    $pdf->saveas($filepath);

mcitterio commented 3 weeks ago

* Does your patch emulate the **Tw** command, which is ignored for TTF/OTF when glyph IDs are given instead of a text string? If you found a case or two that shows incorrect justification, please feel free to open a ticket and show the examples.

Let's see what happens in PDF::API2::Resource::CIDFont->text.

You will find this code:

...
        if (defined($indent) and $indent != 0) {
            return "[ $indent $newtext ] TJ";
        }
        else {
            return "$newtext Tj";
        }
...

The first syntax, "$newtext Tj", can be justified using wordspace, but only with a single-byte coded font (not TTF or OTF).

The second, "[ $indent $newtext ] TJ", can be justified by computing an indent between every word:
my $indent = $self->advancewidth(' ') + ($width - $self->advancewidth($line)) / $space_count;

The first word has to be output without an indent, and every space character has to be replaced by splitting the line there.

mcitterio commented 3 weeks ago

The same syntax, for example [ (This) 120 (is) 120 (a) 120 (justified) 120 (text) 120 (example)] TJ, is used by Acrobat Pro when you choose to make a justified text block (in a real file the ASCII characters would be replaced by Identity-H codes).
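As a rough sketch of how such a TJ array could be generated, the hypothetical helper below spreads a line of words evenly across a target width. It uses literal strings as in the example above, takes the word widths as plain numbers instead of calling advancewidth, and ignores horizontal scaling. One detail worth checking against the spec: the numbers in a TJ array are in thousandths of a text space unit and are subtracted from the current position, so opening up a gap takes a negative number.

```perl
use strict;
use warnings;

# Build a TJ operation that spreads the words across $target_width.
#   $words        : array ref of words (no spaces)
#   $widths       : array ref of each word's width in points at $font_size
#   $target_width : desired line width in points
sub justified_tj {
    my ($words, $widths, $font_size, $target_width) = @_;
    # escape the characters that are special inside a PDF literal string
    my @strings = map { my $s = $_; $s =~ s/([\\()])/\\$1/g; "($s)" } @$words;
    my $gaps = @strings - 1;
    return "[ @strings ] TJ" if $gaps < 1;

    my $natural = 0;
    $natural += $_ for @$widths;
    # every gap gets an equal share of whatever the words leave free
    my $gap_pts = ($target_width - $natural) / $gaps;
    # convert points to thousandths of the font size; negative moves right
    my $adjust = sprintf '%.0f', -1000 * $gap_pts / $font_size;
    return '[ ' . join(" $adjust ", @strings) . ' ] TJ';
}

# Three words measuring 20, 8 and 5 points at 10 point size, set to 53 points.
print justified_tj([qw(This is a)], [20, 8, 5], 10, 53), "\n";
```

Because no space characters are emitted at all, this approach does not depend on how the font encodes its space glyph.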

mcitterio commented 3 weeks ago

So Tw is a thing of the past; better to use the TJ syntax.

See the PDF 1.4 spec, section 5.3.2, or the ISO 32000 (PDF 1.7) spec, section 9.4.3.

mcitterio commented 3 weeks ago

I took a look at PDF::Builder::Resource::CIDFont->text

and at the PDF::Builder::Page->graphics documentation:

It is possible to use multiple graphics objects, to avoid having to change
settings constantly, but you may want to consider resetting all your settings 
at the first call to each object, so that you are starting from a known base.
This may most easily be done by using $type->restore() and ->save() just
after creating $type:

    $text1 = $page->text(); 
      $text1->save();
    $grfx1 = $page->gfx();
      $grfx1->restore();
      $grfx1->save();
    $text2 = $page->text();
      $text2->restore();
      $text2->save();
    $grfx2 = $page->gfx();
      $grfx2->restore();

I will try this.

mcitterio commented 3 weeks ago

No, that does not solve it. Justification is correct only for the first text object on each new page.

So if you have:

$page = $pdf->page();
$textjust = $page->text();
$gxf = $page->graphics();
$textother = $page->text();

justified text will work with $textjust, but not with $textother.

Hope this helps, and sorry for my verbosity.

PhilterPaper commented 3 weeks ago

I did have text full justification working, but something seems to have happened to the code. I will find it and fix it (or put it back in) today or tomorrow.

PhilterPaper commented 2 weeks ago

I have fixed the code I put in January 2023 for text justification when using TTF/OTF fonts, where the Tw command is emulated by adding to inter-word spaces. It looks like a later change prevented the word-spacing and font-size settings from getting passed to CIDFont.pm, which does the text output for TTF/OTF. Give it a try.