dealfonso / sapp

Simple and Agnostic PDF Document Parser in PHP - sign PDF docs using PHP
GNU Lesser General Public License v3.0

UTF-8 Support #78

Open krupong opened 1 month ago

krupong commented 1 month ago

I am testing the signature -> set_metadata_props feature, but the text is not shown correctly. My signing reason is "ทดสอบ".

[Screenshot: Screenshot_20240729_143316]

Does it support UTF-8 encoding? Thank you.

erikn69 commented 1 month ago

Try #79

krupong commented 1 month ago

Hello, I tried #79.

It truncates some characters: for example, "ภาษาไทย" comes back as "ภา".

[image]

So I changed the line from:

return "\xFE\xFF" . mb_convert_encoding($string, 'UTF-16BE', $encoding);

to:

return "\xEF\xBB\xBF".mb_convert_encoding($string, 'UTF-8', $encoding);

Now it is shown correctly. [image]

Thank you.
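A possible explanation for the truncation, offered as a sketch and not as a confirmed diagnosis of #79: in UTF-16BE, "ษ" (U+0E29) is encoded as the bytes 0E 29, and 0x29 is ")", which ends a PDF literal string unless it is escaped. That would match the cut right after "ภา". The helper below is hypothetical (pdf_literal_text_string is not part of sapp) and assumes the mbstring extension is available. It converts to UTF-16BE with a BOM and then escapes the bytes that are special inside a literal string.

```php
<?php
// Hypothetical helper for illustration; not part of sapp's API.
// Builds a PDF literal string "( ... )" holding UTF-16BE text with a BOM.
function pdf_literal_text_string(string $text, string $encoding = 'UTF-8'): string
{
    // Same conversion as in the thread: BOM FE FF followed by UTF-16BE bytes.
    $utf16 = "\xFE\xFF" . mb_convert_encoding($text, 'UTF-16BE', $encoding);

    // Escape bytes that are special inside a literal string. strtr() with an
    // array does not rescan its own replacements, so backslashes added here
    // are not escaped a second time.
    $escaped = strtr($utf16, [
        "\\" => "\\\\", // backslash
        "("  => "\\(",  // 0x28, e.g. low byte of U+0E28
        ")"  => "\\)",  // 0x29, e.g. low byte of U+0E29
        "\r" => "\\r",  // 0x0D would otherwise be read as a line ending
    ]);

    return '(' . $escaped . ')';
}

// Usage: the value that would follow /Reason in the signature dictionary.
$reason = pdf_literal_text_string("\u{0E20}\u{0E32}\u{0E29}\u{0E32}\u{0E44}\u{0E17}\u{0E22}");
```

With the escaping in place, the 0x29 byte inside "ษ" is written as the two bytes 5C 29, so a reader no longer sees it as the end of the string.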

erikn69 commented 1 month ago

So I've change from : return "\xFE\xFF" . mb_convert_encoding($string, 'UTF-16BE', $encoding); TO : return "\xEF\xBB\xBF".mb_convert_encoding($string, 'UTF-8', $encoding);

With that change, I get this:

[image]

dealfonso commented 1 month ago

What about using a custom encoded string when setting the metadata?

erikn69 commented 1 month ago

What about using a custom encoded string when setting the metadata?

That would work, but there would be the problem that every time someone doesn't know that they should do their own encoding, they will have problems and open a new issue.

erikn69 commented 1 month ago

@dealfonso One question, if the file says ANSI in the encoding, and the reason is in UTF-8 or another encoding, wouldn't this problem occur?

Look, I sent UTF-8 and it doesn't work

/Reason(ภาษาไทย)/Location(sdfs ó í í)>>

But it does work when I send ISO-8859-1:

/Reason(ó í í {} ` ~)/Location(sdfs ó í í)>>

dealfonso commented 1 month ago

Honestly, I have not considered this topic before. A quick search on Google [1] suggests that PDF does not define one general character encoding. Instead, the encoding depends on the font, so the same character code can be rendered differently from one font to another.

I don't know how this applies to the reason and so on.

That is why my "quick answer" is that PDF does not support UTF-8, and so users need to encode the characters according to their needs.

I'll read more about character encoding in the metadata. Do you have any source of info to read?

[1] https://www.gnostice.com/nl_article.asp?id=383&t=Font_and_Encoding_Standard_types_supported_in_PDF_for_the_representation_of_text_content

erikn69 commented 1 month ago

It considers that the encoding depends on the font, and depending on the font, the same character will show a representation or another

But that applies to text content; metadata doesn't use fonts.
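For context, the PDF specification (ISO 32000-1, text string type) says that text strings in dictionaries such as the document information dictionary are either PDFDocEncoding or UTF-16BE with a leading FE FF byte order mark; PDF 2.0 additionally allows UTF-8 with a leading EF BB BF. That may be why the UTF-8 variant displays correctly in some viewers and not in others. The sketch below shows how a reader could pick the encoding from the first bytes. The function name is hypothetical, mbstring is assumed, and the ISO-8859-1 fallback is only an approximation of PDFDocEncoding, which differs for some byte values.

```php
<?php
// Hypothetical helper for illustration; not part of sapp's API.
// Decodes the raw bytes of a PDF text string (already unescaped) to UTF-8.
function decode_pdf_text_string(string $bytes): string
{
    // FE FF: UTF-16BE with byte order mark.
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return mb_convert_encoding((string) substr($bytes, 2), 'UTF-8', 'UTF-16BE');
    }

    // EF BB BF: UTF-8 with byte order mark (PDF 2.0 only).
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return (string) substr($bytes, 3);
    }

    // No marker: PDFDocEncoding. ISO-8859-1 is used here as an
    // approximation (an assumption of this sketch, not an exact mapping).
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}
```

Under this rule, raw UTF-8 bytes without a marker are read one byte at a time, which would explain why the unmarked UTF-8 reason does not display as intended while the ISO-8859-1 one does.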

erikn69 commented 1 month ago

I tried FPDF, and it works with UTF-8:

/Keywords (þÿ Ì + ^ ì ò Ò ê)

But it doesn't work here. See https://github.com/Setasign/FPDF/blob/0838e0ee4925716fcbbc50ad9e1799b5edfae0a0/fpdf.php#L1169C1-L1189C2

krupong commented 1 month ago

I tried signing with TCPDF, and it works with UTF-8 too. When opened in VS Code:

[Screenshot 2567-07-31 at 11 24 39]

When signed with sapp, the reason seems to be stored as plain text: [Screenshot 2567-07-31 at 11 30 07]
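If writing the reason as a literal string proves fragile, one alternative is a PDF hexadecimal string, which is written between angle brackets and needs no escaping at all. This is a minimal sketch under the same assumptions as before (mbstring available, UTF-16BE with a BOM); pdf_hex_text_string is a hypothetical name, and whether sapp's writer accepts a pre-encoded hex string in this position has not been checked here.

```php
<?php
// Hypothetical helper for illustration; not part of sapp's API.
// Builds a PDF hexadecimal string "<...>" holding UTF-16BE text with a BOM.
function pdf_hex_text_string(string $text, string $encoding = 'UTF-8'): string
{
    $utf16 = "\xFE\xFF" . mb_convert_encoding($text, 'UTF-16BE', $encoding);

    // Every byte becomes two hex digits, so no byte can be mistaken for a
    // delimiter such as ")" or a line ending.
    return '<' . strtoupper(bin2hex($utf16)) . '>';
}
```

For example, the first three characters of "ภาษาไทย" would be written as <FEFF0E200E320E29>.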