boazsegev / combine_pdf

A Pure ruby library to merge PDF files, number pages and maybe more...
MIT License

How to avoid garbled characters in the PDF title when viewing the PDF in the Chrome browser? #192

Closed iToshk closed 3 years ago

iToshk commented 3 years ago

I set a Japanese title via combine_pdf like this: pdf.info[:Title] = "てすとdayoアイウエオ". However, in the browser it shows up like this: ㆦㆎㆨdayo㇢㇤㇦㇨㇪. Is there any way to avoid this?

boazsegev commented 3 years ago

Hi @iToshk ,

The PDF specifications don't support UTF-8. Multi-lingual documents are handled by using font mappings, where ANSI letters are mapped to the international glyphs.

However, the title information has no font and no mapping, so I think reader software reads the title string as ANSI letters (maybe adding a UTF-8 BOM will fix that, but I'm not remotely sure) ...

... anyway, this was the case before PDF 2.0. I have no idea what the new standard looks like because it isn't available for free.

I have no idea how to change that.

Kindly, Boaz Segev.

iToshk commented 3 years ago

@boazsegev Thanks for the answer!! I checked out your UTF-8 BOM suggestion (unfortunately, it didn't work :( ), and then I started thinking about a way to do this without setting the title through any gem... And I found a solution! I'm using the send_data method to send the data, and I found that it has a url_based_filename: false option. I tried it and now the Chrome browser is showing the filename! ;)

Best, iToshk

iToshk commented 3 years ago

Ah, sorry, url_based_filename: false didn't fix it. In the end I used the rubyXL gem to set the title.

My situation

I was dealing with a PDF created from an Excel file through libreconv.
1) Modify the Excel title with the rubyXL gem
↓
2) Convert it to PDF with the libreconv gem (the resulting PDF has a title without garbled characters)

The actual rubyXL method is here: https://github.com/weshatheleopard/rubyXL/issues/395#issuecomment-818500049

It's just FYI.

conorom commented 3 years ago

I'm not sure what will happen with Chrome's reading of the title metadata but it might be worth trying UTF-16 encoding when setting it.
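
A minimal sketch of that UTF-16 suggestion, for anyone who wants to try it. Before PDF 2.0, a text string such as the title can be written either in PDFDocEncoding or as UTF-16BE prefixed with the byte order mark FE FF, so the idea is to build those bytes by hand. The helper name utf16be_pdf_string is hypothetical and not part of combine_pdf. Whether combine_pdf writes these bytes out unchanged, and whether Chrome then shows the title correctly, is an assumption that was not verified in this thread.

```ruby
# Hypothetical helper (not part of combine_pdf): encode a title as a
# UTF-16BE byte string prefixed with the byte order mark FE FF.
def utf16be_pdf_string(text)
  bom = "\xFE\xFF".b                       # BOM as a binary string
  bom + text.encode(Encoding::UTF_16BE).b  # binary UTF-16BE payload
end

title = utf16be_pdf_string("てすとdayoアイウエオ")

# Round-trip check: drop the 2-byte BOM and decode back to UTF-8.
decoded = title.byteslice(2..-1)
               .force_encoding(Encoding::UTF_16BE)
               .encode(Encoding::UTF_8)
puts decoded

# Assumption: pdf is a CombinePDF::PDF object, as in the original question.
# pdf.info[:Title] = title
```

The assignment to pdf.info[:Title] is left commented out so the snippet runs without the gem installed. If the title still comes out garbled after that, the rubyXL and libreconv route described above is the approach that was reported to work.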