briandfoy / pod-perldoc

Translate from Perl's Pod doc format to other formats
https://metacpan.org/pod/Pod::Perldoc

RT 111261: E<ouml> not rendering properly #53

Open briandfoy opened 11 months ago

briandfoy commented 11 months ago

From mperry@cpan.org in https://rt.cpan.org/Ticket/Display.html?id=111261


When I run "perldoc File::Spec" the resulting documentation shows Andreas König's name as "Andreas Knig".

I've verified this result using Perldoc v3.25 with Perl 5.22.0 on CentOS 6.7 and Cygwin. LANG is set to en_US.UTF-8 and TERM is xterm.

When I switch back to the system perl on CentOS the text is correctly rendered as "Andreas Koenig". That's using Perldoc v3.14_04 with Perl 5.10.1.

I was able to get the correct behavior with Perldoc v3.25 if I added "=encoding utf8" to the POD in File::Spec. However, I don't know if that's the right place to fix it or if it should be fixed in Pod::Text or in Perldoc.


From Slaven:

Also broken, with the same effect (output of "") ("ö" is here encoded as latin1):

=encoding iso-8859-1

Andreas König

=cut

rra commented 7 months ago

The differing behavior between the two versions of Perl is, I believe, because the newer Perl uses Pod::Text for rendering by default and the older Perl uses Pod::Man.

Pod::Man used to try very hard to produce pure ASCII output for maximum portability, and applied a German-specific transformation of ö to oe that is correct in some languages but not in others. This behavior changed in version 5.00 (included in Perl 5.38), which outputs UTF-8 by default, since all modern nroff implementations appear to handle it correctly, or at least no worse than the old behavior.

With Pod::Text::Termcap, which I think is currently the perldoc default (although with 5.00 out, the way is now clear for returning to Pod::Man as a default if desired), the situation is more complicated. The encoding rules used by Pod::Text are as follows:

  1. If a PerlIO encoding layer is set on the output file handle, do not do any output encoding and instead rely on the PerlIO encoding layer.
  2. If the encoding or utf8 options are set, use the output encoding specified by those options.
  3. If the input encoding of the POD source file was explicitly specified (using =encoding) or automatically detected by Pod::Simple, use that as the output encoding as well.
  4. Otherwise, if running on a non-EBCDIC system, use UTF-8 as the output encoding. Since this is a superset of ASCII, this will result in ASCII output unless the POD input contains non-ASCII characters without declaring or autodetecting an encoding (usually via E<> escapes).
  5. Otherwise, for EBCDIC systems, output without doing any encoding and hope this works.
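
The precedence in the five rules above can be sketched as a small decision function. This is a Python illustration only, not Pod::Text's actual code: the function name and its parameters are hypothetical, and the relative order of the encoding and utf8 options within rule 2 is an assumption, since the rules do not say which one wins.

```python
from typing import Optional


def choose_output_encoding(
    has_perlio_layer: bool,
    encoding_option: Optional[str],
    utf8_option: bool,
    input_encoding: Optional[str],
    is_ebcdic: bool,
) -> Optional[str]:
    """Return the encoding to apply to output, or None for "do not encode".

    All names here are hypothetical; they only mirror the five rules.
    """
    # Rule 1: an encoding layer on the output handle does the work itself.
    if has_perlio_layer:
        return None
    # Rule 2: explicit options win (assumption: encoding is checked before utf8).
    if encoding_option:
        return encoding_option
    if utf8_option:
        return "UTF-8"
    # Rule 3: reuse the declared or autodetected input encoding.
    if input_encoding:
        return input_encoding
    # Rule 4: non-EBCDIC systems default to UTF-8.
    if not is_ebcdic:
        return "UTF-8"
    # Rule 5: EBCDIC systems get unencoded output.
    return None


# A POD file declared (or defaulted) as ISO-8859-1, with no options set,
# falls through to rule 3 regardless of the user's locale.
print(choose_output_encoding(False, None, False, "ISO-8859-1", False))
```

Note that nothing in this chain consults the locale, which is the gap discussed below.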

I believe the bug reporter is affected by rule 3: because the input file specifies an ISO-8859-1 encoding (or none at all, and Pod::Simple uses ISO-8859-1 by default), the output preserves that encoding.
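
To illustrate why ISO-8859-1 output would show up as "Knig" on a UTF-8 terminal, here is a small Python sketch of the byte-level mismatch. It assumes the terminal silently drops bytes that are not valid UTF-8, which matches the reported symptom; other terminals may show a replacement character instead.

```python
name = "Andreas König"

# What rule 3 would emit for an ISO-8859-1 source: ö becomes the single byte 0xF6.
latin1_bytes = name.encode("iso-8859-1")

# What a UTF-8 terminal expects: ö as the two bytes 0xC3 0xB6.
utf8_bytes = name.encode("utf-8")

# 0xF6 is never a valid start byte in UTF-8, so a UTF-8 consumer
# either drops it or substitutes U+FFFD.
shown_dropped = latin1_bytes.decode("utf-8", errors="ignore")
shown_replaced = latin1_bytes.decode("utf-8", errors="replace")

print(latin1_bytes)    # b'Andreas K\xf6nig'
print(utf8_bytes)      # b'Andreas K\xc3\xb6nig'
print(shown_dropped)   # Andreas Knig
```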

They are quite reasonably expecting the output to honor their locale, which is set to a UTF-8 locale, but Pod::Text is not locale-aware for multiple reasons. I'm not sure if the necessary modules to figure out an output encoding based on their locale are available in Perl core so that they could easily be used by perldoc.

Honoring the locale is probably the right fix, but may be complicated. Switching to UTF-8 by default and providing a flag or some other mechanism to select an output encoding may be a reasonable fallback at this point, now that UTF-8 usage is fairly widespread. That would require plumbing the encoding option of Pod::Text and its subclasses through to perldoc and setting a default value.
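
As a rough sketch of what "honor the locale, else fall back to UTF-8, with an override" could look like, here is a Python illustration. It is not perldoc code: the function names and the override parameter are hypothetical, and it only parses the codeset out of POSIX-style locale names (language[_territory][.codeset][@modifier]) rather than querying the C library.

```python
import codecs
from typing import Mapping, Optional


def codeset_from_locale(value: str) -> Optional[str]:
    """Extract the codeset part of a POSIX locale name, if there is one."""
    value = value.split("@", 1)[0]  # drop any @modifier
    if "." not in value:
        return None
    return value.split(".", 1)[1] or None


def pick_output_encoding(
    env: Mapping[str, str],
    override: Optional[str] = None,
    default: str = "UTF-8",
) -> str:
    """Hypothetical selection: explicit override, then locale, then default."""
    if override:
        return override
    # POSIX precedence: the first non-empty variable decides.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if not value:
            continue
        codeset = codeset_from_locale(value)
        if codeset is None:
            return default  # e.g. "C": no codeset named, use the fallback
        try:
            codecs.lookup(codeset)  # reject names we cannot encode to
        except LookupError:
            return default
        return codeset
    return default


print(pick_output_encoding({"LANG": "en_US.UTF-8"}))  # UTF-8
```

With the reporter's environment (LANG set to en_US.UTF-8) this selects UTF-8, and an unset or unrecognized locale falls back to the same default.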

Alternatively, perldoc could switch back to using Pod::Man by default, which is likely to already be locale-aware and may even be able to do character set transformations from the UTF-8 input that Pod::Man produces to the user's locale (although I have not tested this). I think the switch to Pod::Text::Termcap was originally because Pod::Man was awful with non-ASCII POD source, but that should be fixed in 5.00.

briandfoy commented 7 months ago

Thanks for the analysis. This might be fixed by #36 once that lands. (@xsawyerx)