dagolden / Capture-Tiny

(Perl) Capture STDOUT and STDERR from Perl, XS or external programs
http://search.cpan.org/dist/Capture-Tiny/

Special case in sub _relayer() under Win32 #36

Closed klaus03 closed 8 years ago

klaus03 commented 8 years ago

I found what I think is a problem in Capture::Tiny under Win32. Please consider my patch.

There is one special case under Windows where the sub _relayer() needs improvement.

Basically, I am using binmode(STDOUT, ':unix:encoding(utf8):crlf'); for all my Windows Perl programs. Unfortunately, Capture::Tiny doesn't correctly restore that binmode(STDOUT, ':unix:encoding(utf8):crlf');

This patch resolves this particular problem and correctly restores binmode(STDOUT, ':unix:encoding(utf8):crlf'); for all Windows programs.

Here is the background story:

There is a longstanding bug in Windows that shows up as the last octet being repeated when Perl outputs a UTF-8 encoded string in cmd.exe under chcp 65001.

Two StackOverflow articles with basically the same problem: http://stackoverflow.com/questions/23416075 and http://stackoverflow.com/questions/25585248

When writing to a console set to code page 65001, WriteFile() returns the number of characters written instead of the number of bytes.
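The effect of such a miscount can be simulated without a Windows console. The sketch below is only an illustration under stated assumptions: `miscounting_write` and `write_all` are hypothetical names, and it assumes the caller retries whatever bytes it believes were left unwritten. It is not the actual Windows or PerlIO code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a console write that emits every byte it is
# given but reports the count in characters instead of bytes.
sub miscounting_write {
    my ($sink, $bytes) = @_;
    $$sink .= $bytes;            # all bytes reach the "console"
    my $chars = $bytes;
    utf8::decode($chars);        # decode a copy in place to count characters
    return length $chars;        # under-reports for multibyte text
}

# A "write until done" loop that trusts the reported count (an assumption
# about the caller, made for illustration).
sub write_all {
    my ($sink, $bytes) = @_;
    while (length $bytes) {
        my $reported = miscounting_write($sink, $bytes);
        substr($bytes, 0, $reported, '');   # drop what we believe was written
    }
}

# Three 2-byte characters followed by ASCII: 8 characters, 11 bytes.
my $bytes = "\x{e9}\x{e8}\x{ea}UVW'\n";
utf8::encode($bytes);

my $console = '';
write_all(\$console, $bytes);

# The first call reports 8 of 11 bytes, so the last 3 bytes ("W'" and the
# newline) are written a second time.
print $console;
```

With three extra bytes from the multibyte characters, the trailing three bytes are sent twice, which matches the shape of the duplicated output described in this report.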

Workaround: Inject a binmode(STDOUT, ':unix:encoding(utf8):crlf') into any perl program.

xdg commented 8 years ago

I'm boggled by the underlying issue. Possibly @leont has some insight.

About the patch itself, I'm concerned that it's overly specific to one particular pattern of layers. For example, what if someone reverses the ":crlf" and ":encoding(utf8)" layers? Or if someone wants to use an actually secure UTF-8 layer like ":encoding(UTF-8)" or ":utf8_strict" (from PerlIO::utf8_strict)?

I'd rather have something that actually deals with the state of the layers and can be smart about whether to filter out the "unix" layer or not.
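One possible shape for that idea, as a minimal sketch: record the stack with `PerlIO::get_layers` and build the `binmode` argument from it, dropping only the bottom "unix" layer. The helper names `snapshot_layers` and `layer_string` and the `drop_base` option are hypothetical, and this is not the code in Capture::Tiny.

```perl
use strict;
use warnings;

# Hypothetical helper: record a handle's output layers, bottom to top.
sub snapshot_layers {
    my ($fh) = @_;
    return [ PerlIO::get_layers($fh, output => 1) ];
}

# Hypothetical helper: build a binmode() argument from a recorded stack.
sub layer_string {
    my ($layers, %opt) = @_;
    my @stack = @$layers;
    # Drop only the bottom "unix" layer, which an open handle already has;
    # a "unix" pushed higher up was requested explicitly and is kept.
    shift @stack if $opt{drop_base} && @stack && $stack[0] eq 'unix';
    return join '', map { ":$_" } @stack;
}

my $saved = snapshot_layers(\*STDOUT);
print "would reapply: ", layer_string($saved, drop_base => 1), "\n";

# Order is preserved, so a reversed or unusual stack needs no special case.
my $example = layer_string(
    [ 'unix', 'crlf', 'unix', 'encoding(utf8)', 'utf8', 'crlf' ],
    drop_base => 1,
);
print "example: $example\n";
```

The sketch only builds the layer string. Whether passing it to `binmode` reproduces the original stack exactly depends on which layers are already on the handle at that point.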

Leont commented 8 years ago

> Unfortunately, Capture::Tiny doesn't correctly restore that binmode(STDOUT, ':unix:encoding(utf8):crlf');

It would be helpful if you told us what happens instead. And on what versions of perl you're observing this.

> Workaround: Inject a binmode(STDOUT, ':unix:encoding(utf8):crlf') into any perl program.

That sounds like an awful solution for a number of reasons; for starters you have 5 layers and are only using the top three. And it's not quite obvious to me why this would help.
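A quick way to see what a handle's stack looks like before and after such a call is to print the result of `PerlIO::get_layers`. This is a small diagnostic sketch: `describe_layers` is a hypothetical helper, a temporary file stands in for STDOUT, and the exact layers reported will differ between platforms and perl versions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: the output layer stack as one readable string.
sub describe_layers {
    my ($fh) = @_;
    return join ' ', PerlIO::get_layers($fh, output => 1);
}

my ($fh, $path) = tempfile(UNLINK => 1);
my $before = describe_layers($fh);

# The binmode() call from the report, applied on top of the default stack.
binmode($fh, ':unix:encoding(utf8):crlf')
    or warn "binmode failed: $!";
my $after = describe_layers($fh);

print "before: $before\n";
print "after:  $after\n";
```

Running the same two `describe_layers` calls around a capture would show which layers are present afterwards.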

> I'm boggled by the underlying issue. Possibly @leont has some insight.

I'm boggled by the proposed solution!

> About the patch itself, I'm concerned that it's overly specific to one particular pattern of layers.

Agreed.

klaus03 commented 8 years ago

Thanks for your replies.

I agree that my proposed patch is overly specific.

I have a strange phenomenon in my perl programs under Windows 7 (x64) using chcp 65001 where characters are mysteriously duplicated. The reason why I am so overly specific is that I have no idea why this happens, the only thing I can say is that the problem seems to go away when I use a specific binmode (":unix:encoding(utf8):crlf") on STDOUT. I apologise for this.

This is my perl -v output: This is perl 5, version 22, subversion 1 (v5.22.1) built for MSWin32-x64-multi-thread

I already discussed this question on stackoverflow: http://stackoverflow.com/questions/25585248/windows-utf-8-printed-with-chcp-65001-characters-are-mysteriously-duplicated

And I had one answer (answered Aug 30 '14 at 19:05 by Borodin):

> As I suspected, this has been reported as a failure in Windows software:
>
> This is caused by a bug in Windows. When writing to a console set to code page 65001, WriteFile() returns the number of characters written instead of the number of bytes.
>
> I wasn't aware of a work-around, but if the :unix:encoding(utf8):crlf PerlIO stack works for you then it seems you have found one.

It seems that :unix:encoding(utf8):crlf works, but I have no idea *why* it works.

I have created a gist which demonstrates the interaction between Capture::Tiny and the impact on the restored layers. https://gist.github.com/klaus03/e1910904104552765e6b

The problem (...characters are mysteriously duplicated...) shows up at markers [02] and [06] where the last two characters (...W'...) are repeated on the next line.

[02] teststr = 'IIIiUVW'
W'
[06] teststr = 'IIIiUVW'
W'

By the way, the problem goes away completely as soon as I write to files.

I hope my explanations are clear, and I am always grateful for alternative solutions.

xdg commented 8 years ago

@Leont, ping. Any ideas?

xdg commented 8 years ago

@klaus03 please try the better relayering branch. It tries harder to preserve exactly what existed before (even if wacky).

I'm going to close this PR and open a new one for that branch.