bk2204 / muter

tool for converting data to and from various formats

Converting stdin to hex fails on Windows #4

Closed mardukbp closed 2 years ago

mardukbp commented 2 years ago

On Windows echo adds a carriage return, which muter does not like:

> echo a | muter -c hex
610d0a
> echo a | muter -c hex | muter -c -hex
muter: invalid sequence for codec 'hex':: [13, 10]

Thanks a lot for this awesome program!

bk2204 commented 2 years ago

Hey,

Thanks for the report. I'm not able to reproduce this since I'm on Linux, but here's what I've tried:

$ printf 'a\r\n' | muter -c hex
610d0a
$ printf 'a\r\n' | muter -c hex | muter -c -hex

My guess as to what's happening here is that Windows somehow always appends a CRLF to standard output (or maybe standard input) if the format doesn't provide one. If so, that's unfortunate, but we internally have a strict and non-strict mode, and so I think we can just swallow the CRLF bytes in this case in the non-strict mode.
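To make the strict versus non-strict distinction concrete, here is a minimal Python sketch. It is not muter's implementation (muter is written in Rust); the function name `decode_hex` and the `strict` parameter are hypothetical, and it assumes non-strict mode ignores only CR and LF bytes.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def decode_hex(data: bytes, strict: bool = True) -> bytes:
    """Decode hex digits to bytes (hypothetical helper, not muter's code)."""
    if not strict:
        # Assumption: non-strict mode silently drops CR and LF bytes.
        data = bytes(b for b in data if b not in b"\r\n")
    # Anything that is still not a hex digit is rejected in both modes.
    bad = [b for b in data if b not in HEX_DIGITS]
    if bad:
        raise ValueError(f"invalid sequence for codec 'hex': {bad}")
    return bytes.fromhex(data.decode("ascii"))


# The input from the report: hex digits followed by a stray CRLF.
print(decode_hex(b"610d0a\r\n", strict=False))
```

With `strict=True`, the same input raises an error listing the offending bytes 13 and 10, which mirrors the message in the original report.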

The Rust standard library documentation does say that it only handles UTF-8 byte sequences, which will cause problems in a bunch of cases if your data isn't valid UTF-8, but I think we can just document that as a limitation on Windows.

Anyway, I'll try to get a patch out relatively soon with a fix for the hex codec and similar codecs. There are likely a couple that will need fixing.

mardukbp commented 2 years ago

On Windows, echo always appends a new line to its output. There is a way to prevent it, but that is not actually the problem. On Windows, lines (in files and in the shell) end with CRLF, which is two valid UTF-8 byte sequences (CR and LF). Therefore, proper handling of textual data on Windows requires taking this into account. There is no fundamental limitation or conflict with Rust. Windows is just different from macOS and Linux. I use all three of them. That is why I am interested in muter functioning properly everywhere :)

bk2204 commented 2 years ago

echo appends a newline to its output on Linux as well. The problem you're seeing isn't echo. If it were echo, then we'd see an error from echo a | muter -c hex, which we don't. What we see here is that muter -c -hex gets a CRLF sequence, which muter -c hex doesn't emit. In fact, on my Linux system, I use zsh, and I get the following:

printf 'a\r\n' | muter -c hex
610d0a%

That final % is actually printed inverted and it's printed by zsh because there's no newline at the end of the line. muter doesn't print any line endings by default (since sometimes line endings matter or we don't print text output).

So what I need to look into, which I will once I get a temporary Windows VM set up, is why we get a needless CRLF here, which we're not supposed to. I have some ideas about how to handle that, but I have to see what Windows does in this case to do some testing.

What shell are you using in this case? Is it CMD, PowerShell, Git Bash, or something else?

mardukbp commented 2 years ago

Yes, you are right. On Linux, echo appends \n to its output unless you pass the -n flag. Like I said, on Windows echo appends \r\n to its output (in both CMD and PowerShell). So yes, it is supposed to be there. Likewise, encoding and decoding a text file on Windows fails for the same reason. On Windows, lines are separated by \r\n (CRLF). Just try creating a text file muter.txt containing two lines in Notepad and running muter -c hex muter.txt | muter -c -hex. You will get the same error.

bk2204 commented 2 years ago

muter -c -hex decodes only hex characters. It isn't designed to accept anything that is not a hex character, and the fact that it accepts a trailing newline isn't intended; in other words, it's a bug that that happens to work. I literally just discovered this fact a few minutes ago.

That's because by default muter operates in strict mode, and it's supposed to reject anything that isn't a valid character in the stream. There is a little bit of support for non-strict mode in the code, but I haven't gotten there fully yet. It's tricky because if someone inserts a very large number of invalid characters into the stream, with the current design we might end up never making progress.
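As an illustration of the progress concern, the following Python sketch decodes a stream chunk by chunk. It assumes a chunk-at-a-time design and is not a description of muter's internals; `decode_hex_stream` is a hypothetical name. Each chunk is fully consumed even when it contains nothing but ignorable bytes, so the decoder keeps moving although it may emit empty output for that chunk.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def decode_hex_stream(chunks):
    """Yield decoded bytes for each input chunk, ignoring CR and LF."""
    pending = b""  # a leftover hex digit waiting for its partner
    for chunk in chunks:
        digits = pending + bytes(b for b in chunk if b not in b"\r\n")
        if any(b not in HEX_DIGITS for b in digits):
            raise ValueError("invalid byte in hex stream")
        # Decode only complete pairs; carry an odd digit to the next chunk.
        usable = len(digits) - (len(digits) % 2)
        pending = digits[usable:]
        yield bytes.fromhex(digits[:usable].decode("ascii"))
    if pending:
        raise ValueError("odd number of hex digits")


# The second chunk holds only line endings and yields an empty result.
parts = list(decode_hex_stream([b"6", b"\r\n\r\n", b"10d", b"0a\r\n"]))
print(parts)
```

A design that instead waits for output before consuming more input could stall on a long run of ignorable bytes, which is one plausible reading of the difficulty described above.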

There are other codecs that do accept newlines or CRLF as part of the stream, like uri, since some characters may be encoded, and others may not. Therefore, in some cases, someone could intentionally insert an LF or CRLF into the stream and want it to be an LF or CRLF. However, an LF or CRLF is never part of a hex-encoded stream, so they're not supposed to be allowed there.

If you want to strip off trailing newlines or CRLF in the meantime, then you can do this:

$ echo a | muter -c hex | muter -c -wrap:-hex
$ echo a | muter -c hex | muter -c -crlf:-wrap:-hex
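Conceptually, each chain is a left-to-right pipeline of byte transforms. The Python sketch below models that idea under stated assumptions: that `-crlf` turns CRLF into LF and that `-wrap` removes LF line breaks, which matches the effect described here but is not taken from muter's source. All function names are hypothetical.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def reverse_crlf(data: bytes) -> bytes:
    # Assumed behavior of -crlf: convert CRLF line endings to LF.
    return data.replace(b"\r\n", b"\n")


def reverse_wrap(data: bytes) -> bytes:
    # Assumed behavior of -wrap: remove LF line breaks.
    return data.replace(b"\n", b"")


def reverse_hex(data: bytes) -> bytes:
    # Strict hex decoding: only hex digits are accepted.
    if any(b not in HEX_DIGITS for b in data):
        raise ValueError("invalid sequence for codec 'hex'")
    return bytes.fromhex(data.decode("ascii"))


def run_chain(data: bytes, steps) -> bytes:
    """Apply each transform in order, feeding the output of one to the next."""
    for step in steps:
        data = step(data)
    return data


# Hex output followed by the CRLF that the shell added to the pipe.
print(run_chain(b"610d0a\r\n", [reverse_crlf, reverse_wrap, reverse_hex]))
```

The stray line ending is removed before the strict hex decoder sees the data, so the decoder itself does not need to change.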

What I've found here is that this is intrinsically related to the fact that the process is being run in a PowerShell or CMD window. When I run muter in one of those shells, the pipe always contains a CRLF at the end, even though muter doesn't output one. That's not the expected behavior, and that's why this is happening. This problem doesn't occur in a Git Bash window, and so things work there.

It looks like this is a known issue with PowerShell. That's unfortunate, because muter is designed to work on streams of bytes and those bytes specifically don't have to be text at all.
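A simplified model of why a text-oriented pipe is a problem for byte streams is sketched below in Python. It assumes the shell reads a program's output as lines of text and writes each line back out with a trailing CRLF; `text_pipe` is a hypothetical name, and real PowerShell behavior differs in detail and by version.

```python
def text_pipe(raw: bytes, newline: bytes = b"\r\n") -> bytes:
    """Model a pipe that treats data as text lines (an assumption)."""
    # Bytes that are not valid UTF-8 are replaced, so they are lost.
    lines = raw.decode("utf-8", errors="replace").splitlines()
    # Every line is re-emitted with a line ending, even if the input had none.
    return b"".join(line.encode("utf-8") + newline for line in lines)


# Hex output without any line ending gains one on the way through.
print(text_pipe(b"610d0a"))
# Arbitrary binary data does not survive the round trip.
print(text_pipe(b"\xff") == b"\xff")
```

Under this model, output that ends without a newline still arrives with a CRLF appended, and non-text bytes are altered in transit.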

I'll try to work on getting this to work a little better, but it may take me a bit of time to get this sorted finally. I do want to point out that Windows isn't a supported platform for my projects and it isn't tested there, although I'll see what I can do to make it work as well as possible.

bk2204 commented 2 years ago

Sorry it's taken me so long to get back to this. I have a branch at https://github.com/bk2204/muter/tree/crlf-improvements which should help improve some of this with the --no-strict flag. There's additional documentation in the manual page as well, outlining the example I gave of how to make this work with the existing version.

bk2204 commented 2 years ago

This should be fixed with d7f9152c9b7f2f21b84bed13ac60f9c70f17688f.

mardukbp commented 2 years ago

Thanks a lot for fixing this issue! I just tested it.

echo a | muter -c hex | muter -c -wrap:-hex works as expected, but

echo a | muter -c hex | muter -c -crlf:-wrap:-hex prints the usage instructions.

bk2204 commented 2 years ago

What specific output do you get when running echo a | muter -c hex | muter -c -crlf:-wrap:-hex? Can you copy and paste the output?

mardukbp commented 2 years ago

PS> echo a | muter -c hex | muter -c -crlf:-wrap:-hex
muter
Encodes and decodes byte sequences

USAGE:
    muter.exe [FLAGS] [OPTIONS] --chain <CHAIN> [INPUT]...

FLAGS:
    -h, --help       Prints help information
    -r, --reverse    Reverse transforms in both order and direction
    -V, --version    Prints version information

OPTIONS:
        --buffer-size <buffer-size>    Size of buffer
    -c, --chain <CHAIN>                List of transforms to perform

ARGS:
    <INPUT>...    Input files to process

Modify the bytes in the concatentation of INPUT (or standard input) by using the
specification in CHAIN.

CHAIN is a colon-separated list of encoding transform.  A transform can be
prefixed with - to reverse it (if possible).  A transform can be followed by one
or more comma-separated parenthesized arguments as well.  Instead of
parentheses, a single comma may be used.

For example, '-hex:hash(sha256):base64' (or '-hex:hash,sha256:base64') decodes a
hex-encoded string, hashes it with SHA-256, and converts the result to base64.

If --reverse is specified, reverse the order of transforms in order and in sense.

The following transforms are available:
  ascii85
    bare      : do not use delimiters
  base16
    lower     : use lowercase letters
    upper     : use uppercase letters
  base32
    nopad     : do not pad incomplete sequences with =
    pad       : pad incomplete sequences with =
  base32hex
    nopad     : do not pad incomplete sequences with =
    pad       : pad incomplete sequences with =
  base64
    nopad     : do not pad incomplete sequences with =
    pad       : pad incomplete sequences with =
  bubblebabble
  checksum
    adler32   : use Adler32 as the checksum
    fletcher16: use Fletcher16 as the checksum
  crlf
  deflate
  form
    lower     : use lowercase letters
    upper     : use uppercase letters
  gzip
  hash
    blake2b   : use BLAKE2b as the hash
    blake2s   : use BLAKE2s as the hash
    length    : specify the digest length in bytes for BLAKE2b, BLAKE2s, and BLAKE3
    md5       : use MD5 as the hash
    sha1      : use SHA-1 as the hash
    sha224    : use SHA-224 as the hash
    sha256    : use SHA-256 as the hash
    sha3-224  : use SHA3-224 as the hash
    sha3-256  : use SHA3-256 as the hash
    sha3-384  : use SHA3-384 as the hash
    sha3-512  : use SHA3-512 as the hash
    sha384    : use SHA-384 as the hash
    sha512    : use SHA-512 as the hash
  hex
    lower     : use lowercase letters
    upper     : use uppercase letters
  identity
  lf
    empty     : print nothing if the input is empty
  modhex
  quotedprintable
    length    : wrap at specified line length (default 76; 0 disables)
  swab
    length    : handle chunks of this size
  uri
    lower     : use lowercase letters
    upper     : use uppercase letters
  url64
    nopad     : do not pad incomplete sequences with =
    pad       : pad incomplete sequences with =
  uuencode
  vis
    cstyle    : encode using C-like escape sequences
    glob      : encode characters recognized by glob(3) and hash mark
    nl        : encode newline
    octal     : encode using octal escape sequences
    sp        : encode space
    space     : encode space
    tab       : encode tab
    white     : encode space, tab, and newline
  wrap
    length    : wrap at specified line length (default 80)
  xml
    default   : use XML entity names
    hex       : use hexadecimal entity names for XML entities
    html      : use HTML-friendly entity names for XML entities
  zlib

bk2204 commented 2 years ago

Can you verify what version you're running? 0.7.0 doesn't even build for me on Windows.

bk2204 commented 2 years ago

Also, due to what I've found out about PowerShell's pipes and how they handle binary data, I think I'm going to declare PowerShell explicitly unsupported as an environment for this project. I don't think there's any way it can reasonably work in that environment and since trying to reproduce a problem on Windows takes about an hour of setup for me with a time-limited VM, I don't believe it's a good use of my time to try to paper over its shortcomings.

mardukbp commented 2 years ago

gettext-rs does not compile on Windows due to the usage of a tar flag that only GNU tar has. The GitHub issue is still open. Therefore, in order to compile muter 0.7.0 I replaced gettext-rs with gettext (pure Rust implementation) and commented out the init call (gettext must be initialized using a different method, which I didn't care to use).

FYI, PowerShell is cross-platform. I just installed it on Fedora 35 and obtained the same results as on Windows.

On Windows I use lots of Rust programs precisely because they work on every platform, so I can use the same tools on macOS and Linux as well. So I know for sure that it is possible to write CLI programs that work in PowerShell. Of course you are free to give up on it. Thank you anyway for all the time you have invested in this issue.

bk2204 commented 2 years ago

I understand that it's possible to write software that's cross platform. However, as someone who primarily uses Linux, my programs are focused around Unix systems, since they're easiest for me to support and be knowledgeable about. Windows and PowerShell are very different and since I don't use them or personally care for them very much (and when I do use Windows, it's always with WSL), it's hard for me to be knowledgeable about them or test on them. I can probably try to follow up on things in WSL, however, since the additional burden of supporting it would be minimal.

It may be that PowerShell is available for Fedora, but I use Debian, and it has not been packaged there because it contains non-free components. As such, it's not feasible for me to test with it in any meaningful way on a periodic basis.

You are welcome to submit patches if you'd like to improve the experience there, and I will consider them as appropriate, but I'm unable to support PowerShell in any meaningful way. You are also free not to use Muter, of course, if you'd prefer, or to live with its limitations. Sorry we couldn't make things work out better.

mardukbp commented 2 years ago

I downloaded the PowerShell RPM from GitHub. I will try to find time to see what is going on with muter and PowerShell and submit a PR. Thanks again for taking the time to address this issue and writing all these thoughtful responses.