undocumented usage of perl for cc_test

ahippler commented 6 years ago

Description of the problem / feature request:

cc_test uses a inline perl script for failed tests. https://github.com/bazelbuild/bazel/blob/eb067ea88749a5635cc8ee8954cde2b767f1eb61/tools/test/test-setup.sh#L153

Feature requests: what underlying problem are you trying to solve with this feature?

The usage of perl is not documented. Windows does not have Perl installed by default.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

execute a failing test on windows will print:

==================== Test output for //Test/Unit:Unit (shard 1 of 8):

[...]

external/bazel_tools/tools/test/test-setup.sh: line 152: perl: command not found
================================================================================

What operating system are you running Bazel on?

Windows 10

What's the output of `bazel info release`?

0.10.1

The perl script replaces invalid XML characters and invalid sequence in CDATA. To get rid of perl bash or python could be used.

ulfjack commented 6 years ago

No, not that I'm aware of. IIUC, Buildfarm ships with an empty docker image by default, and our internal implementation is also bring-your-own-docker image. No guarantees on what's in there, and I don't see why a C++-only project (for example) would add a JRE to their remote execution docker images.

laszlocsomor commented 6 years ago

Do you know what is in that empty docker image? Is Bash part of it on Linux/macOS? Is PowerShell in it on Windows?

jmillikin-stripe commented 6 years ago

The default buildfarm image is based on gcr.io/distroless/java, which does include a JRE. Notably it doesn't include /bin/bash so it can't execute genrules. We (Stripe) use a different base image that includes Bash and Coreutils, and I expect all users of fully-remote builds will have similar custom images.

ulfjack commented 6 years ago

Right. I expect that requiring bash is fine for Linux and MacOS images. I would prefer not to require any higher-level language (like Java or Python, or even C/C++) given that there will be projects who don't want that as an additional dependency for their images. I expect that PowerShell is fine for Windows as the minimal requirement. The downside is that we end up having to maintain two implementations.

@jmillikin-stripe do you currently have Java in your remote build images? If so, why?

jmillikin-stripe commented 6 years ago

We have Java in our remote build images because https://github.com/bazelbuild/bazel-buildfarm is implemented in Java, and needs the JRE to run its _deploy.jar binaries.

See the bazel buildfarm worker.container container image target -- anyone using this standard upstream image as a basis will have Java available.

laszlocsomor commented 6 years ago

I thought about this more. Bazel requires at least Windows 7 / Server 2008 R2 as the host platform, so we could require that as a minimum execution platform too.

Compiled C++ binaries are portable across Windows versions (at least within the same architecture), though linking against MSVCRTxx.DLL requires the right DLLs to be installed on the execution machine.

AFAIK Windows Server versions (at least the Core ones) don't come with these DLLs preinstalled, but there's another option -- Win 7 / Server 2008 R2 both have .NET 3.5 preinstalled.

I don't know how the compatibility matrix looks like for .NET applications built for one version of the framework running against another version, but it seems that a C# binary built for .NET 4.0 or 4.5 can happily run on Win Server 2008 R2 (which should only have .NET 3.5). At the some time a binary linked against .NET 3.5 didn't run on my Win Server 2016 VM without wanting to install the right framework first.

Either way, .NET seems like an option we should explore further: it doesn't require any extra DLLs, we can use any .NET language (including Java, just compile it as J#), precompile test-setup in it, and bundle it with the Bazel binary.

WDYT?

laszlocsomor commented 6 years ago

(including Java, just compile it as J#)

I'm old and living under an old rock. Apparently J# is no longer a thing. (https://en.wikipedia.org/wiki/J_Sharp)

nlopezgi commented 6 years ago

sorry for the late reply @laszlocsomor JRE is not always available on remote machines. For rules that need to use it, we recommend they do via toolchain rules - not via PATH or via a hardcoded path (https://docs.bazel.build/versions/master/remote-execution-rules.html#invoking-build-tools-through-toolchain-rules).

My 2c on the wider conversation:

for SingleJar and iJar in remote execution, we have to compile them on the remote execution environment. This effectively means that, currently, to use any rules that require SingleJar or iJar, your remote container needs to have a C compiler installed. So unfortunately, a remote execution container to build e.g., only Java code, needs a C compiler.
For the record, for remote execution we highly dis-recommend using pre-compiled binaries (https://docs.bazel.build/versions/master/remote-execution-rules.html#managing-platform-dependent-binaries) even for languages that you would typically think that binaries are compatible. e.g., we have found certain Linux binary versions of the jar tool can only be used on certain versions of Linux.
All remote execution containers are expected to have bash

laszlocsomor commented 6 years ago

@nlopezgi : thanks for the info!

FYI, meanwhile I finished the draft of my design doc on Bash-less (and Perl-less) test execution on Windows, see https://github.com/bazelbuild/proposals/pull/16. In that doc I argue for using a precompiled Windows binary, because Bazel only runs on x86_64 Windows and AFAIK remote execution also only runs x86_64 Windows, so binaries are portable.

Are you suggesting I could safely choose C++ as the implementation language for Linux too, and ship the source code with Bazel, and a C++ compiler is always available on the remote execution machine to build it? If so, is that going to stay that way for the foreseeable future (say, 1 year)?

nlopezgi commented 6 years ago

I think the decision to use pre-compiled binary incurs in some technical debt that someone might end up having to pay down the line (e.g., whenever we decide to allow something else other than x86_64 Windows), but I understand there might not be other feasible solutions.

wrt using C++, we do have a restriction of currently needing C++ compiler for any builds that require use of java rules. I'm not sure we want to extend that to also be a requirement for all test rules. I do think that some language tools will always be needed to compile so called 'embedded tools' (which is a non-issue for local execution). These language tools thus will necessarily be required in remote execution containers, and a C++ compiler is a good choice as we offer a very stable and up to date one for use with remote execution containers (for Linux only, though).

laszlocsomor commented 6 years ago

Thanks.

Re: technical debt: I hear you, but your point sounds rather speculative, so I'd not worry about non-x64 Windows platforms for the foreseeable future (say 1 year).

Re: compiler: since you don't want to require it for tests, which I also agree with, I think C++ is not the right language for the test wrapper on Linux. (It still seems like the right choice on Windows.)

But then I don't know what's the right choice for Linux, if only Bash is promised (without Perl). Maybe a clever sed program could help?

laszlocsomor commented 6 years ago

@nlopezgi , another question: do you know how Bazel cancels a remotely running test action in case the user presses Ctrl+C? Does Bazel dispatch this to the remote service or does it just close the connection?

nlopezgi commented 6 years ago

re: compiler: I did not state I did not want to require it for tests, just that I was not sure its the right choice and don't want to be the one to make it w/o at least conferring with some other folks (I'll get back to you once I've confirmed). imo, if some tool is to be required for all tests, i'd rather it is the c++ compiler (instead of perl or python), but not sure what the trade-offs (effort/maintenance) between c++ vs a clever sed program would be.

re: canceling a test: not sure, you'd want to ask @ola-rozenfeld about api details

laszlocsomor commented 6 years ago

I was too optimistic with the "clever sed program". The task is to UTF-8-decode an octet-stream, test decoded characters if they fall into any of some disjoint ranges and replace them with "?", then UTF-8-encode the result again.

I'm not aware of an efficient way to do this without encoding and decoding.

ulfjack commented 6 years ago

You don't need to decode and re-encode - you can simply test on the utf-8 representation using seds regexp support.

laszlocsomor commented 6 years ago

How? Does sed support matching UTF-8 strings?

ulfjack commented 6 years ago

sed supports matching binary, and you know the utf-8 encoding, so you can check for specific utf-8 ranges, like so:

echo "ä ö ü" | LANG=C sed -e "s/[\xc0-\xdf][\x00-\xff]/?/g"

Here, I'm replacing all two-byte utf-8 sequences with a single '?' character. See https://en.wikipedia.org/wiki/UTF-8 for the multi-byte ranges.

ulfjack commented 6 years ago

After a lot of trial and error, I've come up with a sed script that - I think - does what we want:

cat test.log | LANG=C sed -E \
  -e 's/.*/& /g' \
  -e 's/(([\x9\xa\xd\x20-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xec][\x80-\xbf][\x80-\xbf]|[\xed][\x80-\x9f][\x80-\xbf]|[\xee-\xef][\x80-\xbf][\x80-\xbf]|[\xf0][\x80-\x8f][\x80-\xbf][\x80-\xbf])*)./\1?/g' \
  -e 's/(.*)\?/\1/g'

ulfjack commented 6 years ago

First, add a single white space character (' ') to the end of each line. Second, replace all (possibly empty) sequences of legal characters followed by a character with the sequence of legal characters followed by a question mark character ('?'). Third, remove the trailing question mark character ('?') from each line.

laszlocsomor commented 6 years ago

Hats off, that's quite impressive.

I think you made a couple mistakes:

you need to cover the entire 2-octet UTF-8 domain (U+80 .. U+7FF), that is "e0 80 80 .. ed 9f bf", so that part of the regex should be [\xe0-\xed][\x80-\x9f][\x80-\xbf], whereas yours is split up into [\xe0-\xec][\x80-\xbf][\x80-\xbf] and [\xed][\x80-\x9f][\x80-\xbf]
the final allowed UTF-16 range is U+10000 .. U+10FFFF, that is "f0 80 80 80 .. f4 8f bf bf", so the regex should be [\xf0-\xf4][\x80-\x8f][\x80-\xbf][\x80-\xbf], but yours is [\xf0][\x80-\x8f][\x80-\xbf][\x80-\xbf]

Could you compare your results with mine?

laszlocsomor commented 6 years ago

Sorry, I got confused, gimme a minute to correct this.

laszlocsomor commented 6 years ago

you need to cover the entire 2-octet UTF-8 domain (U+80 .. U+7FF)

This is wrong. I meant to say: you need to cover the U+80..U+7FF (two UTF-8 octets) and U+800..U+D7FF (three UTF-8 octets) ranges. The 2-octet domain is covered correctly ([\xc0-\xdf][\x80-\xbf]), but I think the 3-octet-matching regex is wrong in your solution.

ulfjack commented 6 years ago

Ok, how about this:

[\x9\xa\xd\x20-\x7f]                         <--- (9,A,D,20-7F)
[\xc0-\xdf][\x80-\xbf]                       <--- (0080-07FF)
[\xe0-\xec][\x80-\xbf][\x80-\xbf]            <--- (0800-CFFF)
[\xed][\x80-\x9f][\x80-\xbf]                 <--- (D000-D7FF)
[\xf0-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf] <--- (010000-10FFFF)

ulfjack commented 6 years ago

Wait, there's still one range missing. Gah!

laszlocsomor commented 6 years ago

Ah I see where I was wrong, you are right to match 0800-CFFF and match D000-D7FF separately.

ulfjack commented 6 years ago

Another try:

[\x9\xa\xd\x20-\x7f]                         <--- (9,A,D,20-7F)
[\xc0-\xdf][\x80-\xbf]                       <--- (0080-07FF)
[\xe0-\xec][\x80-\xbf][\x80-\xbf]            <--- (0800-CFFF)
[\xed][\x80-\x9f][\x80-\xbf]                 <--- (D000-D7FF)
[\xee][\x80-\xbf][\x80-\xbf]                 <--- (E000-EFFF)
[\xef][\x80-\xbe][\x80-\xbf]                 <--- (F000-FFEF)
[\xef][\xbf][\x80-\xbd]                      <--- (FFF0-FFFD)
[\xf0-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf] <--- (010000-10FFFF)

ulfjack commented 6 years ago

Ok, I wrote a small Java program to double-check the pattern, and it returned the expected ranges:

9-a,d-d,20-d7ff,e000-fffd,10000-10ffff

laszlocsomor commented 6 years ago

I'm glad we are free now.

ulfjack commented 6 years ago

Ok, I have a patch which conflicts with my other changes to test-setup.sh. Both are a bit risky, and we need to pick one to be merged first.

laszlocsomor commented 6 years ago

If you have changes lined up, merge those first. This bug has been open for a long time, it's OK to wait another day.

ulfjack commented 6 years ago

Patch is here: https://bazel-review.googlesource.com/c/bazel/+/68711

There are still a couple possible issues with this that we need to look into.

My primary concern is what should happen if the default charset of the current machine is NOT UTF-8. The previous Perl solution was broken as well: it did a charset conversion from the default charset to UTF-8 on input, but also from UTF-8 to the default charset on output, which can actually re-introduce illegal characters (we might want to file a bug for that, or note it on the existing bug), breaking the resulting XML.

The new code intentionally does not perform any charset conversion.

Ideally, we'd do a charset conversion from the default charset to UTF-8 before running the sed script. Note that I override the LOCALE before running sed, which probably sets the default charset to ISO-8859-1 (this needs to be double-checked!).

find a way to do charset conversion from the default charset to utf-8; we might be able to get sed to do that for us, but we need to prevent a charset conversion on output
make sure to set the default charset temporarily to ISO-8859-1 for the sed invocation
make sure that all copies of this code are updated

ulfjack commented 6 years ago

(I'm on vacation for the rest of August, and won't be able to work on this until I'm back.)

laszlocsomor commented 6 years ago

Thanks!

The previous Perl solution was broken as well: it did a charset conversion from the default charset to UTF-8 on input, but also from UTF-8 to the default charset on output, which can actually re-introduce illegal characters (we might want to file a bug for that, or note it on the existing bug), breaking the resulting XML.

Please elaborate. How and where exactly are the conversions done?

(I'm on vacation for the rest of August, and won't be able to work on this until I'm back.)

Do you intend to finish this task yourself or to appoint it to someone else (and if so, whom)?

ulfjack commented 6 years ago

If it's still open when I'm back, I'll do it. If someone comes in and finishes my changes, I'm happy, too.

Perl does implicit charset conversion on every read and write from a file or stdin/stdout. I believe it converts from the platform default charset to UTF-8 internally. That's my reading of the docs, anyway.

ulfjack commented 5 years ago

I have a pending patch.

Profpatsch commented 5 years ago

Would be awesome to see that line go. It’s the last remaining use of perl in bazel from what I can see (I already removed the other use in https://github.com/bazelbuild/bazel/pull/5999).

Go @ulfjack! :cake: :running_man:

bazelbuild / bazel