Gherkin: Delegate to Gherkin-Go prebuilt binaries

aslakhellesoy commented 6 years ago

The Gherkin parser is currently implemented in too many languages. It's too much to maintain. Each language implementation should spawn a child process (gherkin-go), read its STDOUT and deserialise the output to language-native objects using Protocol Buffers.

If you want to have a go at implementing this for one of the languages below (or a language that isn't listed here), look at #424, which implements this for Java.

[ ] Build messages library
- [x] Java
- [x] Ruby
- [x] JavaScript
- [ ] .NET
- [ ] OCaml
- [ ] Python
- [ ] Perl
- [ ] ObjC/Swift
[ ] Implement "magic file" which extracts binaries from package and/or downloads them from S3
- [x] Java
- [x] Ruby
- [x] JavaScript
- [ ] .NET
- [ ] OCaml
- [ ] Python
- [ ] Perl
- [ ] ObjC/Swift
[ ] Read and deserialise messages from gherkin-go
- [x] Java
- [x] Ruby
- [x] JavaScript
- [ ] .NET
- [ ] OCaml
- [ ] Python
- [ ] Perl
- [ ] ObjC/Swift
[ ] Delete the Gherkin parser (code, update Makefile, .rsync)
- [x] Java
- [x] Ruby
- [x] JavaScript
- [ ] .NET
- [ ] OCaml
- [ ] Python
- [ ] Perl
- [ ] ObjC/Swift
[ ] Pass all the tests :-) (make)
- [x] Java
- [x] Ruby
- [x] JavaScript
- [ ] .NET
- [ ] OCaml
- [ ] Python
- [ ] Perl
- [ ] ObjC/Swift

aslakhellesoy commented 6 years ago

FWIW there are only 348 SLOC left in gherkin-java after I deleted 10K SLOC.

What @cucumber/committers want to help out with the remaining implementations? Nothing is more satisfying than deleting code :-)

Also note that gherkin now supports a new Rule keyword (#416), so if you want that to be available in the next version of cucumber-X - this has to be done.

d-led commented 6 years ago

the bullet points are editable. Is this by design?

what about platforms where there are no gherkin-go libraries? E.g. as there's no viable Go compiler, is it out of scope?

aslakhellesoy commented 6 years ago

Go's toolchain can cross-compile binaries for most platforms, from any OS that has a Go compiler. We're currently doing that as part of the CI process. This is what we're currently building:

gherkin-darwin-386
gherkin-darwin-amd64
gherkin-freebsd-386
gherkin-freebsd-amd64
gherkin-freebsd-arm
gherkin-linux-386
gherkin-linux-amd64
gherkin-linux-arm
gherkin-linux-mips
gherkin-linux-mips64
gherkin-linux-mips64le
gherkin-linux-mipsle
gherkin-linux-s390x
gherkin-netbsd-386
gherkin-netbsd-amd64
gherkin-netbsd-arm
gherkin-openbsd-386
gherkin-openbsd-amd64
gherkin-windows-386.exe
gherkin-windows-amd64.exe

The CI process also uploads them to S3, where they are publicly accessible. Try this:

curl https://s3.eu-west-2.amazonaws.com/io.cucumber/gherkin-go/master/gherkin-darwin-amd64 -o gherkin-darwin-amd64
chmod +x gherkin-darwin-amd64
./gherkin-darwin-amd64 --json features/*.feature

If there are additional platforms we need to support I'm sure we can make it work for most of them.

aslakhellesoy commented 6 years ago

the bullet points are editable. Is this by design?

Yes, but only for committers. See https://blog.github.com/2013-01-09-task-lists-in-gfm-issues-pulls-comments/

muggenhor commented 6 years ago

Go's toolchain can cross-compile binaries for most platforms, from any OS that has a Go compiler. We're currently doing that as part of the CI process. This is what we're currently building:

This is not relevant for me at work yet, but we're targeting aarch64 (aka arm64) and I'm expecting to see more of that in the future, not less. So adding that as a target CPU seems desirable.

aslakhellesoy commented 6 years ago

@muggenhor PR for that would be much appreciated

gasparnagy commented 6 years ago

As I have mentioned earlier, having a process-execution-based parser is great for maintainability, but there are other concerns (performance, security, working offline) that make still room for native parsers. As far as I understand now the decision has been made that this repo will contain the go wrappers. Is there any advice on how/where the native parsers should be maintained (if the community wishes to do that of course)? Shall we host them under the "cucumber" organization (e.g. cucumber/gherkin-dotnet-native) or shall we move it to other organizations/users. I would keep it within cucumber for more consistency. With @SabotageAndi, we are committed to keep maintaining the .NET native parser, and personally, I would like to see a native javascript parser too (I yet have to see how I can contribute to that).

aslakhellesoy commented 6 years ago

Regarding performance - I haven't done any throrough benchmarking, but here is a small one:

# Gherkin-Java 5.1.0
time ./bin/gherkin testdata/good/*.feature > /dev/null

real    0m0.299s
user    0m0.367s
sys 0m0.068s

# Gherkin-Java 6.0.0-SNAPSHOT (master)
time ./bin/gherkin testdata/good/*.feature > /dev/null

real    0m0.313s
user    0m0.388s
sys 0m0.065s

For Java there is no noticeable degradation in performance. The go binary is very fast:

# Gherkin-Go (master)
time ./bin/gherkin testdata/good/*.feature > /dev/null

real    0m0.021s
user    0m0.007s
sys 0m0.006s

Regarding working offline: The jar currently bundles binaries for windows/linux/osx + amd64/x86_32. It only tries to download binaries from S3 for other platforms. We should probably bundle all the binaries so we never have to go online. It would increase the size of the jar a little, but disk is cheap.

@gasparnagy I hope this adresses your concerns about performance and the ability to work offline.

Let's focus on security issues next and see where that brings us. What securtity concerns do you have?

l3pp4rd commented 6 years ago

have you considered using upx for go binary compression? UPX may decrease the executable startup performance by a tiny fraction, but may reduce the binary size by more than a half, I was successfully using it for more than a year, there was one moment when it was producing a failing binary, but that was an issue on go end.

aslakhellesoy commented 6 years ago

Thanks for the upx pointer l3pp4rd - sounds like a good idea. Do you want to tweak the build to use it?

l3pp4rd commented 6 years ago

in a day or two I will be able to, if it can wait. I have a baby now, who changed my free time dramatically :) that is why I'm not that frequent contributor. but whenever I can make a small thing, I will be happy to.

aslakhellesoy commented 6 years ago

Congratulations on the baby @l3pp4rd!!!!! I don't think upx is urgent at all, take your time :-)

mxygem commented 6 years ago

Pass all the tests :-)

Are we including the Windows tests for Ruby, too ;)

aslakhellesoy commented 6 years ago

I’m referring to the shared approval tests for gherkin - comparing output with files in ‘testdata’.

mxygem commented 6 years ago

Ah, yeah. Makes sense!

mxygem commented 6 years ago

−8,020 in my PR for the Ruby stuff so far. :)

cyocum commented 6 years ago

I have a quick look over the thread. I have a couple of issues up front. First, I still need to release Cucumber.ml 1.0 as I am so close. So, my plan is to release what I have running at the moment which still uses the gherkin-c library. Once I have done that, the main feature for 2.0 will be support for the gherkin-go. I will not go into details here but there will be some jiggery-pokery with the Dune build system for OCaml. I do not think it will be much but it will still exist.

aslakhellesoy commented 6 years ago

Cool. Have a look at /.templates/s3-download as well as /.templates/*/default.mk scripts.

You can pull these into your subrepo using .rsync files and the rsync_files shell function from /scripts/functions.sh

The idea is to streamline the builds to use Make as a wrapper around platform-specific build tools.

pointo1d commented 6 years ago

a) This rework (make as a wrapper) sounds more & more like a classic autoconf/gmake use case for the provision of platform neutrality (see OpenJDK for example) - maybe a longer term goal?

b) The gherkin repo appears to conform to one of the classic CM (Configuration Management) anti-patterns whereby multiple identical copies of files (in this case testdata/) are to be found throughout a configuration; I suggest that this anti-pattern can easily be mitigated for by providing the SPoD (Single Point of Definition) of testdata/ via its' own self-contained repo, linked to the gherkin repo as a submodule - where language specific variants, should they exist, are managed using branches.

aslakhellesoy commented 6 years ago

Thanks for your insight @pointo1d. We use rsync to sync files from higher up to folders below, so maintanance is not an issue. See the rsync_files function in /scripts/functions.sh.

Git submodules are particularly cumbersome to work with when there are branches involved. This is a monorepo, so I think the solution we have now is more pragmatic.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs.

aslakhellesoy commented 5 years ago

This is mostly done now for Java, Ruby, Go and JavaScript. In January 2019 we'll remove implementations that haven't been updated from the monorepo and stop making new releases. See the Cucumber Roadmap doc for more details.

Happy to help out if you need help.

/cc @jenisys @brasmusson @pjlsergeant @gasparnagy @SabotageAndi

SabotageAndi commented 5 years ago

@aslakhellesoy I understand why you are doing this, but calling a go-binary isn't suitable in the .NET world. We need a 100% .NET parser. With .NET Core we have now a lot of more platforms to support and calling for everything the right go binary will be a maintenance nightmare for us (@gasparnagy and me). So I will not spend any time on this change. But as @gasparnagy wrote in https://github.com/cucumber/cucumber/issues/425#issuecomment-402249014 we are willing to maintain the .NET "native" parser in the future as we are doing the last years. But having no more releases when not changing to go is not an option for me.

So what are we doing with it? I am open for suggestions.

cukebot commented 5 years ago

@SabotageAndi, @gasparnagy the cross-compiled Gherkin executables are published here. For the Java/JavaScript/Ruby implementations we bundle these executables inside of the published jar/gem/npm packages and use C21e at runtime to determine which executable to run.

The C21e library is tiny and should be trivial to port to .NET Core.

Is .NET Core running on any platforms (OS/CPU architecture) that aren't covered by the cross-compiled executables?

Are there other limitations that makes it difficult to use this approach? If so, what are those limitations? I'd like to understand this before we discuss how to maintain a pure .NET implementation of Gherkin.

aslakhellesoy commented 5 years ago

From what I can tell, the platforms supported by .NET Core is a subset of the platforms Gherkin is cross-compiled for.

aslakhellesoy commented 5 years ago

(That was me accidentally commenting as cukebot above).

SabotageAndi commented 5 years ago

@aslakhellesoy We moved some parts of our Visual Extension into a separate process. And this didn't go well. So I have not really a desire to repeat that all with a central part like the Gherkin parser. And they will hit us on .NET again with this approach.

The Problems were:

Encoding Problems with StdIn/StdOut. Some customers reported problems with special chars (https://github.com/techtalk/SpecFlow/issues/1215, https://github.com/techtalk/SpecFlow/issues/1205). I had to switch to transferring the data via files, because I couldn't get the Console correct configured. Which isn't nice either.
At least one of our customers uses a white list of allowed processes to be executed. The go- binaries as our Extension files will be in a folder that is not enabled to execute everything. So the gherkin parser would suddenly not work for them anymore. And I am sure, this is not the only customer of us who has this strict security settings.

Additionally I see following problems:

We need the Gherkin parser not as .NET Core library, but as .NET Standard 2.0 library. .NET Standard is like the API levels of Android. .NET Core, .NET Framework, Xamarin (Android and iOS) and UWP are implementing this standard. The Gherkin parser has to work for all these implementations. With Xamarin and UWP you can run your tests on a device or emulator. I am not sure about if Android or iOS can execute the go-binaries, but UWP can't break out of is sandbox and execute some binary.
We use more and more the Gherkin parser at runtime also. So an output of a test project has to include these 15 files also. I think this is confusing our users, because they are not used to have dependencies as exes in their output.
Increased file size of library. The go- binaries are together 77MB. The current parser has 130kb. We will go from 130kb to 77MB in file size. How .NET packaging works, we have to put them into the NuGet package and put them into the output directories.
OS detection isn't something .NET is good at it, because we normally don't care on which OS we are running. We can probably differ between Windows, Mac and Linux, but not more via APIs. We would have to fallback to probing for the OS.

These are my biggest problems I see when we would use this approach for .NET. Sorry that I didn't said something about my concerns earlier, but I didn't understood that this will be mandatory for all Gherkin parser and all other will be removed. I thought we could simply stick to the current way for the .NET parser.

Like @gasparnagy, I would like to have the Gherkin parser for .NET under the cucumber umbrella. I have no preference if it's in a cucumber/gherkin-dotnet-native or the current gherkin-dotnet repo, but it should be in the cucumber organisation. We could also remove it from the monorepo and work in the separate repo alone. This would have the benefit to don't need the Makefile- build- process anymore and go to .NET standard tools. This would make the library more accessible for contributors that only know .NET.

aslakhellesoy commented 5 years ago

@SabotageAndi the security issues are going to be a concern. See this thread: https://groups.google.com/d/msg/cukes/ZYDs13qoVTY/OtvCvl_fBQAJ

I think we're at an impasse here.

We need to reduce the amount of duplicated development effort
Bundling cross-compiled executables will be blocked in many IT departments

I'd like to explore another avenue. Installing the executable separately. This is how WebDriver works. The various native language implementations are just thin shims that talk to a daemon running locally (chromedriver, geckodriver etc). This has to be installed separately.

Another option is to write the shared code in a language that can be compiled to many targets, such as Haxe. I don't think it's a viable option though. Few people know Haxe, so maintenance would be hard.

What do you think @charlierudolph @brasmusson @SabotageAndi @gasparnagy @mattwynne?

mpkorstanje commented 5 years ago

Installing the executable separately. This is how WebDriver works. The various native language implementations are just thin shims that talk to a daemon running locally (chromedriver, geckodriver etc). This has to be installed separately.

This makes it hard to keep projects selfcontained. It'd require some maven/gradle plugin to fetch the executable as part of the build. E.g: https://github.com/Ardesco/selenium-standalone-server-plugin

This solution would be less then perfect though. Most build systems in the java ecosystem can only download artifacts through a proxy/cache like Nexus.

aslakhellesoy commented 5 years ago

Do you have any suggestions @mpkorstanje?

mpkorstanje commented 5 years ago

I think there are exactly four options: transpile, interpret, generate and delegate. And we're evaluating all these options against the same set of constraints: should be usable and should minimize effort.

So far we've found generation and delegation to be lacking. Only interpretation and transpilation have not yet been tested.

You've added an ease of maintenance requirement. But I don't Haxe is going to any harder then it already is. We're already dealing with a parser generators for a relatively nice language. If there were more people then we wouldn't be trying to reduce the duplication of effort in the first place. So depending on how well it transpiles Haxe would be an option.

As for interpretation: I couldn't find a JavaScript interpreter for most languages.

aslakhellesoy commented 5 years ago

By delegation I assume you mean delegating to a binary executable, communicating over streams (what's in gherkin 6).

There are some problems with our current implementation:

Downloading (some malware/virus scanners don't allow downloading of binary executables)
Execution (some OSes are configured to only allow execution of "approved")

If we can address these problems, I don't think delegation is going to be a significant problem.

I propose we stop bundling the executables inside the platform-specific packages. We'll refactor the native code so it can use either of the following strategies for talking to the executable:

Over a socket (assuming it's already running)
Over STDIN/STDOUT (it would be launched just like now)

Users would have to download the binary executable in addition to installing the native package. We would need to build version negotiation into the protocol - the executable and the native package may drift out of sync.

This is the model WebDriver uses. It's a bit clunky, but it works. I can't think of a better strategy right now.

aslakhellesoy commented 5 years ago

Just adding for the record that transpilation is not an option. I don’t want to build this new infrastructure and exclude all the platforms that are unsupported by the transpiler.

mpkorstanje commented 5 years ago

I can't think of a better strategy right now.

Me neither. But I don't think it's a winning one.

mattwynne commented 5 years ago

Would it be possible to use the shared code as libraries rather than executables? Could the Go code be compiled into something that each platform could use as a library instead?

aslakhellesoy commented 5 years ago

It’s possible to compile Go code to shared libraries.

We’d still need to bundle them, so bundle sizes would be similar. Not sure if it would shut up malware detectors.

It might be worth a try!

mpkorstanje commented 5 years ago

Not sure if it would shut up malware detectors.

Should be easy to test this if we can find a jar with some native code in it. And we do!

https://jogamp.org/wiki/index.php/Downloading_and_installing_JOGL#Native_JARs_vs._native_library_files

Perhaps people can test this from behind a firewall?

https://search.maven.org/remotecontent?filepath=org/jogamp/jogl/jogl-all/2.3.2/jogl-all-2.3.2-natives-linux-amd64.jar https://search.maven.org/remotecontent?filepath=org/jogamp/jogl/jogl-all/2.3.2/jogl-all-2.3.2-natives-windows-amd64.jar

cyocum commented 5 years ago

I have been very busy lately but I have picked up on this as I have seen that languages that have not moved to the new structure will be removed in January. Honestly, I have little time right now to get Cucumber.ml into this structure. Mostly, and I hate to admit this, but I do not understand how to get the protobuf messages information from the monorepo to my repo. I have tried twice now and I am, at this point, just avoiding this because I do not have the time to figure it out on my own. I will need someone who has time to hand-hold me through the process because it seems out of my ability. If someone has the time and patience to help me, I would appreciate it but right now I do not have the time as I have another project that has my time right now. I am hoping to get it out of the way but I will need a long paring session to understand what exactly I need to do to get information that my project needs from the monorepo into mine to avoid deletion.

brasmusson commented 5 years ago

I think we should allow for a combination of approaches. The standardized test suite (feature files and corresponding golden json files) has been very successful, enabling us to growing 8 different parser implementations fulfilling the same gherkin language specification. Whatever we do we must not loose this ability. Side note: I think json golden files are better for qualifying a parser implementation than protocol buffer golden files, since it requires less of the parser implementation to show compliance to the gherkin language specification.

For the sake of argument, lets assume that we come up with an approach that works for Java, Javascript and Ruby (that is gherkin parsers to be used in Cucumber-JVM, Cucumber-JS and Cucumber-Ruby), but that approach is not suitable for the .NET environment. Then the gherkin parsers for Java, Javascript and Ruby can be developed and released together (using the mono-repo), and the gherkin for .NET can be developed and released separately still qualified for compliance to the gherkin language specification, and still hosted in the cucumber github project. The hosting in the cucucmber github project should probably require the used of the berp parser generated so that the is some consistency of implementation between the gherkin parsers hoisted here, even though not all are developed and released together.

This is similar to that not all Cucumber-JVM modules are developed and released together anymore. The non-Java language modules are released separately. Similar to a issue discussed here, this change was done to reduce the maintenance burden for the Cucumber-JVM core maintainers.

SabotageAndi commented 5 years ago

@aslakhellesoy

I'd like to explore another avenue. Installing the executable separately. This is how WebDriver works. The various native language implementations are just thin shims that talk to a daemon running locally (chromedriver, geckodriver etc). This has to be installed separately.

On .NET, Selenium and all of the webdrivers are distributed by packages. See: https://www.nuget.org/packages?q=selenium
And this is good that way. No idea how I could explain my users, that they have to install an additional tool, when you want to use some library.

Additionally, you can then only have one version installed on your system. There will be breaking changes in the IPC.
And how to you install it on third party hosted build agents (as the AzureDevOps/VSTS hosted pipelines are)?

So having a separate binary to install is also not an option in .NET.

As @mpkorstanje wrote in https://github.com/cucumber/cucumber/issues/425#issuecomment-437413076 I think it is essential, that projects are self-contained. It is really annoying for me, when I have to install a long list of additional tools/libraries to get a project to compile.

@brasmusson Don't we have this golden files here: https://github.com/cucumber/cucumber/tree/master/gherkin/testdata

About shared libraries:

AFAIK they should work on .NET Standard. I have to do some experiments. But native stuff in a .NET program makes always problems. And the size increase of the package is still there. I am still not sold, that this would be good for .NET.

brasmusson commented 5 years ago

@SabotageAndi Yes, we have the golden files, my point is just that we shall make sure not to loose them, even if we move away form that all gherkin parsers are stored in the cucumber/cucumber (mono-repo) project, and are released together.

laeubi commented 5 years ago

there are only 348 SLOC left in gherkin-java after I deleted 10K SLOC

But in contrast, the jar blows up from 342 kb to 45 MB(!), I understand that it is hard to support many parsers, but is it really the way to go to package such large binaries instead? Won'T it be better to write the parser in some kind of Parser-Language and the generate code for each language from that? Also spawning new processes might degrade performance compared to a "native" parser. This really hurts currently the cucumber-eclipse where I try to integrate the new parser, where parsing is used on several places to support the user, now parsind would include to spawn a new go-process for each parsing request.

mpkorstanje commented 5 years ago

This really hurts currently the cucumber-eclipse where I try to integrate the new parser, where parsing is used on several places to support the user, now parsind would include to spawn a new go-process for each parsing request.

Unrelated but I don't think it's worth switching to Gherkin 6 just yet for the plugin. Cucumber JVM also can't use it.

srvance commented 5 years ago

I'm looking to implement Gherkin compatibility in a new test tool and need to be able to turn feature specs to pickles in a browser. Delegating to an external Go process clearly won't work and I don't want the overhead of having to call a back end API for the functionality.

The gherkin README still documents library use, which no longer seems to exist.
What version has the older library code?

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs.

mxygem commented 5 years ago

Removed stale and added Slow Burner to keep things open for now. Consider this particular issue on hold at least until Cukenfest where there'll hopefully be some more discussions about how to address things.

Sorry folks, but thank you for the patience!

Ruschnik commented 5 years ago

McAfee reported an “Artemis!ACC6568E95DA” regarding artifact .m2\repository\io\cucumber\gherkin\6.0.14\gherkin-6.0.14.jar\gherkin-go-windows-386.exe. Because of that we downloaded gherkin-6.0.14.jar from maven central and tested it with McAfee and also here McAfee reported the same threat.

Is there anything wrong with the file regarding viruses or is it a false positive?

mpkorstanje commented 4 years ago

The Gherkin parser is currently implemented in too many languages. It's too much to maintain. Each language implementation should spawn a child process (gherkin-go), read its STDOUT and deserialise the output to language-native objects using Protocol Buffers.

Informally we have decided against continuing this approach and the latest versions of Gherkin no longer include the go executables.

cucumber / common

Gherkin: Delegate to Gherkin-Go prebuilt binaries #425