Add a "Format" button in the Editor

jiegillet commented 2 years ago

For languages that have formatting tools (Elixir's mix format, Elm's elm-format, Go's gofmt, etc...), especially if the formatting tool is a community standard, it would be great to have a "Format" button on the online editor.

It would involve creating new repos for languages that want to support this.

jiegillet commented 2 years ago

I was discussing about this issue with @angelikatyborska and @neenjaw, Tim basically teased me into creating an Elixir/Phoenix app that could do this service for "all" track languages. So I threw together a very basic proof of concept here: https://github.com/jiegillet/exercism-formatter.

It works fine (for 2 hours of work put int):

➜  exercism_formatter git:(main) curl -X POST -H "Content-Type: application/json" -d '{"code" : "module Hello exposing (..) \nmain=1+1"}' http://localhost:4000/api/format/elm
{"formatted_code":"module Hello exposing (..)\n\n\nmain =\n    1 + 1\n","status":"success"}%       

➜  exercism_formatter git:(main) curl -X POST -H "Content-Type: application/json" -d '{"code" : "module Hello exposing (..) \n       main=1+1"}' http://localhost:4000/api/format/elm
{"error":"ERRORS\n\u001B[36m-- SYNTAX PROBLEM -------------------------------------------- /tmp/crkoivgjupzu\n\n\u001B[0mI ran into something unexpected when parsing your code!\n\n2|        main=1+1\u001B[31m\n          ^\u001B[0m\nI am looking for one of the following things:\n\n    end of input\n    whitespace\n\n\nProcessing file /tmp/crkoivgjupzu\n","status":"formatting error"}%           

➜  exercism_formatter git:(main) curl -X POST -H "Content-Type: application/json" -d '{"code" : "defmodule A do\n   def foo   ,  do: :ok \n end"}' http://localhost:4000/api/format/elixir
{"formatted_code":"defmodule A do\n  def foo, do: :ok\nend\n","status":"success"}%                 

➜  exercism_formatter git:(main) curl -X POST -H "Content-Type: application/json" -d '{"code" : "defmodule A do\n   def foo   ,  do: :ok"}' http://localhost:4000/api/format/elixir 
{"error":"mix format failed for file: /tmp/jmttdlzotjge\n** (TokenMissingError) /tmp/jmttdlzotjge:2:24: missing terminator: end (for \"do\" starting at line 1)\n    (elixir 1.12.0) lib/code.ex:978: Code.format_string!/2\n    (mix 1.12.0) lib/mix/tasks/format.ex:418: Mix.Tasks.Format.format_file/2\n    (elixir 1.12.0) lib/task/supervised.ex:90: Task.Supervised.invoke_mfa/2\n    (elixir 1.12.0) lib/task/supervised.ex:35: Task.Supervised.reply/5\n    (stdlib 3.15) proc_lib.erl:226: :proc_lib.init_p_do_apply/3\n","status":"formatting error"}%

So I wanted to throw the idea in the ring. Of course, there are many things to consider, such as

The safety of the formatters
How flexible they are to use (ideally one simple command stdin -> stdout)
All other apps in exercism are docker-executables, not web servers
How nicely 50, sometimes overlapping, languages and language tools can coexist in one environment
How 30 maintainers can work together on one repo built with a language they may not master
How beefy the server would need to be (Elixir can go the distance of course), compared with 30 docker apps
Restricting the use to Exercism with some kind of API key

Happy to discuss more :)

joshgoebel commented 2 years ago

Restricting the use to Exercism with some kind of API key

Surely network topology can solve this issue?

I'd definitely vote for a micro-service here (assuming formatting is safe) vs spawning dockers every time a single format is needed for a single language... though it's entirely possible that a small fleet of dockers might run/host this micro-service...

iHiD commented 2 years ago

Thanks @jiegillet.

I think the main two things are:

Are the commands (e.g. mix format) 100% safe to run for all languages? If they are then things are quite simple. If they're not, then a webserver approach doesn't work, as I could take control of the webserver with a command and break everything.
It's probably much easier to have multiple docker images with the minimal setups than one large docker image, as the minimal setup work is already largely done for the test-runners.

Once we've answered (1) we'll be able to move forward, so I suspect the key route is to:

Make a list of the languages that would want this
Determine which are 100% safe to run.

Determining if something is safe is basically "Can running this execute code the user uploads". For languages that have macros as part of compilation, then the answer may be yes although it intuitively feels like it should be no. For example, lots of languages cannot safely have their compiled ASTs extracted without potentially running user-uploaded macros.

Re the hosting, if they're safe, then using Lambda to do this is trivial. You just upload the docker images and they'll just insta-spawn whenever needed and then shut down again. The extra time is only 500ms and you only pay for the time things are on. The snippet-extractor does this for example. However, if they're not safe, then all this gets much more complex and you need an infrastructure like what we have for representers.

neenjaw commented 2 years ago

If by tease you mean suggest that if you can successfully write analyzers and parser combinators then surely you can write a one-route webhook proof of concept, then yes, tease :D

SleeplessByte commented 2 years ago

Are the commands (e.g. mix format) 100% safe to run for all languages? If they are then things are quite simple. If they're not, then a webserver approach doesn't work, as I could take control of the webserver with a command and break everything.

The answer is no. There have been several CVEs across languages where specifically crafted content could break through the barrier and turn into a remote executing exploit.

That said. I do want this for JS/TS!

joshgoebel commented 2 years ago

...several CVEs across languages... break through the barrier... a remote executing exploit.

Ugh. This is why we can't have nice things. 😭

Cant we do it in browser for sure for JS/TS?

SleeplessByte commented 2 years ago

Cant we do it in browser for sure for JS/TS?

Yes. prettier runs in the browser :)

iHiD commented 2 years ago

I think the right next step is to get a couple of spikes for various languages.

I think we need:

Dockerfile for the language (based on representer is probably easiest)
Takes two arguments:
- 1. The file to read from.
- 1. The file to write to.
Normal rules apply (works offline, etc)

And then it's just some work in building out infrastructure for me. Thoughts?

iHiD commented 2 years ago

Btw, I'm presuming in this that formatting works with one file, rather than needing the whole project. Does this hold true across languages?

ErikSchierboom commented 2 years ago

I really like this idea. I could easily do a spike for C#/F#, they both have formatters that are fairly easy to use.

Normal rules apply (works offline, etc)

All the regular instructure restrictions should indeed apply (read-only file system, no networking). @SleeplessByte Do you perhaps have a link to one of the CVEs to get an idea of what the attack vector was?

Btw, I'm presuming in this that formatting works with one file, rather than needing the whole project. Does this hold true across languages?

I don't think it does. Many formatters will need another file where the formatting is specified (e.g. a .editorconfig file).

iHiD commented 2 years ago

I don't think it does. Many formatters will need another file where the formatting is specified (e.g. a .editorconfig file).

OK, cool. So we need to pass a directory as (1), not a file? And there's going to need to be some system to get the right files in again.

ErikSchierboom commented 2 years ago

Yeah, I'd say: pass in the directory like with the other tools, but additionally also pass in a filename.

SleeplessByte commented 2 years ago

Agreed with needing the directory.

@ErikSchierboom I didn't search, but I, coincidentally, had this open, which is the same type of exploit on a static analysis tool. Note that this one was because of how the plugin was written, not the underyling tool (eslint), but it's the same thing.

https://www.cve.org/CVERecord?id=CVE-2021-27081 https://msrc.microsoft.com/update-guide/en-US/vulnerability/CVE-2021-27081

junedev commented 2 years ago

I was really curious about this so I built a WebAssembly powered formatting for Go: go-fmt-wasm.netlify.app Repo with some explanation in the Readme: https://github.com/junedev/go-fmt-wasm

Would be great if someone could test this on Mac/iOS/Safari.

It was much easier than I thought. WebAssembly (and the support some languages provide for it) really came a long way in the past years. I will share some thoughts about what I think might be the challenges for making this work in production later.

ynfle commented 2 years ago

Looks great!

Works for me on Chrome and Safari on macOS.

Took a while for it to work the first time on both...

junedev commented 2 years ago

@ynfle Yes, I also had the feeling the initial wasm loading takes a while.

junedev commented 2 years ago

Some more thoughts on the WebAssembly version:

As mentioned, the initial load is quite slow.
Every language seems to have a somewhat different version of JavaScript glue code that is required to run the wasm file for that language. I could imagine that can be tricky to get all of the glue codes for the tracks into the website without interfering with each other.
Depending on the language, the glue code does not account for a file system replacement. That means if the tool you compiled to wasm would do file system operations, they would fail. In Go, I could circumvent the problem because the format tool is also provided as a package you can import not only as a command line tool. Not sure this is always the case though.

All in all, it seems still a bit early for doing the formatting in the respective language via WebAssembly. My gut feeling is that a lambda function/ docker image per language that the respective maintainers can maintain is probably the best starting point. I don't have any experience with those "format all the languages" tools mentioned above.

Go has a built-in formatter that everyone is supposed to use and requires no additional configuration file so might be an easy starting point. It would be fine with "one file in, one file out". Not sure how to go about finding out whether executing the formatter is save of not though.

I would like to challenge the assumptions that we need to pass in/provide a folder. Shouldn't everything besides the file to format be baked into the docker image already? Configuration files, all the packages needed etc. I can't really imagine a case where the formatting of a file would depend on other files in the folder or where the config would differ by exercise. :thinking:

SleeplessByte commented 2 years ago

I love that you "just build it". It works very well.

Initial load we can trick around, including making the button unavailable during first load.

iHiD commented 2 years ago

My gut feeling is that a lambda function/ docker image per language that the respective maintainers can maintain is probably the best starting point.

Just to pick up on this bit and provide context, running this via Lambda isn't possible if the tools aren't "safe" (which they won't be), as AWS reuses lambdas between sessions, so one person's code could wreck the container for the next person. So there's work in getting these running in a similar way to how test/runners/analyzers/etc work. Not huge work - but substantially more work involved than there would be if we could go down the Lambda route. What I'd probably do for this is take a simpler route than we do for submissions as we don't need persist the results etc. One thing we do need to ensure is that we don't accidentally become the "hosted formatting tools" platform for use outside of Exercism, else that could get expensive fast! So we'll need to find a way to guard against that too.

The WASM thing solves all this, which is why it's exciting as an approach. So if we can go with that, then that would be a massive win. (And it would also allow us to make reusable wasm components for this, which would be a cool wider OSS contribution!)

junedev commented 2 years ago

Ok, I didn't quite get that we already decided we need the secure test-runner like setup.

If we want to try the WASM route, I would be happy to pilot the WASM powered "Format" button for Go. As DJ said, we could visually indicate when it is ready to be pressed (disable it, show a spinner in the button, something like that). I don't know React, so someone who does would need to come up with a plan how best to integrate the glue code script and the script that initializes the wasm code into the website. Then we need a button of course and the wasm file needs to be accessible for the website, usually you put it where ever you put other assets like images. For the wasm file to work, you need to make sure it has the content type application/wasm when it is fetched. E.g. at work we put the assets on S3 (which then feeds the CDN), we needed to specify the content type when we upload the wasm file. Some systems apply the correct type automatically though.

iHiD commented 2 years ago

Maybe you, @SleeplessByte, and I pair on getting that done as a project over the festive break? It'll only take us an hour or so I imagine, but could be pretty fun! :)

SleeplessByte commented 2 years ago

Sounds like a plan!

bushidocodes commented 2 years ago

I took a look at the gofmt wasm demo. Very cool and seems to work well! My suspicion is that the initial page load is not really something that can be improved much using the standard Go toolchain. Programs that compile to wasm have to statically link the portions of their languages standard library that they use. That seems to be a fair bit for Go! It looks like hello world is ~2mb. This really varies based on the language. I did run wasm-opt (a binaryen-based compiler that can run additional optimization passes) pretty aggressively, and it only was able to shave off ~80k. I wouldn't be surprised if the Go tooling already runs this tool behind the scenes (many wasm compilers do).

Tiny Go seems to be a common means that Gophers use to generate smaller WebAssembly binaries. I'm not sure about the tradeoffs, but it might be worth taking a look to see if that can generate something functionally equivalent.

bushidocodes commented 2 years ago

For the wasm browser embedding environment, it's very true that there isn't a standard for what gets passed into a WebAssembly module. It's essentially ad-hoc and all projects generate code differently. WebAssembly on the server does have a standard API for all the normal systems type things called the "WebAssembly Systems Interface." This provides a sort of "syscall interface" and enables similar sorts of host capabilities as Docker or an OS process. Typically, there are different compiler targets for these things. Something like wasm32-unknown and wasm32-wasi. For the browser though, we'd essentially have to develop a standard "Formatter API" and then different tracks would have to develop the glue code to map this. If a formatter is hardcoded and is assumed to be run as a native process and depends on OS syscalls, things can get tricky! Emscripten essentially fakes out all these syscalls in their glue layer to make this work, mapping POSIX-style syscalls to browser APIs.

junedev commented 2 years ago

@bushidocodes Thanks a lot for having a look! I wasn't aware Tiny Go was related to WebAssembly at all but they even say that on the "box". :see_no_evil: According to their docs, they support the bit that we need:

I will check this out in more detail when I have some time.

Also thanks for that second comment, that confirms a lot of the suspicions I had when looking into this.

junedev commented 2 years ago

When using TinyGo as @bushidocodes suggested, the size of the wasm file drops from ~3MB to ~0,5MB. The time from initial page load to then the format button can be pressed was reduced from ~1s to ~0.5s. So this definitely improves the performance.

The TinyGo version is available here: ~~https://tinygo--go-fmt-wasm.netlify.app/~~ https://go-fmt-wasm.netlify.app/

The format button is now disabled until it is ready to be pressed and I added a section where it shows the loading time.

Caveat: TinyGo has a misconfiguration in their current version that leads to a stack overflow when one tries to format larger code samples. They fixed this here https://github.com/tinygo-org/tinygo/pull/2447 but they did not release a new version yet and their "build the dev branch on your own" instructions don't work for my machine. I expect the next release to be available in 1-2 month.

bushidocodes commented 2 years ago

Nice! It looks like it's only 166kb over the network due to compression. Pretty impressive speedup!

Bummer about the configuration issue! If you want to accelerate this, some tools have CLI options for stack time (stack-size=value for ld for example). This would be something like -Wl,stack-size=16384 with clang. It seems like go might expose this as -ldflags, but it seems like TinyGo might not work the same. If seems like you can use a custom target JSON file though, so you might be able to copy the JSON from the diff you posted above with the larger size into your demo repo and then use something like tinygo build ... -target ./target.json using the prebuild binary you've been using.

Possible Reference: https://github.com/tinygo-org/tinygo/issues/2108

junedev commented 2 years ago

@bushidocodes Thanks again for the help! What you suggested with the custom target configuration file worked perfectly. And it turned out that the increased value they set is not enough for our purposes so the new release wouldn't have solved our problem. Instead of doubling the original stack size I tripled it and now it works just fine even for very large chunks of code (~3000 lines).

I merged the tinygo version into main now so it's available under the main URL now: https://go-fmt-wasm.netlify.app/

Do you also have some input regarding proper error handling? Ideally an error should be thrown when the formatting does not work and then caught so that the website could indicate the failure to the user and report the error (e.g. to bugsnag). This is what I found how to do this: https://stackoverflow.com/a/67441946/5077542 Maybe you could have a look whether the explanation and code there make sense.

bushidocodes commented 2 years ago

Huh, looks like syscall/js is Go's equivalent of Emscripten glue. Pretty cool! The example basically is a wrapper that use JavaScript Promise conventions. I would imagine using syscall/js also generates a *.js file that passes the needed JS functions as WebAssembly imports that you would have to add to your app. The promise stuff makes sense from a JS perspective, but there are some software engineering issues to think through about how to handle the JS glue code since it would vary for each language. I think the solution would be to wrap the glue code into a language specific JS module that can get lazy-loaded on demand. Also, the glue code might be expensive if it doesn't tree shake functions that aren't used. For sure worth taking a look at.

Long term, there is also a phase 3 proposal to add native exception handling support to WebAsssembly. I assume that would allow thrown errors to compile to a WebAssembly primitive that the JS embedding environment could catch in a JS try/catch block without the glue code, but I've not yet used it so I'm not sure. I suspect the browser support matrix for this is very poor, but it does seem to be in Chrome. https://github.com/WebAssembly/exception-handling

exercism / exercism

Add a "Format" button in the Editor #5986