Module idea: WASI logging

dcodeIO commented 3 years ago

I have a use case where I'd love to get rid of a custom ABI in order to switch to WASI for portability purposes, but writing to file descriptors in UTF-8 encoding exclusively doesn't map very well to my use case. So I was wondering if WASI could spec out a logging module, independently of whether the console is a terminal, a browser console or in the future perhaps sends log data over a network if someone wants to.

I am asking because console usage is an unfortunate pain point in my use case currently (bundling encoders, frequent re-encoding into dynamic allocations, potentially GCed, double re-encoding on the web and such), while everything else (like abort, random, time, etc.) would map quite well to WASI already. Just having something that isn't UTF-8 FDs would help a ton to switch to WASI while having a good feeling about it.

I'd naively imagine something like:

enum LogType {
  LOG,
  DEBUG,
  INFO,
  WARN,
  ERROR
}

enum LogEnc {
  UTF8,
  WTF16
}

export namespace wasi {
  export function log(msg: usize, len: usize, type: LogType, enc: LogEnc): void;
}

I am aware that "logging" can be much more complex than what I outlined here of course. Perhaps "console" would be a better name, but "logging" could become more general. Also, Interface Types may eventually help here to reduce the number of arguments.
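For illustration, here is a minimal sketch of what a host-side (JS polyfill) implementation of such an import might look like. Nothing here is specified by WASI: the function names, the `sink` callback standing in for console.*, the enum values, and the choice to treat `len` as a byte length are all assumptions made for the example.

```typescript
// Hypothetical host-side polyfill for the proposed wasi.log import.
// Assumptions (not part of the proposal): `len` is a byte length, enum values
// follow declaration order, and `sink` stands in for console.* on the Web.
enum LogType { LOG, DEBUG, INFO, WARN, ERROR }
enum LogEnc { UTF8, WTF16 }

type Sink = (type: LogType, text: string) => void;

function decodeMessage(buffer: ArrayBuffer, ptr: number, len: number, enc: LogEnc): string {
  if (enc === LogEnc.UTF8) {
    // UTF-8 has to be transcoded into a JS string.
    return new TextDecoder("utf-8").decode(new Uint8Array(buffer, ptr, len));
  }
  // WTF-16: copy little-endian code units verbatim so lone surrogates survive.
  const view = new DataView(buffer, ptr, len);
  let text = "";
  for (let i = 0; i + 1 < len; i += 2) {
    text += String.fromCharCode(view.getUint16(i, true));
  }
  return text;
}

function makeLogImport(buffer: ArrayBuffer, sink: Sink) {
  return (ptr: number, len: number, type: LogType, enc: LogEnc): void => {
    sink(type, decodeMessage(buffer, ptr, len, enc));
  };
}
```

In a real polyfill, `buffer` would be the module's linear memory and `sink` would dispatch to console.debug, console.warn and so on.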

What do you think? Is this something worth exploring? (In general, I'd probably not have much to complain about for a while if only logging were a bit more Web-friendly. 🙂)

sunfishcode commented 3 years ago

Could you describe your goals here in more detail? Are you looking to explore whether WASI will be compatible with your goals in the long term, are you building a program or tool and looking to optimize how something runs on the Web in the short term, both, or something else?

dcodeIO commented 3 years ago

I am primarily interested in replacing the custom set of non-standard imports AssemblyScript has, like env.trace, env.seed, env.time, etc., with WASI, to integrate well with both WASI-enabled hosts and an interchangeable JS polyfill, ideally without the transition negatively affecting code size or efficiency. This should work fine with most APIs, but one unfortunate obstacle is console.log and friends, where using WASI Filesystem bloats both the module and the polyfill, which I'd like to avoid. For example, just logging a static string using WASI Filesystem currently triggers inclusion of full GC support in a module, which is not always what one wants. So I figured that WASI may be able to help by splitting, say, Logging out of Filesystem, which may be generally useful beyond my use case. Quite a long shot of course, and I am not overly familiar with what has already been discussed or whether something like this would fall into WASI's design space, so I thought I'd ask :)

sunfishcode commented 3 years ago

Logging in general seems straightforward to consider, and separating out functionality like this into modules is something we're already working on. I have concerns about WTF-16 string support though.

As I mentioned elsewhere, interface types currently look like the most likely answer to how to interchange strings in WebAssembly, so that's what we're preparing for in WASI.

Using UTF-8 for now aligns with interface-types' canonical representation, so it's the closest approximation to interface types that we can get for now. And, avoiding WTF encodings means that we won't need to worry about pieces of the ecosystem coming to depend on interchanging ill-formed data, causing compatibility problems when we start migrating to interface-types strings.

Would it work for your use case if we defined a logging API that only accepted UTF-8 strings for now? I recognize it'd have some overhead for your use case, but we'd plan to address that by migrating the API to interface types as soon as they become available.

Concerning the GC requirement, for the case of passing a string literal to a logging function, would it be feasible for the compiler to recognize this case, and convert the literal into UTF-8 at compile time?

Alternatively, is there a way in AS to do an "unsafe delete"? A logging API could guarantee to not let the pointer you pass it escape, so you could create a string, pass it to the log API, and then "unsafe delete" it afterwards, so it wouldn't need a full GC.

dcodeIO commented 3 years ago

> Logging in general seems straightforward to consider, and separating out functionality like this into modules is something we're already working on.

👍

> Alternatively, is there a way in AS to do an "unsafe delete"?

I guess there is more I can do, yeah, like resorting to malloc and free essentially for intermediate UTF-8 garbage, but that'll still trigger inclusion of the dynamic memory manager, which is one large dependency of GC. Doesn't really matter much anymore once the MM is included, I think.

> Would it work for your use case if we defined a logging API that only accepted UTF-8 strings for now?

Hmm, not sure. As far as I can tell, imposing UTF-8 on languages using a different native encoding is causing most of the problem.

What do you think of adding both a, let's say, logln and a logln16, with the latter scheduled for removal once IT "stream of char" or similar becomes available, or once double re-encoding at the API level is otherwise solved? In the browser polyfill, the 16 variant could then just forward to console.xy for the time being. I'd of course agree that an API like that isn't exactly nice, but perhaps it can be justified to avoid double re-encoding in the meantime? (I guess heavier APIs, like FS, that typically don't have a WTF-16 endpoint, are fine with just UTF-8 for now.)

jtenner commented 3 years ago

Imagine being a C# developer and not understanding that WASI supports their string encoding type. 😂

sbc100 commented 3 years ago

For the purposes of a logging interface, is the cost of reading a UTF8 string from an ArrayBuffer in JS really that different from reading a WTF16 string from an ArrayBuffer? Either way the JS string has to be created dynamically, right? Is the additional UTF8 translation to WTF16 while reading from the array really that slow? (Honest question, I have not measured it.)

Either way, logging interfaces should probably assume they could be writing to a filesystem (which they likely will be in many cases), which is generally a slow operation, and which I would have thought would dominate the UTF8 decode phase. No?

Regardless, discussions about encoding seem separable from the specific question of whether we should add a logging API.

dcodeIO commented 3 years ago

I agree that this is separable, yeah. Regarding your questions, this isn't entirely about performance. I guess the best one could do in their non-UTF-8 language is something like the following:

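// Fixed 256-byte scratch buffer in static memory, so short messages need no heap.
const staticBuf = memory.data(256);

export namespace console {
  export function log(msg: string): void {
    let size = computeUTF8Len(msg);
    if (size < 256) {
      // Small message: encode into the static buffer.
      encodeUTF8(msg, staticBuf);
      callWasi(staticBuf, size);
    } else {
      // Large message: needs a temporary heap allocation.
      let dynBuf = heap.alloc(size);
      encodeUTF8(msg, dynBuf);
      callWasi(dynBuf, size);
      heap.free(dynBuf);
    }
  }
}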

which eliminates the need for dynamic allocation of strings considered small. So, if the string is small, one would get:

1. Compute the size
2. Encode to UTF-8
3. Call out to WASI
4. Potentially repeat if the return value is again a string

which some may say is fine, while others may still be a bit unhappy, depends. Note that this already pulled in some code that is only necessary due to UTF-8 everywhere, and in general is not as efficient as it could be. The pain point, however, is not that, but that there is an else branch that may never execute, yet still leads to the following:

1. Compile the dynamic memory manager
2. Sadness

A typical compiler may not be able to apply sophisticated optimization in an attempt to DCE the dynamic memory manager post-compilation, in turn leading to every single module doing a console.log("the bug is here") shipping the heavy machinery. I agree that in the current state of affairs one could justify that, but I'd also understand if people would not be so happy about it.
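The computeUTF8Len and encodeUTF8 helpers in the snippet above are left undefined. As a rough sketch of the code that "UTF-8 everywhere" pulls into a UTF-16 language, they could look like the following in TypeScript. The signatures are assumptions (here the target is a Uint8Array rather than a raw pointer), and replacing lone surrogates with U+FFFD is one possible policy, not something the thread prescribes.

```typescript
// Sketch of the helpers named in the snippet above; signatures are assumptions.

// Reads the code point starting at code unit `i`. An unpaired surrogate is
// returned unchanged so the caller can decide how to handle it.
function readCodePoint(msg: string, i: number): number {
  const hi = msg.charCodeAt(i);
  if (hi >= 0xd800 && hi <= 0xdbff && i + 1 < msg.length) {
    const lo = msg.charCodeAt(i + 1);
    if (lo >= 0xdc00 && lo <= 0xdfff) {
      return 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
    }
  }
  return hi;
}

// Number of bytes the string occupies once encoded as UTF-8.
function computeUTF8Len(msg: string): number {
  let size = 0;
  for (let i = 0; i < msg.length; i++) {
    const cp = readCodePoint(msg, i);
    if (cp < 0x80) size += 1;
    else if (cp < 0x800) size += 2;
    else if (cp < 0x10000) size += 3; // also covers U+FFFD for lone surrogates
    else { size += 4; i++; }          // a surrogate pair used two code units
  }
  return size;
}

// Encodes into `out` and returns the number of bytes written.
function encodeUTF8(msg: string, out: Uint8Array): number {
  let o = 0;
  for (let i = 0; i < msg.length; i++) {
    let cp = readCodePoint(msg, i);
    if (cp >= 0xd800 && cp <= 0xdfff) cp = 0xfffd; // assumed policy for lone surrogates
    if (cp < 0x80) {
      out[o++] = cp;
    } else if (cp < 0x800) {
      out[o++] = 0xc0 | (cp >> 6);
      out[o++] = 0x80 | (cp & 0x3f);
    } else if (cp < 0x10000) {
      out[o++] = 0xe0 | (cp >> 12);
      out[o++] = 0x80 | ((cp >> 6) & 0x3f);
      out[o++] = 0x80 | (cp & 0x3f);
    } else {
      out[o++] = 0xf0 | (cp >> 18);
      out[o++] = 0x80 | ((cp >> 12) & 0x3f);
      out[o++] = 0x80 | ((cp >> 6) & 0x3f);
      out[o++] = 0x80 | (cp & 0x3f);
      i++; // skip the low surrogate
    }
  }
  return o;
}
```

Both passes walk the whole string, which is the "compute the size, then encode" cost described above.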

Now, even if one would attempt to DCE the MM, there is still the looming problem of what will happen with a polyfill in the browser, which is:

1. Compute the size
2. Encode to UTF-8
3. Call out to WASI (here: polyfill)
4. Encode back to WTF-16 for no good reason
5. Call a browser API
6. Sadness

Note that the latter will even be the case with the current state of Interface Types, but it has been mentioned that a "stream of char" may be able to solve this eventually. Let's see.

As I said, it still amazes me, and I am not mad or something, just trying to raise awareness towards the implications of UTF-8 everywhere that may perhaps not be on everyone's radar yet 🙂

P.S.S. I'd be happy with a logln16 for now, and then see how things develop, but a logging API is certainly useful even if my prayers for a temporary solution remain unanswered.

sbc100 commented 3 years ago

I don't think anybody here is suggesting you are mad.

Regarding the first part of your example (the part about the cost of including malloc), wouldn't it make more sense to always allocate such strings on the stack using alloca (or whatever language equivalent exists)? Does AS not have a stack in linear memory?

dcodeIO commented 3 years ago

Oh, sorry, didn't want to imply that someone suggested that. It's all fine, appreciate your input 🙂

And yeah, AS does not have a C-like stack (well, technically it has some sort of managed shadow stack now for incremental GC, but can't use it for this, it's all pointers). Instead, it exclusively relies on the Wasm execution stack in an attempt to avoid unnecessary stacks, but the Wasm execution stack is a bit limited and cannot be used for this either.

sbc100 commented 3 years ago

Would it make sense to add a region of heap like LLVM does for stack data? The convention that LLVM uses is a wasm global called __stack_pointer which grows down. I know it doesn't solve this entire problem but it does solve the first part. I agree including malloc for this kind of thing seems excessive.

sbc100 commented 3 years ago

(Doing so would also avoid stuff like const staticBuf = memory.data(256); which wastes memory and won't play nice with threads.)
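To make the suggestion concrete, here is a small TypeScript simulation of a downward-growing stack pointer used for a temporary logging buffer. It only models the idea: the memory size, region bounds, 16-byte alignment, and all function names are assumptions for illustration, not LLVM's or AssemblyScript's actual layout.

```typescript
// Simulated linear memory; sizes and addresses are illustrative assumptions.
const memory = new Uint8Array(1024);
const STACK_LIMIT = 256; // lowest address the stack may reach (e.g. end of static data)
const STACK_BASE = 512;  // initial stack pointer; a heap would start above this
let stackPointer = STACK_BASE;

// Reserve `size` bytes by moving the stack pointer down (alloca-style).
function stackAlloc(size: number): number {
  const aligned = (size + 15) & ~15; // round up to 16 bytes
  const next = stackPointer - aligned;
  if (next < STACK_LIMIT) throw new RangeError("stack overflow");
  stackPointer = next;
  return next;
}

// Release everything allocated since `saved` was captured.
function stackRestore(saved: number): void {
  stackPointer = saved;
}

// Copy already-encoded bytes to the stack, call the host, then pop.
// No malloc/free and no fixed static buffer are involved.
function logViaStack(bytes: Uint8Array, callWasi: (ptr: number, len: number) => void): void {
  const saved = stackPointer;
  try {
    const ptr = stackAlloc(bytes.length);
    memory.set(bytes, ptr);
    callWasi(ptr, bytes.length);
  } finally {
    stackRestore(saved);
  }
}
```

This relies on the host not retaining the pointer after the call returns, which matches the non-escaping guarantee suggested earlier in the thread.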

dcodeIO commented 3 years ago

AS uses __data_end, __stack_pointer and __heap_base, with a stack growing downwards, similar to LLVM, for the managed shadow stack, yeah. All a bit unfortunate, as the GC is precise and relies on all the data within the stack to be zeroes or pointers. One could technically implement just another stack, in a separate region, using memory.data, which is the AS equivalent of a static array, obtaining and blocking a slice of static memory, i.e. what then becomes a (data ...) segment. Possible I guess, but wondering if that could also be considered excessive for just quickly calling a WASI API. And still leaves us with double re-encoding, hmm.

sunfishcode commented 3 years ago

Many of the comments here seem to be talking not just about logging, but about WASI APIs in general.

To be clear about one thing: strings are not WASI's problem. They're WebAssembly's problem. And what's more, WebAssembly is already working on a solution. If anyone doesn't like it, WASI isn't the place to change it.

I don't think logln16 sounds like something WASI should do at this time, in part because of the risk of "yes is forever", and in part because of the risk of this spreading beyond logging. If we're going to have a new string convention across WASI, we should really have a plan for how we want it to work. And it turns out, not only is this already on everyone's radar, there's already a plan underway.

Please keep this issue focused on logging, and please be open to suggestions specific to logging APIs.

sbc100 commented 3 years ago

If we want to take inspiration from existing APIs, it might be worth looking at what Linux chose to do: https://man7.org/linux/man-pages/man3/syslog.3.html

We might also want to consider whether we are designing a system for debugging (which is what the web's console.log/error is generally for) or for event logging for things like servers and daemons, which tends to have a little more structure and is used in production builds. If it's the former, we might want to include the word "debug" somewhere in the name.

dcodeIO commented 3 years ago

The syslog interface looks good to me. Can map well to JS's debug, info, warn and error I think. Regarding simple debugging vs logging in production, perhaps both can use the same API, and we can add a flag along the lines of LOG_CONS (which is actually something else) to indicate to print to console? Alternatively, there may be separate log levels, not sure what's better.
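As a sketch of how that mapping might look in a JS polyfill: the eight severity values below are the ones syslog(3) defines, while the particular grouping onto console methods and the function names are assumptions made for this example.

```typescript
// syslog(3) severity levels, most to least severe.
enum SyslogLevel {
  EMERG = 0,
  ALERT = 1,
  CRIT = 2,
  ERR = 3,
  WARNING = 4,
  NOTICE = 5,
  INFO = 6,
  DEBUG = 7,
}

type ConsoleMethod = "error" | "warn" | "info" | "debug";

// One possible (assumed) grouping of syslog severities onto console methods.
function consoleMethodFor(level: SyslogLevel): ConsoleMethod {
  if (level <= SyslogLevel.ERR) return "error";
  if (level === SyslogLevel.WARNING) return "warn";
  if (level <= SyslogLevel.INFO) return "info";
  return "debug";
}

// Hypothetical polyfill entry point that forwards to the console.
function log(level: SyslogLevel, message: string): void {
  console[consoleMethodFor(level)](message);
}
```

Because several syslog severities collapse onto one console method, a host that needs the distinction would have to carry the original level alongside the message.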

sunfishcode commented 3 years ago

Is there a fundamental difference between logging for debugging and event logging in production, besides the log level and the consumers of the log messages? I agree that initially these seem different, but I haven't yet been able to think of a way that they're different from an application perspective.

If not, I think it makes sense to focus on figuring out what levels to have, and keep the API simple and general.

WebAssembly / WASI

Module idea: WASI logging #402