Get end of module span - Githubissues

eqrion commented 3 years ago

We're now going to use this crate to parse .wast scripts and convert them to specialized .js for testing in SpiderMonkey. [1]

One issue I ran into was how to output modules into JS. Ideally we'd output the module in text form, for easy debugging. The problem is how to get from a wast::Module to a text module.

I looked into two ways of doing this:

wasmprinter::print(module.encode()) - this has the disadvantage that we lose info, such as $idNames. Additionally, some spec tests are changed semantically when round-tripped like this.
Use module.span and a handwritten scanner to mind the end of the S-expr for the module in the .wast file, then copying the string from the full-span.

(2) is the approach I went with and it works surprisingly well, once the scanner was written (look for closed_module in the phab diff. But I'm a little worried this is fragile, and it feels like getting the end of the span from the parser would be a better solution.

[1] https://phabricator.services.mozilla.com/D111306

alexcrichton commented 3 years ago

I never really figure out the best way to integrate spans in wast, they're generally just a glorified usize newtype-wrapper at this point. I'm not really sure how much it makes sense to enhance them at this point since it seems like it may just be a cost without much gain.

That being said solving this issue y'all have should be easy enough, we could record, at least for a top-level module, the start/end parentheses tokens and the spans associated with them, and then if we expose the offset within the Span (which we probably should do to avoid you having to parse lines/newlines/columns/etc) then it should be possible to subslice the original string with the module start/end span?

alexcrichton commented 3 years ago

oops didn't mean to close

eqrion commented 3 years ago

Yeah, I agree that there wouldn't be much point in improving Span generally. I'd be worried about a memory size regression if we added another usize to it. Specially tracking the begin and close span's of a module would work well, I think.

One complication is that sometimes a Module is parsed without an opening (module and therefore no closing ). I don't remember where in the testsuite I ran into this, but I had to tweak the module scanner to handle it.

alexcrichton commented 3 years ago

Well I suppose another option is to say "well surely *.wat files are all less than 1G" and have Span be two u32 values to make it an equivalent size as it is today. Currently "span" is actually a misnomer because it doesn't track the end...

That'd also solve the issue where (module wasn't present because you could always join two spans into one. Something like Module could keep track of where it started parsing and where it stopped parsing to ensure that it's always got a full span for all the items

bvisness commented 1 year ago

I'm working on the SpiderMonkey test generator @eqrion mentioned, and a sizable chunk of our program's runtime is spent in Span::linecol_in. This is essentially quadratic behavior; only the offset is saved during parsing, and looking up line/column later requires another pass through the entire file. It seems like it would be pretty easy to just save the line/column during the initial parse?

(I realize this is not exactly the same issue @eqrion is reporting, but I imagine it would go hand-in-hand with any improvements to Spans.)

alexcrichton commented 1 year ago

In the test suite of this repository I ended up only calling linecol_in in the error path since it was so expensive, but I presume that's probably too costly since you're generated code that has the filename/line number in it?

Otherwise though the main worry here is the memory impact of increasing the size of Span since it's stored just about everywhere. If you can't measure much impact though then improvements are probably fine.

Alternatively I think it'd be reasonable to build a table-like data structure which could be used for faster lookup of file/line information, basically do a linear scan of the input for newlines, insert that into a sorted array, and then linecol_in, or another method, would take this as input and do a binary search to figure out the line number.

bvisness commented 1 year ago

I'll give it a try tomorrow. I doubt the memory impact will be significant enough to warrant a side data structure that would take up space on its own, plus implementation complexity. (Besides, this seems like an easy memory-versus-time trade off, to me.)

bytecodealliance / wasm-tools

Get end of module span #258