Investigate switching back to &[u8] from custom span

chipsenkbeil / vimwiki-rs

Rust library and tooling to parse, render, and modify vimwiki text and files.

Investigate switching back to &[u8] from custom span #56

Closed chipsenkbeil closed 4 years ago

chipsenkbeil commented 4 years ago

There are a couple of reasons to switch back:

Complexity of maintaining my span or the nom_locate span
Additional cost when slicing either span in terms of counting lines

Main reasons for custom input:

Provide line and column information to build a LocatedElement
Provide skippable regions for parsing

As it turns out, other languages such as Rust treat non-doc comments as a form of whitespace. I hadn't even considered identifying comment regions and replacing them with whitespace.
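A minimal sketch of that idea, assuming the comment regions have already been identified as byte ranges by some earlier pass (the `blank_regions` name and the range-based input are hypothetical, not part of vimwiki-rs). Each region is overwritten with spaces while newlines are kept, so every remaining byte keeps its offset, line, and column:

```rust
use std::ops::Range;

/// Overwrites each given byte range with ASCII spaces.
/// Newlines inside a region are left alone so line numbers do not shift,
/// and the output has the same length as the input so offsets do not shift.
fn blank_regions(input: &[u8], regions: &[Range<usize>]) -> Vec<u8> {
    let mut out = input.to_vec();
    for region in regions {
        // Clamp the range to the input so a bad range cannot panic.
        let end = region.end.min(out.len());
        let start = region.start.min(end);
        for byte in &mut out[start..end] {
            if *byte != b'\n' {
                *byte = b' ';
            }
        }
    }
    out
}
```

Because the output is the same length as the input, a plain `&[u8]` parser can run over it and still report positions that are valid in the original text.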

Additionally, as long as we have the original input, we can go back and compute line and column information based on an offset. As seen with nom_locate, you need a little unsafe code to go back to the beginning of a fragment using its offset. There are some optimizations we could do, such as pre-computing the newline positions, but we'd need to pass in some extra information.
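As a rough illustration of the pre-computed newline idea, here is a sketch of a lookup table built once from the original input. The `LineIndex` type is hypothetical, and it assumes 1-based lines and columns, with columns counted in bytes rather than characters:

```rust
/// Byte offsets at which each line starts, computed once from the input.
struct LineIndex {
    line_starts: Vec<usize>,
}

impl LineIndex {
    fn new(input: &[u8]) -> Self {
        // The first line always starts at offset 0.
        let mut line_starts = vec![0];
        for (i, &byte) in input.iter().enumerate() {
            if byte == b'\n' {
                // The next line starts right after the newline.
                line_starts.push(i + 1);
            }
        }
        Self { line_starts }
    }

    /// Returns a 1-based (line, column) pair for a byte offset.
    fn line_col(&self, offset: usize) -> (usize, usize) {
        // Binary search: counts how many lines start at or before `offset`.
        // This is at least 1 because the first line starts at 0.
        let line = self.line_starts.partition_point(|&start| start <= offset);
        let column = offset - self.line_starts[line - 1] + 1;
        (line, column)
    }
}
```

Building the index is one pass over the input, and each lookup is a binary search, so nothing is paid while slicing during parsing.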

chipsenkbeil commented 4 years ago

Still need to maintain at least an offset, which would involve a custom implementation of traits (ugh). Main reason to not use nom_locate is the upfront cost when slicing. If we calculate the line & column for an offset at the end, we can just run through the entire string one last time.

The challenge is knowing when we are at the beginning of a line, which I have a parser for and use for block elements.

Is it even needed? I don't think pandoc uses anything like this. We know that we start at the beginning of our first line, so as long as we consume the newline with each block element, every block starts at the beginning of a line.
Can we just look one character back (would require unsafe block or keeping around entire slice) to see if it's a newline?
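One way the "keep the entire slice around" option could look, as a sketch only: a span that stores the full input plus an offset, so looking one byte back needs no unsafe code. The `Span` fields and method names here are assumptions for illustration and leave out the nom trait implementations a real input type would need:

```rust
/// A view into the input that remembers the full slice and how far we are in.
#[derive(Clone, Copy, Debug)]
struct Span<'a> {
    full: &'a [u8],
    offset: usize,
}

impl<'a> Span<'a> {
    fn new(full: &'a [u8]) -> Self {
        Self { full, offset: 0 }
    }

    /// The bytes that have not been consumed yet.
    fn remaining(&self) -> &'a [u8] {
        &self.full[self.offset..]
    }

    /// Returns a span advanced by `count` bytes, clamped to the end of input.
    fn advance(&self, count: usize) -> Self {
        let offset = self.offset.saturating_add(count).min(self.full.len());
        Self { full: self.full, offset }
    }

    /// True at the very start of the input or right after a newline.
    fn at_line_start(&self) -> bool {
        self.offset == 0 || self.full[self.offset - 1] == b'\n'
    }
}
```

The cost is one extra `usize` per span, and slicing stays cheap because nothing is counted when the span advances.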

chipsenkbeil commented 4 years ago

Nearly done. Span is implemented and at first performance tanked, but it turns out this was in large part due to calls to find the current line and column, which are expensive and whose results may even be thrown away.

While I'm not sure how often the result is thrown away, I do know that computing it is expensive. With #57 (AST), we have control over an intermediate type that maintains the offset (similar to LocatedElement), and when transforming the AST into the actual elements, we calculate the line and column at that point.

The advantage here is that we could provide some sort of state-based converter that walks through the input, keeping track of line locations relative to their offsets.
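A state-based converter along those lines might look like the sketch below. It assumes offsets are requested in non-decreasing order (for example, while walking the AST in document order), so the input is scanned only once in total. `LineColTracker` and its fields are hypothetical names:

```rust
/// Walks the input once, resolving ascending offsets to (line, column).
struct LineColTracker<'a> {
    input: &'a [u8],
    pos: usize,        // how far the input has been scanned
    line: usize,       // 1-based line number at `pos`
    line_start: usize, // offset where the current line begins
}

impl<'a> LineColTracker<'a> {
    fn new(input: &'a [u8]) -> Self {
        Self { input, pos: 0, line: 1, line_start: 0 }
    }

    /// Returns the 1-based (line, column) for `offset`, with the column in bytes.
    /// Returns None if `offset` is past the input or lower than a previous query.
    fn locate(&mut self, offset: usize) -> Option<(usize, usize)> {
        if offset < self.pos || offset > self.input.len() {
            return None;
        }
        // Only scan the bytes between the last query and this one.
        for i in self.pos..offset {
            if self.input[i] == b'\n' {
                self.line += 1;
                self.line_start = i + 1;
            }
        }
        self.pos = offset;
        Some((self.line, offset - self.line_start + 1))
    }
}
```

If offsets can arrive out of order, a pre-computed table of line starts with a binary search is the usual alternative.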

chipsenkbeil commented 4 years ago

Sped things up. The main gain came from being able to avoid line/column calculations. Since I know vim supports retrieving a byte position, I think the easiest thing to do is support a byte offset for lookup instead of line/column position. Much, MUCH cheaper.
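To make the byte-offset lookup concrete, here is a sketch that assumes each parsed element carries the byte range it came from and that elements are sorted by start and do not overlap. `Located` and `find_at_offset` are illustrative names, not the library's API:

```rust
use std::ops::Range;

/// An element paired with the byte range it was parsed from.
struct Located<T> {
    region: Range<usize>,
    element: T,
}

/// Finds the element whose region contains `offset`.
/// Assumes `elements` is sorted by region start and regions do not overlap.
fn find_at_offset<T>(elements: &[Located<T>], offset: usize) -> Option<&Located<T>> {
    // Skip every element that ends at or before the offset.
    let idx = elements.partition_point(|e| e.region.end <= offset);
    // The candidate only matches if it has already started at the offset.
    elements.get(idx).filter(|e| e.region.start <= offset)
}
```

No line or column is ever computed here: the editor supplies a byte position and the lookup is a binary search over ranges.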