Safe parser - alternative API with &str parameter in read()

mlwilkerson commented 3 years ago

closes: #6

The alternative API in this PR requires that the content passed into the top-level read() function is already a &str. Therefore the client owns the memory allocation, and this &str can be passed around the rest of the framework safely.

It seems to have the advantage over #39 that there's no extra copy being made in memory, and that there's again only one parameter to read().

I prefer this one over #39.

IvanUkhov commented 3 years ago

Thank you for this pull request and #39! It has already been quite some time since the problem was discovered in #6, and no solution that would preserve the current interface has been found so far.

With this alternative, I am a bit concerned that the end user would have to manage the memory on their own. It might not be a problem if the typical scenario is that one does the whole parsing in a single function without trying to pass things around.

I have tagged both with “feedback needed” for gathering some feedback from others.

IvanUkhov commented 3 years ago

I have stumbled upon this leaky alternative by @Skynoodle:

https://github.com/Skynoodle/svg/commit/e65faa79a34a4bccc886f03c236da0a469cf4925

But it seems, for complete correctness, the API has to change. I will then merge this pull request if there are no objections.

Skynoodle commented 3 years ago

My branch was mostly just me trying to figure out the lifetimes involved and coming to the same conclusion that a correct, non-leaking fix would require either this sort of API change, or changes to various reference types returned elsewhere. Fixing it properly like this is definitely preferable, and the changes here all look good to me 👍

(To expand slightly on the alternative api changes based on my understanding: where currently &'l _ references are passed out referring to the buffer, guard objects implementing deref would likely be needed instead, and something like reference counting in the case of an owned buffer would be required)

mlwilkerson commented 3 years ago

@Skynoodle I don't follow your comment here (I don't disagree, I just don't understand yet.) It sounds like you're referring to Rust concepts that are just at the edge of my Rust knowledge, so I'm not sure where they apply here.

where currently &'l _ references are passed out referring to the buffer, guard objects implementing deref would likely be needed instead, and something like reference counting in the case of an owned buffer would be required

Could you expand a little more? I'm curious.

Skynoodle commented 3 years ago

Could you expand a little more? I'm curious.

Sure! To be clear, that aside in my comment doesn't directly apply to this MR. I was considering an alternative set of API changes which still allowed passing in an owned buffer and getting an arbitrarily long lifetime 'l back out. The difficulty then is that 'l might outlive the Parser itself, but various methods return things like slices into the buffer with lifetime 'l. One would either need to shorten the lifetime of these returned references to at most the lifetime of &self for those methods, or return a guard structure instead of an &'l _. The possibly-owned buffer would then need to be kept alive at least as long as the lifetime of the Parser and any outstanding guards handed out. These guards would themselves have lifetime at most 'l, so in the borrowed-buffer case could be trivially sound (like the existing references), but in the owned-buffer case would need to use something like reference counting to avoid the owned buffer being destroyed early. By implementing Deref for these guards, an API with the same spirit as the existing one would be provided (and should "just work" in most of the existing API's correct uses).

This sort of approach would probably add substantial complication to a lot of the Parser implementation, though, so without strong motivation for owned-buffer usage the option in this MR seems better.

mlwilkerson commented 3 years ago

Thanks for the expansion.

I think I understand better, but not sure I get it all. Check my thinking?

passing in an owned buffer and getting an arbitrarily long lifetime 'l back out

Passing in an "owned buffer", say to the Parser::new() function, would mean: passing ownership from the client code to that function. Right?

Or does "owned buffer" refer more to returning from open() or Parser::new() a buffer which the calling code would subsequently own?

or return a guard structure instead of an &'l _

I don't think I've seen "guard structure" in the Rust literature I've read. Are you thinking of something like a Trait Bound? Or structures like Rc or Arc? (That's what I thought you may have meant after mentioning reference-counting.) Or something else?

Skynoodle commented 3 years ago

Passing in an "owned buffer", say to the Parser::new() function, would mean: passing ownership from the client code to that function. Right?

Or does "owned buffer" refer more to returning from open() or Parser::new() a buffer which the calling code would subsequently own?

I meant "owned buffer" in the sense of the Parser itself owning rather than borrowing the buffer - which both implies ownership of the buffer (rather than a borrow) is passed into Parser::new and that ownership of the buffer follows the returned Parser around afterwards (at least conceptually - the guard structures I mentioned make this a little less clear-cut).

Edit: Parser owning the buffer would itself be the result of the client code passing ownership of the buffer to the new function

I don't think I've seen "guard structure" in the Rust literature I've read. Are you thinking of something like a Trait Bound? Or structures like Rc or Arc? (That's what I thought you may have meant after mentioning reference-counting.) Or something else?

It would be functionally very similar to an Rc/Arc here, I think (it would probably be implemented using an Rc/Arc for the "owned buffer" case). Rcs could be described as a guard structure in a sense - they guard against the contained value being dropped as long as you hold an Rc. Here, I'm using guard in the same sort of sense as MutexGuard - ensuring that invariants are satisfied while the structure is held. For MutexGuard the invariant is that the lock is (exclusively) held. In our case, because "conceptually" the returned value isn't really meant to own the buffer (it's just a workaround to prevent the Parser from incorrectly dropping it early) I saw it a little more like a guard (upholding the invariant that the underlying buffer is still valid) than true shared ownership like is usually implied by Rc.

Edit: in the borrowed case,'l would be a meaningful lifetime and the guard would then "just" be essentially a `&'l - so the guard would have different internal behaviour depending on whether the originalParser` owned or borrowed its buffer_

mlwilkerson commented 3 years ago

I meant "owned buffer" in the sense of the Parser itself owning rather than borrowing the buffer

got it.

Here, I'm using guard in the same sort of sense as MutexGuard - ensuring that invariants are satisfied while the structure is held.

I see. I haven't used MutexGuard before. Cool, I'll read up on that.

One part of what intrigued me about your comment, when you mentioned reference counting, is that one of my experiments had been to use Arc. In my understanding that woulda/shoulda allowed for multiple ownership. But after chasing my tail trying to make the compiler happy about how I was trying to use that, I concluded that I must have been misunderstanding something fundamental. For you to then mention reference counting makes me wonder if perhaps some part of my intuition was right about that, and yet I just hadn't yet found the right implementation. (I'm still trying to learn how to use Arc appropriately.)

Skynoodle commented 3 years ago

I see. I haven't used MutexGuard before. Cool, I'll read up on that.

One interesting part of guard types is their existence is often almost invisible - in the case of a MutexGuard, you obtain one by calling mutex.lock() and, because the MutexGuard implements Deref and DerefMut, proceed to treat the guard mostly as though it's a plain &mut _, even though the MutexGuard is needed to unlock the mutex at the right time and prevent use of the inner value outside of a current lock.

One part of what intrigued me about your comment, when you mentioned reference counting, is that one of my experiments had been to use Arc. In my understanding that woulda/shoulda allowed for multiple ownership. But after chasing my tail trying to make the compiler happy about how I was trying to use that, I concluded that I must have been misunderstanding something fundamental. For you to then mention reference counting makes me wonder if perhaps some part of my intuition was right about that, and yet I just hadn't yet found the right implementation. (I'm still trying to learn how to use Arc appropriately.)

It would still be rather fiddly, since holding an Arc and a reference into that Arc at the same time in a structure would also be self-referential so still awkward with Rust's rules. One could work around that in one of the usual ways - either by holding an Arc and indices in the guard structure, or guaranteeing that the contents of the Arc are pinned in memory and use raw pointers with unsafe, only producing the required references in Deref itself.

IvanUkhov commented 3 years ago

Thank you all!

bodoni / svg

Safe parser - alternative API with &str parameter in read() #40