Closed RazrFalcon closed 1 year ago
I think the discussion from #71 applies, i.e. this is significantly hurting the case where I do not want to take ownership of the strings by adding reference counting and it suddenly makes Document: !Send
which would for example significantly hurt ergonomics for myself using roxmltree in multi-threaded async server code. The alternative of using Arc<str>
would hurt the single-threaded case and code not interested in taking ownership even more.
So in summary, I think shared ownership via reference counting moves too far away from roxmltree's original design goals. And while it will be much harder to implement, I think providing on-demand normalization which would return Cow<'input, str>
by value anyway would be preferable to the approach presented here.
As a technicality, we should probably use Rc<str>
instead of Rc<String>
since long-term the shared text is immutable anyway, but of course this prevents us from using &mut String
via Rc::make_mut
while we build up the text values. (But that part of the parser is a bit of a code smell anyway IMHO as it is the one place where we modify (instead of just building) the read-only tree.) ((As an aside, we have a &mut SharedString
and hence &mut Rc<String>
there, so it should be possible to use Rc::get_mut
instead of Rc::make_mut
.))
Are you sure about the performance? I just checked and it's identical. After all, we're rarely allocating strings.
Arc
has ~1% performance penalty, but it's neglectable.
Overall, I agree that this a complicated design topic. I have a use case (resvg
), where currently I allocate almost all string anyway. Because dealing with lifetimes becomes too annoying/impossible.
But even in this case using shared strings will improve the performance, since large attributes almost always have new lines, i.e. normalized, and I'm needlessly reallocating them.
I've tried storing SharedString
directly in my own data structure, but lifetimes are a mess. In the end, I wasn't able to compile my code. So while it would be the perfect solution, the implementation would require some extreme lifetimes gymnastics. Afaik I would need three lifetimes to achieve what I want.
And as you're already mentioned, delayed normalization would require preserving too much info, making everything way too complicated.
Anyway, I'm fine with Arc
, if this is satisfies your use case. I think this change is a good middle-ground.
Are you sure about the performance? I just checked and it's identical. After all, we're rarely allocating strings. Arc has ~1% performance penalty, but it's neglectable.
Not without measuring. Just two caveats I see:
&'a Cow<'input, str>
falls under the same argument without the costs of reference counting.Overall, I agree that this a complicated design topic. I have a use case (resvg), where currently I allocate almost all string anyway. Because dealing with lifetimes becomes too annoying/impossible. But even in this case using shared strings will improve the performance, since large attributes almost always have new lines, i.e. normalized, and I'm needlessly reallocating them.
You mean that you end up storing Arc<str>
in your structures, i.e. bumping the reference count for normalized attribute values and allocating/copying for unnormalized attribute values? Doesn't this pessimize the unnormalized attribute values? What if they are much more common?
The cache coherence traffic introduced by using atomics is harder to measure in isolated benchmarks as it measuring it depends on produce load on multiple cores.
I understand that Arc
isn't free, but I cannot imagine a use case where it would become a bottleneck. Especially compared with XML parsing itself.
You're optimizing for an unrealistic case.
then cloning an exposed &'a Cow<'input, str> falls under the same argument without the costs of reference counting
But it will reallocate a string, while Arc is just a ref bump.
You mean that you end up storing Arc
in your structures, i.e. bumping the reference count for normalized attribute values and allocating/copying for unnormalized attribute values?
Yes. Before I was allocating all strings, so it couldn't get any worse than that. Sure, it would be great to keep original &str
, but I couldn't solve myriad of lifetime issues.
And even if I would be able to solve lifetime issues, I would still benefit from Arc
and not Cow
.
Honestly, it seems like the easier solution would be to simply have:
enum SharedString {
Borrowed(Range<usize>),
Owned(Arc<str>),
}
This way we can forget about lifetimes completely :/
You're optimizing for an unrealistic case.
As discussed above, I think with on-demand sharing, we can have our cake and it too. When Arc<str>
is created only for those strings on which shared_value
is actually called, I would have absolutely no objections.
I do think fn make_shared(&mut self) -> Arc<str>
should become a method on Text
tough as we'll need for more than just attribute values. I do not think that enum Text
needs to be public though as we never expose it in the API if we have only methods like shared_value(&mut self) -> Arc<str>
.
This still uses into_cow
for going from BorrowedText
to SharedString
which does not really fit. I would suggest using Text
instead of SharedString
and thereby into_text
for the method. Making it an implementation detail and putting it next to BorrowedText
would seem helpful to me as well.
I will put in on hold until the next weekend.
I still do not see how Arc
can become a bottleneck, since 99% of the time would be spent on XML parsing alone. And attribute value normalization is triggered pretty rarely.
Anyway, in my use case I still cannot store SharedString
directly, because of lifetime issues. And unless I will be able to solve them I would not need this change anyway.
I guess returning just a &str
is our only option. Otherwise it's either Arc
or mutable tree.
Ok, I was able to beat lifetimes. Now this patch works perfectly for my use case. So I will probably merge it on the next weekend.
Cow
is no go for me, because cloning would be way too expensive.
This leaves us with normalization on access, by returning Cow<str>
instead of &str
, which is meh.
Or something else...
Also, I think it would be a good idea to make Arc optional, via a build feature. To switch between Rc
and Arc
.
I am resigned to Arc<str>
now personally, the only open discussions I see here are comments on the current implementation. I would be happy if they could be addressed before merging this.
Also, I think it would be a good idea to make Arc optional, via a build feature. To switch between Rc and Arc.
If we do that, we could theoretically have two features, e.g. rc
for non-thread-safe shared text and arc
for thread-safe shared text and just Owned(String)
if neither is enabled.
Yes, I will address your comments.
Yes, I guess we can do this. A better solution would be to have Document
generic over the string storage type, but it would make code even more complex.
I think I have addressed everything. I do not plan merging it until the weekend anyway.
@adamreichold how about this regarding #88 ?
Sure, it makes allocated strings a bit slower and introduces indirection. But it's the easiest way I can think of to preserve strings after dropping a
Document
.