extendr / extendr

R extension library for rust designed to be familiar to R users.
https://extendr.github.io
MIT License
425 stars 43 forks source link

The doubly-linked-list approach on protecting R objects? #468

Open yutannihilation opened 1 year ago

yutannihilation commented 1 year ago

I think I now understand how the code on this file works; extendr links all the R objects in use on Rust's side to a large VECSXP to protect them from being GC-ed by R accidentally.

https://github.com/extendr/extendr/blob/master/extendr-api/src/ownership.rs

This mechanism was introduced by https://github.com/extendr/extendr/pull/175, and then improved by https://github.com/extendr/extendr/pull/398. But, I'm wondering if we should switch to the doubly-linked approach, which is adopted in cpp11. Here's some references about the approach:

The possible advantage of this approaches is that, since there's no limit (other than the memory) on the number of R objects, we don't need to do GC, and need to worry less about issues like https://github.com/extendr/extendr/issues/397.

On the other hand, I'm not sure whether this is good or bad in terms of performance. We don't need to allocate the big vector, but we need to add tags every time a new R object is created. Anyway, I decided to file an issue for discussion, as I cannot find any existing issue filed here.

multimeric commented 1 year ago

In my understanding, these two are not what you can compare. PROTECT is to prevent the objects from getting accidentally GC-ed during the function call. The object must be UNPROTECT-ed before the end of the function. If you want to keep protecting the objects, you have to get it referenced from another SEXP under the protection of R_PreserveObject(). WRE and references in https://github.com/extendr/extendr/issues/468 might help.

I know PROTECT is only during the function call, but why would we want to keep protecting objects after they're returned? We want to keep objects alive if they're being referenced in the user's environment, but collect them if they're not, which is exactly what should happen by default. I don't think R C code has a preservation mechanism other than PROTECT during function calls??

yutannihilation commented 1 year ago

I think it's a different topic than this issue itself?

multimeric commented 1 year ago

It's related though. You're looking into optimizing this protection mechanism, I'm trying to understand it first.

yutannihilation commented 1 year ago

We want to keep objects alive if they're being referenced in the user's environment, but collect them if they're not

As you can easily imagine, that's not always the case. For example, a function might return an external pointer to a struct containing SEXPs. But, honestly I don't have enough knowledge to answer your question. I've been studying cpp11's source code, but yet to get a sense of when to use PROTECT and when to use the doubly-linked list.

I don't think R C code has a preservation mechanism other than PROTECT during function calls??

R_PreserveObject() / R_ReleaseObject(). Did you read the section of WRE that I pointed?

yutannihilation commented 1 year ago

Anyway, I'm closing this issue because I found this approach probably won't be suitable for extendr. My take can be found here:

https://github.com/yutannihilation/unextendr/blob/ec88a43411a54a720b4071b7c0adbfe597fda18e/src/rust/src/protect.rs#L1C1-L26

multimeric commented 1 year ago

As you can easily imagine, that's not always the case. For example, a function might return an external pointer to a struct containing SEXPs.

Thanks, that's a good counter-example, but I would have thought that's a rare enough case that should be handled explicitly instead of automatically and implicitly.

R_PreserveObject() / R_ReleaseObject(). Did you read the section of WRE that I pointed?

Yes. Sorry what I meant was that most R + C code only uses PROTECT(), it's not an automatic requirement that everything be protected via R_PreserveObject(). So consequently it's surprising for me that all Robj in extendr are basically protected by a similar mechanism.

This would be a great thing to discuss on the Discord. We are chatting about this before we formalise anything into issues that need actioning. Using discord might also satisfy your preference to keep discussion in issues strictly about those issues.

yutannihilation commented 1 year ago

Thanks, but Discord is not my preference, unfortunately. I'll move away from extendr soon as it's becoming clear that joining Discord is mandatory to be a member of extendr. I think I'm not eligible.

multimeric commented 1 year ago

Using Discord is certainly not mandatory, and you are welcome to your preference. I'm just struggling to reconcile this need to discuss extendr's concepts like the protection system with GitHub issues, which normally mark a specific problem that needs action. Perhaps a Discussion thread would be better?

CGMossa commented 1 year ago

@yutannihilation i would personally be very sad if you left the project due to discord. Your contribution is invaluable.

I've read this issue several times. And I've also investigated this as much as I can. Can you tell me why you don't think this approach is suited for the rust code? It certainly seems apealing at least to me.

Incurring a hash cost for every protect and unprotect is costly.. Even if we argue it is o(1). Then there is the reallocation step that must happen when a threshold is met.

I've read your linked source, and yes the translation unit thing is a c++ thing, unwinding doesn't concern us as well.. But what is it that I'm not reading?

yutannihilation commented 1 year ago

But, honestly, I feel it will be simpler if you make Discord mandatory, no? You can discuss and make decisions quickly there. It's just my English is too poor to keep up with the speed.

Can you tell me why you don't think this approach is suited for the rust code?

It should work for the Rust code as well, but I'm just not sure what design decision is behind extendr's current implementation using HashMap. My best guess is to support parallelism (HashMap is Send + Sync). If so, linked-list might be a wrong option.

CGMossa commented 1 year ago

Discord will not be mandatory ever.

I'm personally involved in another project, where discord conversations have too much power, and we are not going to devolve into that.

I agree that there may have been an ambition to make extendr multithreaded. But everywhere there is interaction with R API, single_threaded is used. I'd like to explore the linked-list approach, especially now since you've uncovered so much of it. Thus, if you'd allow it, I'm reopening this issue @yutannihilation.

Question: As far as I understand it, the hashmap we use goes out of scope, when we return to R context, thus there is no need to store the preserve list anywhere, like they do in cpp11. I just don't see where in the code it says that. And if that is indeed true, then Ownership needs a Drop impl.. To drop the preserve list or in this case the preserve double-list.

We also primarily interact with users of Extendr, and they help uncover these things. They have been very helpful, responsive on the discord, but they tend not to post too much on Github. Even if we constantly encourage that they do. Finally, I'd like to say that I receive a lot of help and mentoring on Discord.

If you don't mind, I think we should start using Discussion in GitHub to discuss these issues, as we are obviously stronger when we collaborate, rather than part ways due to obstruction of communication.

yutannihilation commented 1 year ago

Reading https://github.com/extendr/extendr/issues/498#issuecomment-1531142975 again, it seems extendr has no option but having HashMap to track the reference count (to be clear, it should be possible to replace the large VECSXP with the doubly-linked list). Could extendr have prohibited Clone for Robj? But it's still not perfect in that the user might create two different Robj that point to the same SEXP, which will result in accidental unprotect on Drop, though I'm not sure if this can actually happen.

Also, I'm not immediately sure what makes the design such different than cpp11. Maybe distinguishing the owned (writable) SEXP from the one brought from outside?