erlang / otp

Erlang/OTP
http://erlang.org
Apache License 2.0
11.37k stars 2.95k forks source link

ERL-70: NIF resources are not checked on module re-/un-load #3296

Closed OTP-Maintainer closed 3 years ago

OTP-Maintainer commented 8 years ago

Original reporter: bucko909 Affected version: OTP-18.1 Component: erts Migrated from: https://bugs.erlang.org/browse/ERL-70


While playing with implementing a NIF, I found some segfaults, and I eventually got it down to the test case here:

https://gist.github.com/bucko909/a841c716ede6d3903a13 (also attached as .zip).

It looks like it's down to my not re-registering the resource on upgrade (presumably the handle goes stale, is garbage collected, and eventually it corrupts memory causing segfaults in unrelated emulator code).

I fell into this trap by using code from https://github.com/davisp/nif-examples -- which I've sent a pull request to fix.

I fixed my problem by adding enif_open_resource to the upgrade function once I'd clocked my error, so under normal and correct use, I think the emulator is doing OK.

However, it looks like if I _don't_ reopen it, it's not properly deleted, and the documentation seems to leave open the possibility of doing just this ("Existing resource objects, of a module that is upgraded, must either be deleted or taken over by the new NIF library"). References to resources with the old handle remain uncleaned. Even if I completely destroy the old module, so that unload is called, these stale resources persist until a garbage collection. They actually survive _many_ purge/load cycles in my example code before being garbage collected and segfaulting the emulator.

Ideas, based on my interpretation of the bug:

If there are lingering resources, which are not TAKEOVER-ed in the upgrade function, and have a dtor, this should cause an immediate emulator panic. I can't think of any other behaviour which is safe here. If they don't have a dtor, it seems safe to keep them around, but their resource handle needs to be kept alive until they are all destroyed. It ought to be impossible to create new resources using the old handle, at least when there is a dtor defined (can a 'dead' flag be set?).

Knowing this behaviour, an application author writing an upgrade function for this NIF library might at least attempt to destroy all of his objects when making such an upgrade, in order to have the emulator survive!

Another approach is to require an _explicit_ delete of old resources, perhaps simply a call to "enif_delete_unused_resources" or an iteration of "enif_delete_resource" over "enif_list_resources", and have this call fail where the old resources are still allocated. Perhaps the library author could force a purge or panic the emulator themselves at this point. The emulator should panic if a resource is neither deleted nor reopened with TAKEOVER.
OTP-Maintainer commented 8 years ago

sverker said:

Hi

As you said, the attached test code is faulty, as it reuses an old resource type handle without calling enif_open_resource_type in upgrade().

It's perfectly ok to *not* call enif_open_resource_type in upgrade(), if the upgraded module does not need the resource type to identify existing resource objects.

The current solution is to postpone unloading of a NIF library until all existing resource objects, of all its abandoned resource types with destructors, have been garbage collected. That is, a NIF library will not be unloaded until we know all calls to its destructor functions have been done. You should able to observe this as the unload() callback is also postponed until the library is actually closed.

If this is really true:

"Even if I completely destroy the old module, so that unload is called, these stale resources persist until a garbage collection."

...that would indicate a bug as unload() should not be called until all "stale resources" with destructors
have been garbage collected. Some test code to reproduce that would be neat.

/Sverker, Erlang/OTP
OTP-Maintainer commented 8 years ago

bucko909 said:

Yep, it looks as though you're correct. Even if I force a new copy of the nif library with a new filename, the old dtor is called correctly on the old resources. I swear I /did/ produce the effect above, but I can't do it now, so I was probably just imagining it or using a stale resource handle to make a new resource.

It'd be great if creating invalid resources in this way caused an error, but I suppose my example code was reading from an invalid memory address anyway when it used the stale handle, so there's not much that can be done.