erlang / otp

Erlang/OTP
http://erlang.org
Apache License 2.0
11.38k stars 2.95k forks source link

ERL-240: NIF reload being called still? Is it not deprecated? #3373

Closed OTP-Maintainer closed 3 years ago

OTP-Maintainer commented 8 years ago

Original reporter: vans163 Affected version: OTP-19.0 Fixed in version: OTP-19.1 Component: Not Specified Migrated from: https://bugs.erlang.org/browse/ERL-240


R19.0.5

stdlibs l/1 does
code:purge(jiffy).
code:load_file(jiffy).

This procs upgrade on the module if its a c nif. As expected.

But now if you do:

code:purge(jiffy).
code:delete(jiffy).
code:load_file(jiffy).

This procs reload on nifs.

Why is reload even procing? This should call upgrade no?
Reload says its deprecated, there should be no way it gets called in newer releases?

This is problematic because the reload takes precedence over upgrade, and upgrade never gets called. Ending up breaking some nif libraries that ignore reload.
If code:delete removes the modules entirely, the load_file should place it back without calling reload nor upgrade?

References:
https://github.com/erlang/rebar3/pull/1317
https://github.com/erlang/otp/blob/maint/lib/stdlib/src/c.erl#L253-L255
https://github.com/erlang/rebar3/blob/master/src/rebar_agent.erl#L112
OTP-Maintainer commented 8 years ago

sverker said:

"reload" should not be called in this case, "upgrade" should be called instead.
This is a bug existing since 19.0.

The problem is code:delete followed by code:load_file of a module
with an on_load function (which is usually used to load nif lib).
The sequence code:delete followed by code:load_file is quite redundant.
code:load_file will do code:delete anyway if needed (in a correct a correct way).
The only difference is that an explicit call to code:delete will leave
a small period of time in between when no version of the module is callable.

So, as a workaround remove the redundant call to code:delete().

"reload" is deprecated yes, but it can still be called if the *same module instance*
does repeated call(s) to erlang:load_nif/2. But this use is discouraged and it seems
reasonable to remove "reload" totally in OTP 20.
OTP-Maintainer commented 8 years ago

vans163 said:

Great thanks. code:delete will be removed from use in rebar3 then.  That should fix this.
OTP-Maintainer commented 8 years ago

vans163 said:

Actually this produces another issue.

Without calling code:delete, even if the nif calls upgrade successfully now, the nif code is not upgraded!

For example:

{code:erlang}
% upgrade() {return 0}
% test(){return "ok"}

nif:test().
"ok"

% change to return "okk" and recompile
% test(){return "okk"}

code:purge(nif).
code:load_binary(nif).
nif:test().
"ok"

code:purge(nif).
code:delete(nif).
code:load_binary(nif).
=WARNING REPORT==== 1-Sep-2016::13:53:08 ===
The on_load function for module nif returned {error,
                                                  {reload,
                                                   "Reload not supported by this NIF library."}}
{error,on_load_failure}

nif:test().
"okk"
{code}
OTP-Maintainer commented 8 years ago

sverker said:

Are you compiling the new nif.so file over the old one?

At least on Linux that does not work. The nif lib is loaded by calling dlopen("/path/to/nif.so").
If dlopen is called with the same path as an already loaded lib, it will think it's the same
and just increase a reference counter for the lib.

You can use command "lsof -p <pid>" for the beam.smp process
to see which dynamic libraries it has currently opened.
After a successful loading of the new nif.so (causing upgrade callback to be called)
lsof should show both the old and new nif.so as the old lib is still loaded
until it's purged.
OTP-Maintainer commented 8 years ago

vans163 said:

Yes compiling new nif over the old.

I see okay that makes sense, so then purging again after the load_binary should remove the ref to the old nif and keep the new one. But it seems to keep using the old?

{code:erlang}
rebar3 shell

> inotify:init().
%this is the path that is passed to erlang:load_nif , followed by return from erlang:load_nif on next line, followed by {ok, 11} which is the nifs init call return value.
"/home/user/project/inotify/_build/default/lib/inotify/priv/inotify"
ok
{ok, 11}.

% lsof -p <pid>
% beam.smp 5081 user  mem       REG              254,1    13312  8260999 /home/user/project/inotify/priv/inotify.so
% beam.smp 5081 user   11r  a_inode               0,11        0     8304 inotify

> code:purge(inotify).
false
> code:load_file(inotify).
"/home/user/project/inotify/_build/default/lib/inotify/priv/inotify" 
ok 
{module,inotify}
> code:purge(inotify).
false

> inotify:init().
{ok, 11}.
%should be okk

% lsof -p <pid>
% beam.smp 5081 user  mem       REG              254,1    13312  8260999 /home/user/project/inotify/priv/inotify.so
% beam.smp 5081 user   11r  a_inode               0,11        0     8304 inotify
% beam.smp 5081 user   12r  a_inode               0,11        0     8304 inotify

% purge again and load_file adds another 
% beam.smp 5081 user   13r  a_inode               0,11        0     8304 inotify
% delete does not remove any of the a_inode 

{code}

So each time a purge+load_file is done another "beam.smp 5081 user   13r  a_inode               0,11        0     8304 inotify" is added. But even though the library returns 0 from upgrade, the new library is not called into, even if I do purge,load_file, purge. 

Something strange though. If upgrade returns non 0, then the nif library fails to load, loading it again after the failure causes the new version to override the old.

The paths are different on the loaded .so and what is printed to erlshell because the nif (loaded by rebar3)is in a symlinked priv folder to the project local priv.  I do not think this should have any bearing though?

It seems this is perhaps revelent http://erlang.org/pipermail/erlang-bugs/2015-December/005139.html
OTP-Maintainer commented 8 years ago

sverker said:

I sense some confusion.

There are two potential code generations for each module,
*current* and *old*.

code:purge will only deaallocate/unload old code (if any)
and kill lingering processes (if any) that could otherwise
run that old code.

code:delete will only make current code into old (if the old
position is empty). No code is deallocated/unloaded.

Example:

1> code:load_file(test).
{module, test}
%% Current is test(1) and old is empty

2> code:purge(test).
false
%% Nothing happens, current is still test(1) and old is still empty.

3> code:load_file(test).
{module, test}
%% Current is now test(3) and old is test(1)

4> code:purge(test)
false
%% Current is still test(3) and old is now empty

5> code:delete(test)
true
%% Current is now empty and old is now test(3)

6> code:purge(test).
false
%% Both are now empty
OTP-Maintainer commented 8 years ago

vans163 said:

I understand that, so you are saying it is impossible to hotload a new verison of a NIF that uses upgrade on linux? What is the purpose of upgrade callback then?

Currently the only way I see to hotload a new version of nif on linux is to use reload (by using code:delete/1), but that is deprecated.

If I misunderstood and it is indeed possible to hotload a new version of the same NIF on linux (and start calling into the new version vs the old) do you mind sharing how, thanks.

{code:text}
int (*upgrade)(ErlNifEnv* env, void** priv_data, void** old_priv_data, ERL_NIF_TERM load_info)
upgrade is called when the NIF library is loaded and there is old code of this module with a loaded NIF library.

Works the same as load. The only difference is that *old_priv_data already contains the value set by the last call to load or reload for the old module code. *priv_data will be initialized to NULL when upgrade is called. It is allowed to write to both *priv_data and *old_priv_data.

The library will fail to load if upgrade returns anything other than 0 or if upgrade is NULL.
{code}

The description of upgrade leads one to believe, if it is ever called, that means that an old version of the nif was found and for that version to be old a newer version must be present.

So to be clear doing code:purge(nif), code:load_file(nif). Upgrade is called. Upgrade returns 0.  Erlang still uses the old version.

When doing an extra code:delete before the code:load_file. Reload is called. Reload returns 0. Erlang starts using the new version.

Anytime upgrade is called and returns 0, one is led to believe that future calls now will go to the "upgraded" nif library.  If this is not the case on linux I think it should be noted in the documentation, as its quite confusing.
OTP-Maintainer commented 8 years ago

sverker said:

It's impossible to hotload a new version of a NIF that uses upgrade on linux IF you use the same path/filename of the .so file as the old lib.
And that is due to the behaviour of system call dlopen().
It is possible if you use a different path/filename for the .so file.

Another way is to do code:delete and code:purge THEN you can load a new version of the NIF with the same .so path/filename
but then "load" is called instead of "upgrade" as the old lib has already been unloaded by code:purge.

Using code:delete to somehow get  the reload callback is not a solution.
When you do code:delete followed by reload of module with on_load in OTP-19
the internal module structure gets corrupted which can lead to all kinds of
strange behavior, not just the reload callback. I even got a Segmentation fault
crash when I tried just now.

You say "and start calling into the new version vs the old"?
The only way you can call old code (erlang or NIF) is from a lingering process
that was "stuck" in the old code when the upgrade happened.
All calls like foo:bar() will reach the *current* code.
OTP-Maintainer commented 8 years ago

vans163 said:

Okay I see, so doing a code:delete, code:purge, code:load_file seems to work fine, in that specific order.

I still think the beh on linux should be fixed then.  Anytime upgrade gets called you kind of assume that means the new lib will be used.  Upgrade should not be called if it will keep using the old library, that IMO leads to undefined behavior, especially if your deallocing the old priv_data and passing a new one to new priv_data pointer (since you assume upgrade return 0 will start using new lib), you will segfault or worse go into inconsistent state.
OTP-Maintainer commented 8 years ago

sverker said:

The "upgrade" callback is called when there is an already loaded nif lib for the (old) module instance.
It does not matter if the newly loaded nif lib is the exact same as the old lib,
upgrade is still called.
You can use a static variable initialized to zero and increased by load and upgrade
to detect if the same lib is reused.
OTP-Maintainer commented 8 years ago

sverker said:

You also have the perfectly reasonable case of
upgrading the Erlang code of a module while keeping the same .so file for the NIFs.

upgrade is then called and the implementation can choose
what to do with priv_data, for example
* Let the new module instance take over the old priv_data
* Let them share the same priv_data (with some reference counting maybe)
* Allocate new priv_data for the new module instance.
OTP-Maintainer commented 8 years ago

vans163 said:

I see, well I am glad this (so far) seems to work for rebar3.  But I am still a little concerned about future users running into this issue as the behavior is not well documented. You mention in R20 reload will probably be removed?

So then in R20 doing code:purge(nif), code:delete(nif), code:load_file(nif) would call upgrade? (instead of reload)

This would made it so no matter the order of the purge/delete before the load_file, the nif would be upgraded.

But doing just purge + load_file would call upgrade without removing the old version as its not marked old (by delete).  

This kind of behavior in R20 seems more consistent to me.
OTP-Maintainer commented 8 years ago

sverker said:

Yes, purge + delete + load_file will call upgrade instead of reload.
Hopefully fixed already in 19.1 as this is quite a serious data corruption bug.

No matter if you do code:delete or not before code:load_file
the nif will be upgraded. If you already have both current and old versions loaded,
then you *must* purge the old first otherwise delete/load will fail.

When the bug is fixed, a successful purge + load_file will have
the same behavior as doing purge + delete + load_file.
As I said before, the delete is redundant.
...or even harmful in the case when load_file fails
as you end up with no callable (current) version of the module.
OTP-Maintainer commented 8 years ago

sverker said:

And yes, I agree the erl_nif docs could be improved to make the upgrade case more clear.
Contributions are welcome.
OTP-Maintainer commented 8 years ago

vans163 said:

I see the harm in delete now.  Okay I will try to get something in for R19.1 in terms of making the docs clearer.  Glad we got to the bottom of this.
OTP-Maintainer commented 8 years ago

vans163 said:

Just a note I did not notice before:

code:delete(nif), code:purge(nif), code:load_file(nif).  Does not call upgrade, it does not call anything. I assume the shared lib is fully unloaded from the erlang process and the new version loaded.

code:purge(nif), code:delete(nif), code:load_file(nif). This calls reload, this is a bug. Will be replaced in future with upgrade.
OTP-Maintainer commented 8 years ago

sverker said:

delete + purge + load should call "load" (if implemented) and does call "load" when I test it on 19.0.5.
OTP-Maintainer commented 8 years ago

vans163 said:

Yes. I meant it does not call upgrade or reload.  So maybe there could be a state where a call is made to the NIF, but the NIF is not loaded. If the NIF is called between the purge and load_file I assume.