Automatically transfer data between nodes for remote series
and dataframes and perform distributed garbage collection.
The functions in Explorer.DataFrame and Explorer.Series
will automatically move operations on remote dataframes to
the nodes they belong to. This module provides additional
conveniences for manual placement.
Implementation details
There is a new module called Explorer.Remote.
In order to understand what it does, we need
to understand the challenges in working with remote series
and dataframes.
Series and dataframes are actually NIF resources: they are
pointers to blobs of memory managed by low-level libraries.
Those are represented in Erlang/Elixir as references (the
same kind of term that make_ref/0 returns). Once the reference
is garbage collected (based on refcounting), those NIF
resources are garbage collected and the memory is reclaimed.
When using Distributed Erlang, you may write this code:
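A minimal sketch of such a call, assuming a connected node that has Explorer loaded. The module name RemoteSeriesSketch and the function create_on/1 are hypothetical wrappers for illustration; :erpc.call/4 and Explorer.Series.from_list/1 are existing functions:

```elixir
defmodule RemoteSeriesSketch do
  # Hypothetical wrapper, for illustration only.
  # Asks `node` to build a series and returns the resulting struct to the
  # caller. No process on `node` keeps the underlying resource afterwards.
  def create_on(node) do
    :erpc.call(node, Explorer.Series, :from_list, [[1, 2, 3]])
  end
end
```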
However, the code above will not work, because the series
will be allocated in the remote node and the remote node
won't hold a reference to said series! This means the series
is garbage collected and if we attempt to read it later on,
from the caller node, it will no longer exist. Therefore,
we must explicitly place these resources in remote nodes
by spawning processes to hold these references. That's what
the place/2 function in this module does.
We also need to guarantee these resources are not kept
forever by these remote nodes, so place/2 also creates a
local NIF resource. When that local resource is garbage
collected, it notifies the remote node that the resources
it holds can be released, effectively implementing a remote
garbage collector.
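The two ideas above (a remote process that holds the value, and a signal that tells it when to let go) can be sketched in plain Elixir. This is not Explorer's implementation: the HolderSketch module and its functions are hypothetical, and a process monitor on an owner process stands in for the local NIF resource, since a NIF destructor cannot be written in Elixir alone. It also assumes the module is loaded on the target node.

```elixir
defmodule HolderSketch do
  # Hypothetical illustration of the idea behind place/2, not Explorer's code.

  # Spawns a process on `node` that builds a value with `fun` and keeps it
  # referenced until the `owner` process goes down.
  def hold(node, owner, fun) when is_pid(owner) and is_function(fun, 0) do
    caller = self()
    tag = make_ref()

    pid =
      Node.spawn(node, fn ->
        value = fun.()
        # Stand-in for the GC notification: watch the owner process.
        monitor = Process.monitor(owner)
        send(caller, {tag, :held})
        loop(value, monitor)
      end)

    receive do
      {^tag, :held} -> {:ok, pid}
    after
      5_000 -> {:error, :timeout}
    end
  end

  # Reads the held value back from the holder process.
  def get(pid) do
    tag = make_ref()
    send(pid, {:get, self(), tag})

    receive do
      {^tag, value} -> {:ok, value}
    after
      1_000 -> {:error, :gone}
    end
  end

  # Keeping `value` in the loop arguments is what keeps it alive.
  defp loop(value, monitor) do
    receive do
      {:get, from, tag} ->
        send(from, {tag, value})
        loop(value, monitor)

      {:DOWN, ^monitor, :process, _pid, _reason} ->
        # Owner is gone: stop looping so the value can be collected.
        :released
    end
  end
end
```

Calling HolderSketch.hold(node(), self(), fn -> [1, 2, 3] end) on a single node exercises the same flow locally. The holder exits once the owner does, which is the role the GC notification plays in the design described above.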
TODO
[x] Make collect in dataframe transfer to the current node
collect to series
compute to dataframe
node option to creation functions