Eta0 last week
This is more of a general comment, since it doesn't block this PR, but as an optimization we should operate on detached copies of tensors so that we can better control tensor lifetime and the impact of their garbage collection (by manually set_-ing them to empty buffers at the end of the serialization process). The handling of futures in this PR makes the object lifetime a little unclear, and depending on which function deallocation ends up being triggered in, it can reduce performance by a bit. Working on detached copies would also let deallocation be shunted to a background thread right before returning from the function.
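
To make the idea concrete, here is a rough, library-free sketch of the pattern, not code from this PR. `Buffer`, `release`, and `serialize` are hypothetical names; `Buffer` only stands in for a detached tensor that owns a large allocation, and `release()` plays the role that set_-ing a tensor to an empty buffer would play.

```python
import threading


class Buffer:
    """Hypothetical stand-in for a detached tensor that owns a large buffer."""

    def __init__(self, nbytes):
        self.data = bytearray(nbytes)

    def release(self):
        # Swap in an empty buffer so the large allocation is dropped here,
        # at a point we choose, rather than wherever the last reference dies.
        self.data = bytearray()


def serialize(buffers):
    """Toy serializer: returns the byte count and the thread doing the cleanup."""
    # Placeholder for the real serialization work.
    total = sum(len(b.data) for b in buffers)

    def _free(owned):
        for b in owned:
            b.release()

    # Hand the release work to a background thread right before returning,
    # so the calling thread is not the one that triggers deallocation.
    cleaner = threading.Thread(target=_free, args=(list(buffers),), daemon=True)
    cleaner.start()
    return total, cleaner
```

The thread is returned only so that a caller (or a test) can `join()` it; in practice it could be fire-and-forget. Whether moving the frees off the calling thread actually helps would need to be measured against the real tensors.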