OpenMined / PySyft

Perform data science on data that remains in someone else's server
https://www.openmined.org/
Apache License 2.0
9.51k stars 1.99k forks

`__del__` of ObjectPointers on Python shutdown throws errors #3527

Closed Syzygianinfern0 closed 4 years ago

Syzygianinfern0 commented 4 years ago

Describe the bug On execution of even a very basic snippet such as

import torch as th

import syft as sy

hook = sy.TorchHook(th)
alice = sy.VirtualWorker(hook, id="alice")
bob = sy.VirtualWorker(hook, id="bob")
crypto_provider = sy.VirtualWorker(hook, id="james")

a = th.ones(1, 5)
a = a.encrypt(workers=[alice, bob], crypto_provider=crypto_provider)

an error is thrown when Python shuts down.

Exception ignored in: <bound method ObjectPointer.__del__ of [PointerTensor | me:51874166013 -> alice:36055571191]>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/syft/generic/pointers/object_pointer.py", line 347, in __del__
  File "/usr/local/lib/python3.6/dist-packages/syft/workers/base.py", line 267, in send_msg
  File "/usr/local/lib/python3.6/dist-packages/syft/serde/serde.py", line 40, in serialize
ImportError: sys.meta_path is None, Python is likely shutting down
Exception ignored in: <bound method ObjectPointer.__del__ of [PointerTensor | me:83736240700 -> bob:18435096342]>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/syft/generic/pointers/object_pointer.py", line 347, in __del__
  File "/usr/local/lib/python3.6/dist-packages/syft/workers/base.py", line 267, in send_msg
  File "/usr/local/lib/python3.6/dist-packages/syft/serde/serde.py", line 40, in serialize
ImportError: sys.meta_path is None, Python is likely shutting down

To Reproduce Steps to reproduce the behavior:

  1. Go to https://colab.research.google.com/drive/1CSQGnxDZo0TwMjERRpLYKTtDmQoIu9LK?usp=sharing or run the snippet above pasted into a .py file from your terminal/IDE
  2. See error

Expected behavior Python should terminate without throwing an error.

shubham3121 commented 4 years ago

@karlhigley I am not able to replicate the error, either in Google Colab or in a local terminal.

karlhigley commented 4 years ago

I have seen this happen before, though it may not be consistent. I think the cause is PointerTensors or ObjectPointers going out of scope, having their `__del__` method called (which attempts to send a message to the owner of the referenced object), and having that collide with Python's shutdown activities. There's not likely to be a quick fix for this, though, because our GC implementation fundamentally relies on Python's GC, which offers no guarantees about when or whether the `__del__` method will be executed.
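For illustration, here is a minimal standalone sketch of the pattern being described (the class and method names are invented for this sketch, not PySyft's actual implementation): a pointer whose `__del__` serializes and sends a delete message to the object's owner. If such an object is still alive at interpreter shutdown, the import machinery may already be torn down, which is what produces the "sys.meta_path is None" ImportError in the report above.

import json  # stand-in for syft.serde; any lazily-imported serializer has the same problem

class OwnerStub:
    def send_msg(self, payload):
        print("sending", payload)

class PointerSketch:
    def __init__(self, obj_id, owner):
        self.obj_id = obj_id
        self.owner = owner

    def __del__(self):
        # Runs whenever the pointer is garbage-collected. During interpreter
        # shutdown, modules (and sys.meta_path) may already be gone, so a
        # serialize-and-send like this can raise ImportError, which Python
        # reports as "Exception ignored in: <bound method ... __del__ ...>".
        self.owner.send_msg(json.dumps({"delete": self.obj_id}))

# A module-level pointer that survives until shutdown is only collected while
# the interpreter is tearing itself down.
lingering = PointerSketch(36055571191, OwnerStub())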

Long term, the solution is probably to come up with a better way to do distributed GC.

karlhigley commented 4 years ago

@tudorcebere I've been thinking a lot about refactoring workers in order to make Protocols work, and I think we might want to consider having separate threads for message sending/receiving, message processing, and Plan/Protocol execution. If we did that, we'd need to be able to pass messages between threads, so we'd probably use thread-safe queues.

And if we had that, then maybe the __del__ hook that gets called for garbage collection when objects go out of scope could add a delete message to the queue instead of directly trying to serialize and send it. That would turn this issue into "Do we want to make sure outgoing delete messages are processed when Python shuts down? If so, how?"
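A rough sketch of that idea, assuming nothing about the real worker API (the queue, sender thread, and message names below are invented for illustration): `__del__` only enqueues a delete request, and a dedicated thread does the serialization and sending.

import queue
import threading

# Thread-safe queue of (destination_worker_id, object_id) delete requests.
delete_queue = queue.Queue()

def message_sender():
    # Dedicated sender thread: the only place that serializes and sends,
    # so garbage collection never blocks on comms.
    while True:
        destination, obj_id = delete_queue.get()
        print(f"send delete({obj_id}) to {destination}")  # placeholder for the real send
        delete_queue.task_done()

threading.Thread(target=message_sender, daemon=True).start()

class QueuedPointerSketch:
    def __init__(self, obj_id, location_id):
        self.obj_id = obj_id
        self.location_id = location_id

    def __del__(self):
        # Enqueueing is non-blocking and touches no import machinery, so it is
        # safe even during interpreter shutdown; whether queued deletes are
        # flushed before exit is exactly the open question above.
        try:
            delete_queue.put_nowait((self.location_id, self.obj_id))
        except Exception:
            pass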

Instead of sending outgoing delete messages when we shut down, it might make more sense for workers to GC remaining objects that came from workers they're no longer in contact with? Not sure, but seems possible.

karlhigley commented 4 years ago

Thinking maybe we should create a milestone for async/multi-threaded workers and assign this issue to that milestone. Anyone else have thoughts on that? I can't see a good way to address this without some form of concurrency, but that doesn't mean there isn't one. 🤔

tudorcebere commented 4 years ago

@karlhigley I think this might be an awesome idea; workers really need some love. I like the idea of making send and receive run on separate threads (could this help async workers as well?). This could also be a step toward the actor model and the stack forwarding project (maybe we would like to stick with some custom actor model?). I am not familiar with the GC behavior, but in my mind the idea of adding a delete message to the queue when an object goes out of scope could work really nicely (it should make everything more transparent as well).

karlhigley commented 4 years ago

The current GC behavior does send a delete message, but since our comms methods are currently synchronous and blocking, that means that garbage collection is a blocking operation. 🙁

github-actions[bot] commented 4 years ago

This issue has been marked stale because it has been open 30 days with no activity. Leave a comment or remove the stale label to unmark it. Otherwise, this will be closed in 7 days.