Closed Syzygianinfern0 closed 4 years ago
@karlhigley I'm not able to replicate the error, either in Google Colab or in a local terminal.
I have seen this happen before, though it may not be consistent. I think the cause is PointerTensors or ObjectPointers going out of scope, having their __del__ method called (which attempts to send a message to the owner of the referenced object), and having that collide with Python's shutdown activities. There's not likely to be a quick fix for this though, because our GC implementation fundamentally relies on Python's GC, which offers no guarantees for when or whether the __del__ method will be executed.
Long term, the solution is probably to come up with a better way to do distributed GC.
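To make the mechanism described above concrete, here is a minimal sketch of a pointer whose finalizer notifies the owner of the referenced object. `Owner`, `Pointer`, and `receive_delete` are hypothetical stand-ins for illustration, not PySyft's actual classes or API, and the "message" is just a direct method call.

```python
class Owner:
    """Hypothetical stand-in for the worker that holds the real objects."""

    def __init__(self):
        self.store = {}
        self.deleted = []

    def receive_delete(self, obj_id):
        # Drop the object and record that a delete message arrived.
        self.store.pop(obj_id, None)
        self.deleted.append(obj_id)


class Pointer:
    """Hypothetical stand-in for a PointerTensor or ObjectPointer."""

    def __init__(self, owner, obj_id):
        self.owner = owner
        self.obj_id = obj_id

    def __del__(self):
        # Runs whenever the interpreter finalizes the pointer. Python does
        # not guarantee when (or whether) that happens, and during
        # interpreter shutdown the things this call relies on may already
        # be partly torn down.
        self.owner.receive_delete(self.obj_id)


owner = Owner()
owner.store["x"] = [1, 2, 3]
ptr = Pointer(owner, "x")
del ptr  # in CPython the reference count drops to zero and __del__ runs here
```

In a real worker the body of `__del__` would serialize and send a message, which is the part that can collide with shutdown.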
@tudorcebere I've been thinking a lot about refactoring workers in order to make Protocols work, and I think we might want to consider having separate threads for message sending/receiving, message processing, and Plan/Protocol execution. If we did that, we'd need to be able to pass messages between threads, so we'd probably use thread-safe queues.
And if we had that, then maybe the __del__ hook that gets called for garbage collection when objects go out of scope could add a delete message to the queue instead of directly trying to serialize and send it. That would turn this issue into "Do we want to make sure outgoing delete messages are processed when Python shuts down? If so, how?"
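A rough sketch of the queue-based idea, under stated assumptions: `QueuedPointer`, `sender_loop`, `shutdown`, and the `sent` list (standing in for the network) are all hypothetical names. It uses `queue.SimpleQueue` from the standard library, whose `put()` is documented as safe to call from `__del__` methods.

```python
import queue
import threading

# Hypothetical outbox shared by pointers and a dedicated sender thread.
outbox = queue.SimpleQueue()
sent = []  # stands in for messages that actually went over the wire


class QueuedPointer:
    """Hypothetical pointer that only enqueues its delete message."""

    def __init__(self, obj_id, outbox):
        self.obj_id = obj_id
        self._outbox = outbox

    def __del__(self):
        # No serialization or network I/O in the finalizer, just an enqueue.
        self._outbox.put(("delete", self.obj_id))


def sender_loop(outbox, sent):
    # Drain the outbox until the None sentinel arrives.
    while True:
        msg = outbox.get()
        if msg is None:
            break
        sent.append(msg)  # a real worker would serialize and send here


sender = threading.Thread(target=sender_loop, args=(outbox, sent), daemon=True)
sender.start()

ptr = QueuedPointer("x", outbox)
del ptr  # the finalizer enqueues ("delete", "x") and returns immediately


def shutdown(timeout=5.0):
    # Ask the sender to finish whatever is queued, then wait for it.
    outbox.put(None)
    sender.join(timeout=timeout)


shutdown()
```

Whether something like `shutdown()` should run when Python exits (for example, registered via `atexit.register`) is exactly the open question raised above; the sketch only shows that the finalizer itself no longer has to do the sending.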
Instead of sending outgoing delete messages when we shut down, it might make more sense for workers to GC any remaining objects that came from workers they're no longer in contact with. Not sure, but it seems possible.
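The receiver-side alternative could look roughly like the following. This is only a sketch, assuming the worker records which remote worker each object came from; `ObjectStore`, `register`, and `on_disconnect` are hypothetical names, and how a worker would detect a lost connection is left out.

```python
from collections import defaultdict


class ObjectStore:
    """Hypothetical store that remembers the source worker of each object."""

    def __init__(self):
        self._objects = {}
        self._by_source = defaultdict(set)

    def register(self, obj_id, obj, source_worker_id):
        self._objects[obj_id] = obj
        self._by_source[source_worker_id].add(obj_id)

    def on_disconnect(self, source_worker_id):
        # Collect everything that came from a worker we lost contact with,
        # instead of waiting for delete messages that may never arrive.
        removed = self._by_source.pop(source_worker_id, set())
        for obj_id in removed:
            self._objects.pop(obj_id, None)
        return removed

    def __contains__(self, obj_id):
        return obj_id in self._objects


store = ObjectStore()
store.register("a", [1], "alice")
store.register("b", [2], "bob")
removed = store.on_disconnect("alice")  # only alice's objects are collected
```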
Thinking maybe we should create a milestone for async/multi-threaded workers and assign this issue to that milestone. Anyone else have thoughts on that? I can't see a good way to address this without some form of concurrency, but that doesn't mean there isn't one. 🤔
@karlhigley I think this might be an awesome idea; workers really need some love. I like the idea of doing send and receive on separate threads (could this help async workers as well?). This could also be a step toward the actor model and the stack forwarding project (maybe we would like to stick with some custom actor model?). I am not familiar with the GC behavior, but in my mind, the idea of adding a delete message when an object goes out of scope could work really nicely (and should make everything more transparent as well).
The current GC behavior does send a delete message, but since our comms methods are currently synchronous and blocking, that means that garbage collection is a blocking operation. 🙁
This issue has been marked stale because it has been open 30 days with no activity. Leave a comment or remove the stale label to unmark it. Otherwise, this will be closed in 7 days.
Describe the bug: On execution of even very basic code such as
an error is thrown when Python shuts down.
To Reproduce Steps to reproduce the behavior:
Expected behavior: Python must terminate without throwing an error.