fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

Is calling close() in AbstractBufferedFile.__del__() a good idea? #1685

Open sugibuchi opened 4 days ago

sugibuchi commented 4 days ago
    def __del__(self):
        if not self.closed:
            self.close()

https://github.com/fsspec/filesystem_spec/blob/2024.9.0/fsspec/spec.py#L2055-L2057

We have had this close() call for years in __del__() of AbstractBufferedFile.

I am unsure whether this is a real concern, since this practice has a long history. But calling close() from __del__(), which needs to access file systems or other resources, can cause problems in real-world situations.

The main concern is the timing of when __del__() is called. This is frequently not under our control (some badly behaved frameworks never close file objects but keep holding references to them), and it can happen at the very last moment of the Python interpreter's shutdown sequence. This is a problem particularly for async filesystems that use event loops.

Let me demonstrate with a toy example.

import asyncio

import fsspec.asyn
from fsspec.asyn import sync_wrapper
from fsspec.implementations.memory import MemoryFileSystem
from fsspec.spec import AbstractBufferedFile

class DummyAsyncFileSystem(MemoryFileSystem):
    def __init__(self, *args, **storage_options):
        super().__init__(*args, **storage_options)
        # fsspec.asyn.AsyncFileSystem does the same
        self.loop = fsspec.asyn.get_loop()

class DummyFile(AbstractBufferedFile):
    def __init__(self, fs, **kwargs):
        super().__init__(fs, **kwargs)
        # Many file class implementations reuse the same event loop used by the fs
        self.loop = fs.loop

    def _fetch_range(self, start, end):
        # Never called in "wb" mode; return bytes to satisfy the interface
        return b""

    async def _flush(self, force=False):
        await asyncio.sleep(1)  # Dummy async operation

    # Typical idiom
    flush = sync_wrapper(_flush)

cache = set()  # The cause of the problem, but such global caches commonly exist in the real world

def run():
    print("starting")
    cache.add(DummyFile(DummyAsyncFileSystem(), path="dummy.txt", mode="wb"))
    print("exiting")

if __name__ == "__main__":
    run()

This code just (1) creates a new dummy file instance and (2) puts it into the cache. However, it gets stuck on exit.

To investigate why, add the following print() calls to __del__() of the file class.

class DummyFile(AbstractBufferedFile):
    ...
    def __del__(self):
        print(fsspec.asyn.loop[0])
        print(fsspec.asyn.iothread[0])
        super().__del__()

starting
exiting
<_UnixSelectorEventLoop running=True closed=False debug=False> # Running... really?
<Thread(fsspecIO, stopped daemon 139767434610240)> # Stopped :(

As you can see, the event loop is marked as "running", but the daemon thread hosting it has stopped. This means the file object was garbage-collected after the interpreter terminated the thread, so sync() will never return: the thread that was running the event loop is gone.
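
To see why this hangs rather than erroring, here is a minimal stdlib-only sketch of the failure mode (not fsspec code; it mimics what fsspec.asyn.sync() does internally with run_coroutine_threadsafe):

import asyncio
import threading

loop = asyncio.new_event_loop()
thread = threading.Thread(target=loop.run_forever, daemon=True)
thread.start()

async def op():
    return "done"

# While the host thread is alive, submitting work completes normally.
print(asyncio.run_coroutine_threadsafe(op(), loop).result())  # "done"

# Stop the host thread. (At interpreter shutdown the daemon thread is
# frozen without stop(), which is why the output above even shows
# running=True.)
loop.call_soon_threadsafe(loop.stop)
thread.join()

# call_soon_threadsafe still queues the callback on the stopped loop,
# but nothing will ever execute it, so result() blocks forever:
fut = asyncio.run_coroutine_threadsafe(op(), loop)
# fut.result()  # <- this is the deadlock sync() runs into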

The root cause in this example is cache.add(), which creates a reference from the global cache object to the file object. We should not do this, but in real-world situations we can accidentally end up with this kind of reference chain from global objects to file objects. It leads to unexpected deadlocks that are difficult to investigate.

I have a proposal: register an atexit hook that stops fsspec's shared event loop cleanly before the interpreter kills the daemon thread.

def _stop_fsspec_async_event_loop():
    loop = fsspec.asyn.loop[0]
    thread = fsspec.asyn.iothread[0]

    if loop:
        if loop.is_running():
            loop.call_soon_threadsafe(loop.stop)
            thread.join()
        if not loop.is_closed():
            loop.close()

atexit.register(_stop_fsspec_async_event_loop)

Running the toy example with this hook registered, the process no longer hangs; instead we get:
Exception ignored in: <function DummyFile.__del__ at 0x7f1dffcc2fc0>
Traceback (most recent call last):
  File "/home/.../test.py", line 35, in __del__
  File "/home/.../lib/python3.11/site-packages/fsspec/spec.py", line 2057, in __del__
  File "/home/.../lib/python3.11/site-packages/fsspec/spec.py", line 2035, in close
  File "/home/.../lib/python3.11/site-packages/fsspec/asyn.py", line 122, in wrapper
  File "/home/.../lib/python3.11/site-packages/fsspec/asyn.py", line 77, in sync
RuntimeError: Loop is not running

We immediately get an exception this time, but I believe this is much better than a deadlock.

martindurant commented 1 day ago

As with the partner issue, there are pros and cons. Generally, we want to prevent data loss, so even if this is happening during interpreter shutdown (as opposed to normal object cleanup), we ought to try to write to the remote. However, it is possible that the filesystem attribute is gone, or indeed that the thread/loop are gone and/or shut down.

So perhaps the thing to do is to try to establish whether the write is possible, and at least do it with a timeout. I'm not sure if that's possible! The atexit hook you propose prevents the hanging, but guarantees data loss. Actually, any call to sync() should check whether the thread/loop is still valid, maybe that's a good start.
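
For illustration, a bounded best-effort flush along those lines might look like the sketch below, using the toy example's async _flush. The helper name and the timeout budget are assumptions, not fsspec API; it relies only on the stdlib run_coroutine_threadsafe contract:

import asyncio
from concurrent.futures import TimeoutError as FutureTimeoutError

def best_effort_flush(f, timeout=1.0):
    # Sketch: try to flush a file during __del__ without risking an
    # unbounded hang. Name and timeout are illustrative.
    loop = getattr(f, "loop", None)
    if loop is None or loop.is_closed():
        return False  # no usable loop; accept the data loss
    fut = asyncio.run_coroutine_threadsafe(f._flush(force=True), loop)
    try:
        fut.result(timeout=timeout)  # bounded wait instead of hanging forever
        return True
    except (FutureTimeoutError, RuntimeError):
        fut.cancel()  # give up: possible data loss, but no deadlock
        return False

Note that this still cannot detect a frozen daemon thread whose loop claims to be running, which is exactly the shutdown case; a thread liveness check (see below) would cover that.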

sugibuchi commented 17 hours ago

Thank you very much, and I understand your point.

A moderate approach is probably to encourage developers of filesystems to consider this problem and implement something special where necessary. If a file class depends on a vital resource that can be destroyed before the file object using it is garbage-collected, the developer of the file class should ensure the file is closed before that happens, for example using atexit (plus a WeakSet, etc.), as in the sketch below.
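
A sketch of that pattern, assuming nothing beyond the standard library (the helper names are illustrative, not fsspec code):

import atexit
import weakref

# Track open file objects without keeping them alive.
_open_files = weakref.WeakSet()

def register_file(f):
    # Call from the file class's __init__ (illustrative helper).
    _open_files.add(f)

@atexit.register
def _close_open_files():
    # atexit callbacks run before the interpreter freezes daemon threads,
    # so the event loop is still usable and close()/flush() can complete.
    for f in list(_open_files):
        if not f.closed:
            f.close()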

And,

Actually, any call to sync() should check whether the thread/loop is still valid, maybe that's a good start.

I totally agree. Unfortunately, it does not look easy to find the thread on which a given event loop is running (AFAIK, there is no public API for this in asyncio). But at least we can check the liveness of fsspec.asyn.iothread[0] in fsspec.asyn.sync(). That would reduce the chance of deadlocks.
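
As a minimal sketch of such a check (the loop/iothread module-level lists are as referenced earlier in this issue; the guard itself is a suggestion, not existing fsspec code):

import fsspec.asyn

def _loop_is_usable():
    # The loop must exist and be open, and the fsspecIO daemon thread
    # hosting it must still be alive.
    loop = fsspec.asyn.loop[0]
    thread = fsspec.asyn.iothread[0]
    return (
        loop is not None
        and not loop.is_closed()
        and thread is not None
        and thread.is_alive()
    )

If this returns False, sync() could raise immediately (or __del__ could skip the flush), turning the silent hang into a visible error like the traceback above.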