fsspec / sshfs

sshfs - SSH/SFTP implementation for fsspec
Apache License 2.0

Initialisation seems to maintain cached filesystem #42

Open benrutter opened 7 months ago

benrutter commented 7 months ago

I'm not 100% sure the title here is accurate, as it involves more understanding of what's happening under the hood with asyncssh than I have so far.

I'm also not sure whether this is intended behaviour or actually a bug (sorry!)

The issue is something like this:

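from sshfs import SSHFileSystem

# host, username, password and long_list_of_files are defined elsewhere
fs = SSHFileSystem(host, username=username, password=password)
for filepath in long_list_of_files:
    with fs.open(filepath) as file:
        _ = file.read()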

If this runs for a long time, the connection might be shut down from the other side, raising an asyncssh.sftp.SFTPNoConnection error. So far this is all as expected.

The part that seems unusual is what happens with something like this:

from asyncssh.sftp import SFTPNoConnection
from sshfs import SSHFileSystem

fs = SSHFileSystem(host, username=username, password=password)
for filepath in long_list_of_files:
    try:
        with fs.open(filepath) as file:
            _ = file.read()
    except SFTPNoConnection:
        # try to reconnect by constructing a new filesystem
        new_fs = SSHFileSystem(host, username=username, password=password)
        with new_fs.open(filepath) as file:
            _ = file.read()

Here new_fs always raises the same SFTPNoConnection error, which seems to be because something is being cached behind the scenes.

Notably, the following works by clearing the cache before reconnecting:

from asyncssh.sftp import SFTPNoConnection
from sshfs import SSHFileSystem

fs = SSHFileSystem(host, username=username, password=password)
for filepath in long_list_of_files:
    try:
        with fs.open(filepath) as file:
            _ = file.read()
    except SFTPNoConnection:
        # drop the cached instance so the next constructor call reconnects
        fs.clear_instance_cache()
        new_fs = SSHFileSystem(host, username=username, password=password)
        with new_fs.open(filepath) as file:
            _ = file.read()

I'd assume the expected behaviour would be that initialising a new SSHFileSystem would create a fully new connection - is this intentional behaviour?
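The behaviour described above is consistent with fsspec's instance cache, which can hand back an existing filesystem object when the constructor is called again with the same arguments. Below is a minimal, library-free sketch of that mechanism. It is a toy model, not fsspec's actual code: CachedMeta, FakeFS and the connected flag are hypothetical names used only for illustration.

```python
class CachedMeta(type):
    """Toy metaclass: reuse instances that were built with identical arguments."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls._cache = {}  # one cache per class

    def __call__(cls, *args, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cls._cache:
            # Only the first call with these arguments builds a new object.
            cls._cache[key] = super().__call__(*args, **kwargs)
        return cls._cache[key]


class FakeFS(metaclass=CachedMeta):
    """Stand-in for a filesystem that connects once, in its constructor."""

    def __init__(self, host, username=None):
        self.host = host
        self.connected = True  # pretend a connection was opened here

    @classmethod
    def clear_instance_cache(cls):
        cls._cache.clear()


fs = FakeFS("example.org", username="me")
fs.connected = False  # simulate the server dropping the connection

same_fs = FakeFS("example.org", username="me")
print(same_fs is fs, same_fs.connected)    # the "new" object is the old, dead one

FakeFS.clear_instance_cache()
fresh_fs = FakeFS("example.org", username="me")
print(fresh_fs is fs, fresh_fs.connected)  # a genuinely new object after clearing
```

In this model the second constructor call never opens a connection at all, which would explain why the retry fails in exactly the same way until the cache is cleared.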

benrutter commented 7 months ago

I should also mention that there's a Stack Overflow question related to this.

benrutter commented 7 months ago

I've done some playing around and noticed that the connection isn't closed during finalize. Is that intentional? Also, is there any reason not to put a self.clear_instance_cache() in that finalize step? (I feel like there probably is, I'm mainly just interested!)

fkrauthan commented 2 weeks ago

We discovered the same issue. Because of the cached file system, this library actually leaks connections: every time you initialize an instance of SSHFileSystem it connects to the SFTP server, while all method calls use the very first instance only.

Given that the library opens a new connection as part of the constructor, it should probably override the cachable attribute of AbstractFileSystem and set it to False.

A workaround until that is done:

class SSHFileSystemNoCache(SSHFileSystem):
    cachable = False

and then use the SSHFileSystemNoCache class in your code instead.
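As a rough sketch of how an uncached class could plug into the retry loop from the original report, the helper below rebuilds the filesystem through a factory whenever the connection error is raised. read_all, make_fs and retry_on are hypothetical names introduced here, not part of sshfs or fsspec, and the approach only helps if make_fs really returns a new instance each time.

```python
def read_all(paths, make_fs, retry_on):
    """Read every path, rebuilding the filesystem when `retry_on` is raised."""
    fs = make_fs()
    contents = []
    for path in paths:
        try:
            with fs.open(path) as f:
                contents.append(f.read())
        except retry_on:
            # make_fs must return a genuinely new, uncached instance here,
            # otherwise the retry reuses the same dead connection.
            fs = make_fs()
            with fs.open(path) as f:
                contents.append(f.read())
    return contents
```

With the workaround above, make_fs could be something like lambda: SSHFileSystemNoCache(host, username=username, password=password) and retry_on could be SFTPNoConnection. A single retry per file is a deliberate simplification; a second failure propagates to the caller.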

It would also be great to be able to close the connection explicitly. Currently I call fs.client.close() to enforce that once I am done with my fs instance.
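Until an explicit close exists, one way to make that cleanup harder to forget is a small context manager around the filesystem. This is only a sketch: closing_client is a hypothetical helper, and it assumes the object exposes a client attribute with a close() method, as the comment above reports.

```python
from contextlib import contextmanager


@contextmanager
def closing_client(fs):
    """Yield `fs`, then close its underlying client even if the body raised."""
    try:
        yield fs
    finally:
        # Assumption: `fs.client` has a close() method, as described above.
        fs.client.close()
```

It would be used as with closing_client(SSHFileSystemNoCache(host, username=username, password=password)) as fs: followed by the usual fs.open(...) calls inside the block.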