flying-circus / pyfilesystem

Automatically exported from code.google.com/p/pyfilesystem
BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

fs commands call listdir more often than needed #68

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Please provide any additional information below.
While debugging FTPFS using the fsls command, I noticed that listdir() was 
being called twice, when I only expected it to be called once.
So I looked into the problem, and found that FSls.do_run() had two separate 
calls to FS.listdir() - once to get the files, and once to get the directories. 
For high-latency connections, this means fsls will take twice as long as it 
needs to!

I've fixed the problem by making one call to FS.listdir() and then borrowing 
the filtering logic from FS._listdir_helper()

I had a look to see if any of the other "commands" had the same problem, and 
noticed that Command.get_resources() was doing something similar, so I've 
managed to fix that too.

See attached patch :)

Original issue reported on code.google.com by gc...@loowis.durge.org on 20 May 2011 at 1:15

Attachments:

GoogleCodeExporter commented 9 years ago
That is a shortcoming in the api, in that you can't get a directory listing and 
be able to distinguish files from directories in one call.

However when it comes to ftpfs there is a directory caching mechanism so that 
even though there are two calls to listdir, it should only make a single 
request to the server.

I'm afraid your patch would likely make things slower in most cases. It calls 
isdir and isfile for each file in a directory listing -- on some 
implementations (including ftpfs with cache hinting set to false) that would 
cause a request per file...

Original comment by willmcgugan on 20 May 2011 at 1:36

GoogleCodeExporter commented 9 years ago
But in the existing FSls code, it calls listdir twice, once with dirs_only set 
to True, and once with files_only set to True. Which means inside 
_listdir_helper there's also going to be multiple calls to isdir and isfile... 
so I'm fairly sure that my patch is always an improvement.

Can you provide a concrete example of where my patch is slower than the 
existing code?

And the second line in FTPFS.listdir() says self.clear_dircache(path), so each 
listdir call *does* make a separate server request, in spite of the caching 
(but isdir and isfile do use the cache).

Original comment by gc...@loowis.durge.org on 20 May 2011 at 2:03

GoogleCodeExporter commented 9 years ago
...maybe there should be a --no-colours option to fsls, so that it doesn't 
always have to distinguish between files and directories? ;)

Original comment by gc...@loowis.durge.org on 20 May 2011 at 2:13

GoogleCodeExporter commented 9 years ago
I don't recall why that clear_dircache call was there. May have been a fix for 
some issue, I'll take a look at that. It's possible that mechanism is broken.

What was intended though is that if you have done cache_hint(True) on the ftpfs 
then two calls two listdir won't cause two requests to the server.

Bear in mind that the commands are intended to run on many filesystems. And in 
the general case 2 calls to listdir is better than 1 call to listdir and n 
calls to isfile/isdir. 

Original comment by willmcgugan on 20 May 2011 at 2:47

GoogleCodeExporter commented 9 years ago
Forgot to add, that calls to isdir and isfile, inside a call to listdir will 
definitely be cached. But when called directly they won't be cached if 
cache_hint is False...

Original comment by willmcgugan on 20 May 2011 at 3:30

GoogleCodeExporter commented 9 years ago
Maybe you're confusing FTPFS.listdir (which does clear the cache) with 
FTPFS._readdir (which doesn't clear the cache) ?

I found a bug in my fsls.py fix in the way it handled the --full option. Fixed 
in the revised attached patch.

> And in the general case 2 calls to listdir is better than 1 call to listdir 
and n calls to isfile/isdir.

And as I already pointed out, the 2 calls to listdir in FSls.do_run make use of 
the dirs_only and files_only flags, each of which will cause _listdir_helper to 
itself call isfile and isdir n times - so AFAICT the current code makes double 
the number of calls to isfile/isdir than my patched code, regardless of any 
caching.
Unless I'm overlooking something? :-S

Original comment by gc...@loowis.durge.org on 20 May 2011 at 3:37

Attachments:

GoogleCodeExporter commented 9 years ago
It looks like the clear_dircache in listdir isn't required. Feel free to 
comment that out until I push a change.

You're right that listdir_helper does call isfile/isdir, but in that context 
they can work with cached directory information. And the implementation is also 
free to do something more efficient.

With your patch though, isfile/isdir are called directly and may not be able to 
use cached directory information. For a network filesystem, that could mean 
many more requests to the server

Have a look at the ftperrors decorator, particularly the call to 
_enter_dircache and _leave_dircache

I'm not being combative -- patches are always welcome. It's just that I don't 
want to accept a small optimization for one filesystem that slows down other 
filesystems.

If you could confirm that the patch doesn't negatively effect the performance 
of rpcfs, sftpfs and s3fs then I'd consider it... 

Original comment by willmcgugan on 20 May 2011 at 4:02

GoogleCodeExporter commented 9 years ago
Okay, I've come up with an alternative implementation which addresses your 
concerns :)

How does this new patch grab you?

Original comment by gc...@loowis.durge.org on 21 May 2011 at 4:32

Attachments:

GoogleCodeExporter commented 9 years ago
While investigating a separate problem, I've just found that in order for my 
most recent patch to work "properly", I also need to add a listdir2 method to 
WrapFS (if I don't, then trying to "fsls" a FTP subdirectory means that the 
SubFS calls base.listdir2 and bypasses my ftpfs.listdir2).

The WrapFS implementations of e.g. hassyspath and isfile are really easy to 
follow, but the WrapFS implementation of listdir goes completely over my head, 
so I don't know how to implement WrapFS.listdir2 :-(
Help please? ;)

Alternatively, is there any way to make the BaseFS / WrapFS / SubFS interaction 
less brittle?

Original comment by gc...@loowis.durge.org on 30 May 2011 at 7:02

GoogleCodeExporter commented 9 years ago
Hmmm, I'm not sure if it's "correct", but this implementation seems to work...

Original comment by gc...@loowis.durge.org on 30 May 2011 at 7:18

Attachments:

GoogleCodeExporter commented 9 years ago
I'm not too keen on the 'listdir2' method. The concept is fine, but appending a 
2 to a method name will only lead to confusion. People will assume it 
supersedes 'listdir' and deprecates it.

Also bear in mind that the classes in 'fs.expose' will all have to be updated 
so they can use the new method.

Could you hold off on committing this one. I'd like to get 0.4 out without any 
API changes. Which I plan to release as soon as I can figure out why serving 
sftp is breaking on Py2.7

Since this modifies the base API, we should really hammer this out on the 
mailing list. Just to be sure we've exhausted all possibilities.

Original comment by willmcgugan on 30 May 2011 at 10:17

GoogleCodeExporter commented 9 years ago
(sorry I haven't been following this issue, so I might be out of the loop a bit)

Does FTPFS provide an efficient listdirinfo() implementation?  If so, you can 
use the following idiom to distinguish between files and directories with a 
single call to listdir::

    for (path,info) in myfs.listdirinfo("/some/path",absolute=True)
        if fs.util.isdir(myfs,path,info):
            print "it's a dir"
        else:
            print "it's a file"

This assumes that listdirinfo() makes a single call to the server, and that the 
info dicts provide an "st_mode" member with the appropriate bits for files and 
directories.

Original comment by rfkel...@gmail.com on 30 May 2011 at 10:57

GoogleCodeExporter commented 9 years ago
Okay, discussion moved to the mailing list... :)

Original comment by gc...@loowis.durge.org on 31 May 2011 at 1:03

GoogleCodeExporter commented 9 years ago
For anyone playing catch-up:
http://groups.google.com/group/pyfilesystem-discussion/browse_thread/thread/a602
f7b9c3110af2

Original comment by gc...@loowis.durge.org on 20 Sep 2011 at 5:48