PyFilesystem / pyfilesystem

Python filesystem abstraction layer
http://pyfilesystem.org/
BSD 3-Clause "New" or "Revised" License

Current listdirinfo gets info twice with horrible performance when working with big directory sets (1000+) #209

Closed mareklabonarski closed 8 years ago

mareklabonarski commented 9 years ago

Hi,

I recently used fs to deal with FTP. The code is very nice and clean, but there appears to be a flawed approach to getting directory info, which makes it far slower than it could be. From what I see, the directory listing is first fetched with all the stats together, then the stats are discarded, and then fetched again for each entry separately. With a set of 1000+ directories, instead of a few commands, 1000+ commands are sent to the FTP server, which creates a huge overhead. The methods pasted below should show it:

@ftperrors
def listdir(self, path="./", wildcard=None, full=False, absolute=False, dirs_only=False, files_only=False):
    path = normpath(path)
    #self.clear_dircache(path)
    if not self.exists(path):
        raise ResourceNotFoundError(path)
    if not self.isdir(path):
        raise ResourceInvalidError(path)
    paths = self._readdir(path).keys()
    # my comment: at this point self._readdir(path) has actually returned everything
    # listdirinfo needs (paths with their stats), and got it quickly (milliseconds)
    # end of my comment
    return self._listdir_helper(path, paths, wildcard, full, absolute, dirs_only, files_only)

@ftperrors
def listdirinfo(self, path="./",
                      wildcard=None,
                      full=False,
                      absolute=False,
                      dirs_only=False,
                      files_only=False):
    path = normpath(path)
    def getinfo(p):
        try:
            if full or absolute:
                return self.getinfo(p)
            else:
                return self.getinfo(pathjoin(path,p))
        except FSError:
            return {}
    # my comment: if we fetch info for each directory separately, it takes a
    # horribly long time (minutes!) with 1000+ directories, and that is exactly
    # what happens below
    # end of my comment
    return [(p, getinfo(p))
                for p in self.listdir(path,
                                      wildcard=wildcard,
                                      full=full,
                                      absolute=absolute,
                                      dirs_only=dirs_only,
                                      files_only=files_only)]
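The N+1 pattern described above can be illustrated with a small self-contained sketch. `FakeFTPFS` below is a hypothetical stand-in (not part of PyFilesystem) that counts simulated FTP round trips, comparing the per-entry `getinfo` approach against reusing the stats that a single `_readdir`-style listing already carries:

```python
class FakeFTPFS:
    """Hypothetical toy filesystem that counts simulated FTP commands."""

    def __init__(self, entries):
        self._entries = entries  # mapping of name -> info dict
        self.commands = 0        # number of simulated FTP round trips

    def _readdir(self, path):
        # One LIST command returns the names *and* their stats together.
        self.commands += 1
        return dict(self._entries)

    def getinfo(self, path):
        # Each per-entry stat costs another round trip to the server.
        self.commands += 1
        name = path.rsplit("/", 1)[-1]
        return self._entries[name]

    def listdirinfo_slow(self, path):
        # Mirrors the reported behaviour: list once (stats discarded),
        # then stat every entry again individually.
        names = list(self._readdir(path))
        return [(n, self.getinfo(path + "/" + n)) for n in names]

    def listdirinfo_fast(self, path):
        # Reuses the stats the listing already fetched: one command total.
        return list(self._readdir(path).items())


entries = {"dir%d" % i: {"size": 0} for i in range(1000)}
fs = FakeFTPFS(entries)

fs.listdirinfo_slow("/data")
slow_commands = fs.commands   # 1 LIST + 1000 per-entry stats = 1001

fs.commands = 0
fs.listdirinfo_fast("/data")
fast_commands = fs.commands   # a single LIST
```

With 1000 entries the slow path issues 1001 simulated commands versus 1 for the batched path, which matches the seconds-versus-minutes difference reported over a real FTP connection.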

I have written a modified (new) listdirinfo for my needs. It looks like this:

def fast_list_dir_info(self, path, *args):
    path = normpath(path)
    if not self.exists(path):
        raise ResourceNotFoundError(path)
    if not self.isdir(path):
        raise ResourceInvalidError(path)
    # _readdir already returns the paths together with their stats in one call
    return self._readdir(path)

:)

Would you consider fixing the performance issue in your code?

Best Regards, Marek

willmcgugan commented 9 years ago

Would you mind submitting a Pull Request for that?

lurch commented 8 years ago

See also #68 which unfortunately petered out before any conclusion was reached :-/

willmcgugan commented 8 years ago

There will be a good solution to this in fs2.0. The API design in the current version makes it tricky to list directories optimally...