benhoyt / scandir

Better directory iterator and faster os.walk(), now in the Python 3.5 stdlib
https://benhoyt.com/writings/scandir/
BSD 3-Clause "New" or "Revised" License

is there any way to read all file names in the sub directory? #125

Closed: chanduthedev closed this issue 4 years ago

chanduthedev commented 4 years ago

Hi @benhoyt,

Thank you for the very useful library. It really saved us a lot of time. I have a small question about reading files. My use case is below.

I have around 4 million folders in a master folder, and each of those folders contains 2-3 files. I would like to read all the file names in an efficient way. Could you please give me some pointers on this?

I used the scandir library to read all 4 million folder names in less than a minute.

Thanks in advance.

benhoyt commented 4 years ago

Hi @chanduthedev - scandir should be able to do this quite easily. If you know that the depth is always 2 (one master folder and sub-folders), then you can make a function that does a scandir loop within a scandir loop: scan_2_deep() in the example below. If the depth is variable, you'd have to make some kind of recursive function: one such function that scans n deep is shown in the example as scan_n_deep(). Here's a Python 3 program showing this:

import os
import sys

def scan_2_deep(path):
    for entry in os.scandir(path):
        if not entry.is_dir():
            # Skip non-directories in master folder
            continue

        for sub_entry in os.scandir(entry.path):
            if not sub_entry.is_file():
                # Skip non-files in sub-folder
                continue

            print(sub_entry.path)

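def scan_n_deep(path, n):
    # Print non-directory entries found exactly n levels below path
    # (n == 1 means the entries directly inside path)
    for entry in os.scandir(path):
        if entry.is_dir():
            if n > 1:
                # Descend one level, with one less level left to go
                scan_n_deep(entry.path, n-1)
        elif n == 1:
            print(entry.path)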

#scan_2_deep(sys.argv[1])

scan_n_deep(sys.argv[1], int(sys.argv[2]))

Does that help?
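As a side sketch (not part of the program above), the two-level scan can also be written as a generator that yields paths instead of printing them, so the caller decides what to do with each one. It assumes Python 3.6+, where os.scandir() can be used as a context manager to close each directory handle promptly. The names iter_files_2_deep and make_demo_tree are made up for illustration.

```python
import os
import tempfile

def iter_files_2_deep(path):
    """Yield the path of every regular file exactly one folder below `path`."""
    with os.scandir(path) as master:
        for entry in master:
            if not entry.is_dir():
                # Skip non-directories in the master folder
                continue
            with os.scandir(entry.path) as sub:
                for sub_entry in sub:
                    if sub_entry.is_file():
                        yield sub_entry.path

def make_demo_tree(root, folders=3, files_per_folder=2):
    """Hypothetical helper: build a tiny master folder to try this out."""
    for i in range(folders):
        sub = os.path.join(root, f"folder_{i}")
        os.mkdir(sub)
        for j in range(files_per_folder):
            # Create an empty file
            with open(os.path.join(sub, f"file_{j}.txt"), "w"):
                pass

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_demo_tree(root)
        for file_path in iter_files_2_deep(root):
            print(file_path)
```

Because the generator is lazy, nothing is held in memory beyond the current entry, which may matter with millions of folders.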

chanduthedev commented 4 years ago

Thank you for the quick reply, @benhoyt. The scan_2_deep method basically iterates over all the folders one by one, right? I implemented the same logic before posting this question. It took around 13-14 minutes to read all ~4 million files from 2 million folders.

I also tried your scan_2_deep method; it took around 13 minutes 6 seconds, whereas the existing method I used (paths.list_images() in Python) took 14 minutes 21 seconds.

Iterating over the folders this way gives very little improvement over paths.list_images(), so I will keep using paths.list_images(). Thank you so much for your quick response.
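For anyone wanting to repeat a comparison like this, here is a rough timing sketch. It does not reproduce the measurements above: it compares a nested os.scandir() loop against os.walk() (which has itself been built on scandir since Python 3.5), and the function names are invented for illustration. Whether paths.list_images() behaves like the os.walk() version is an assumption that is not checked here.

```python
import os
import tempfile
import time

def count_files_scandir(path):
    """Count regular files exactly one folder below `path`."""
    count = 0
    with os.scandir(path) as master:
        for entry in master:
            if entry.is_dir():
                with os.scandir(entry.path) as sub:
                    count += sum(1 for e in sub if e.is_file())
    return count

def count_files_walk(path):
    """Count non-directory entries at every depth under `path` with os.walk()."""
    return sum(len(files) for _, _, files in os.walk(path))

def timed(func, path):
    """Return (result, elapsed_seconds) for a single call of func(path)."""
    start = time.perf_counter()
    result = func(path)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    # Tiny throwaway tree; point `root` at a real master folder instead
    with tempfile.TemporaryDirectory() as root:
        for i in range(100):
            sub = os.path.join(root, f"folder_{i}")
            os.mkdir(sub)
            for j in range(3):
                with open(os.path.join(sub, f"file_{j}.txt"), "w"):
                    pass
        for func in (count_files_scandir, count_files_walk):
            n, seconds = timed(func, root)
            print(f"{func.__name__}: {n} files in {seconds:.4f}s")
```

A single run can mislead: the first pass over a large tree usually warms the operating system's file cache, so it is worth running each method more than once and in both orders before drawing conclusions.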

benhoyt commented 4 years ago

Sounds good!