Raynes / fs

File system utilities for Clojure.

iterate-dir throws OOME on large directory structures #38

Open pmonks opened 11 years ago

pmonks commented 11 years ago

The iterate-dir function consumes all available heap and throws an OutOfMemoryError (OOME) when run over large directory structures.

The following typescript demonstrates the problem in a couple of different ways, using a directory tree containing approximately 600,000 files and sub-directories (note: embedded ANSI escape characters have been manually removed from the typescript for clarity):

Script started on Thu Jan  3 22:45:39 2013
bash-3.2$ ls -R /Users/pmonks/Development | wc -l
ls: unreadableDirectory: Permission denied
  614630
bash-3.2$ lein repl
nREPL server started on port 52181
REPL-y 0.1.0-beta10
Clojure 1.4.0
    Exit: Control+D or (exit) or (quit)
Commands: (user/help)
    Docs: (doc function-name-here)
          (find-doc "part-of-name-here")
  Source: (source function-name-here)
          (user/sourcery function-name-here)
 Javadoc: (javadoc java-object-or-class-here)
Examples from clojuredocs.org: [clojuredocs or cdoc]
          (user/clojuredocs name-here)
          (user/clojuredocs "ns-here" "name-here")
fs-scan.core=> (require '[fs.core :as fs])
nil
fs-scan.core=> (defn walker [root dirs files] ())
#'fs-scan.core/walker
fs-scan.core=> (fs/walk walker "/Users/pmonks/Development")
OutOfMemoryError Java heap space  java.util.Arrays.copyOf (Arrays.java:2882)

fs-scan.core=> (fs/iterate-dir "/Users/pmonks/Development")
OutOfMemoryError Java heap space  java.util.Arrays.copyOf (Arrays.java:2882)

fs-scan.core=> (do (fs/iterate-dir "/Users/pmonks/Development") ())
OutOfMemoryError Java heap space  java.util.Arrays.copyOf (Arrays.java:2882)

fs-scan.core=> exit
Bye for now!

bash-3.2$ exit
exit

Script done on Thu Jan  3 22:53:42 2013

I believe this is occurring because iterate-dir is not lazy (despite the doc comment), and is eagerly building the entire sequence of pathnames in memory.
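
For comparison, clojure.core's file-seq (which is built on tree-seq) stays lazy; the snippet below is purely illustrative, and "/some/huge/dir" is a placeholder path:

(require '[clojure.java.io :as io])

;; Purely illustrative; "/some/huge/dir" is a placeholder path.
;; Eager: realizing everything into a vector retains every entry at once,
;; the same failure mode as iterate-dir eagerly building its whole result.
(def all-entries (vec (file-seq (io/file "/some/huge/dir"))))

;; Lazy, head not retained: entries can be garbage-collected as they are
;; consumed, so heap use stays roughly constant regardless of tree size.
(doseq [f (file-seq (io/file "/some/huge/dir"))]
  (println (.getPath f)))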

pmonks commented 11 years ago

For my use case, this issue appears when using the walk function. Basically I want to be able to walk very large directory structures (tens to hundreds of millions of files, transitively), processing as I go.
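
One possible stop-gap (an untested sketch, not part of fs, and it needs JDK 7+) is java.nio.file.Files/walkFileTree, which streams the tree through a visitor with flat heap use; note it only gives a per-file callback rather than walk's (root, dirs, files) shape, and process-file here is just a placeholder:

(import '(java.nio.file Files FileVisitResult Paths SimpleFileVisitor))

;; Untested stop-gap sketch, not part of fs; requires JDK 7+.
;; process-file is a placeholder for whatever per-file work is needed.
(defn walk-file-tree [root process-file]
  (Files/walkFileTree
    (Paths/get root (into-array String []))
    (proxy [SimpleFileVisitor] []
      (visitFile [file attrs]
        (process-file file)
        FileVisitResult/CONTINUE)
      (visitFileFailed [file exc]
        ;; e.g. the "Permission denied" entry in the typescript: skip it and keep going
        FileVisitResult/CONTINUE))))

;; e.g. (walk-file-tree "/Users/pmonks/Development" #(println (str %)))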

Raynes commented 11 years ago

I see. The problem is that the zipper used under the hood holds the whole tree in memory. I'll get a fix in ASAP. It should just be a tree-seq (I didn't write this code; I never write code that blows the heap, you see ;)).
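
Roughly, something along these lines might work (untested sketch only; lazy-iterate-dir is a placeholder name, not necessarily what the actual fix will look like):

(require '[clojure.java.io :as io])

;; Untested sketch; lazy-iterate-dir is a placeholder name, not the actual fix.
;; tree-seq walks the tree lazily, so only the branch currently being visited
;; is held in memory, as long as the caller doesn't hang onto the head.
(defn lazy-iterate-dir
  "Lazily yields [dir #{subdir-names} #{file-names}] for each directory
  under root, in the spirit of fs/iterate-dir."
  [root]
  (for [dir (tree-seq #(.isDirectory %) #(or (.listFiles %) []) (io/file root))
        :when (.isDirectory dir)
        :let [children (or (.listFiles dir) [])]]
    [dir
     (set (map #(.getName %) (filter #(.isDirectory %) children)))
     (set (map #(.getName %) (remove #(.isDirectory %) children)))]))

;; e.g. consume it with doseq so the head isn't retained:
;; (doseq [[dir dirs files] (lazy-iterate-dir "/Users/pmonks/Development")]
;;   (println dir (count dirs) (count files)))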

pmonks commented 11 years ago

;-)

Thanks for the lickety-split response - I'll keep an eye out for the update and give the new version a whirl.