ajtulloch / sklearn-compiledtrees

Compiled Decision Trees for scikit-learn
tullo.ch/articles/decision-tree-evaluation/
MIT License

OSError: [Errno 24] Too many open files when RandomForestRegressor has 140 estimators #22

Open ollieglass opened 7 years ago

ollieglass commented 7 years ago

Here's a loop that fits and compiles trees, stepping up the number of estimators each time:

from sklearn import datasets, ensemble
import compiledtrees

data = datasets.load_boston()
X, y = data.data, data.target

for i in range(20, 250, 20):
    print(i)

    model = ensemble.RandomForestRegressor(n_jobs=4, n_estimators=i)
    model.fit(X, y)

    model = compiledtrees.CompiledRegressionPredictor(model)

    h = model.predict(X)

It crashes on 140:

$ python test_script.py 
20
40
60
80
100
120
140
Traceback (most recent call last):
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/joblib/_parallel_backends.py", line 344, in __call__
    return self.func(*args, **kwargs)
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/joblib/parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/compiledtrees/code_gen.py", line 173, in _compile
    _call([CXX_COMPILER, cpp_f, "-c", "-fPIC", "-o", o_f.name, "-O3", "-pipe"])
  File "/Users/ollieglass/code/test_compiled_trees/lib/python3.5/site-packages/compiledtrees/code_gen.py", line 179, in _call
    shell=True, stdout=DEVNULL, stderr=DEVNULL)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 576, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 557, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 947, in __init__
    restore_signals, start_new_session)
  File "/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 1454, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files

This is on mac OS.

I haven't looked into workarounds - perhaps I can increase the number of files that can be open at once. But if there's a way to limit the open files in the library, that would probably be better.

ollieglass commented 7 years ago

I had a look at code_gen.py. Perhaps the CodeGenerator class could build a string instead of opening and writing to a file. When the .file method is called, it could write the string to a file, close it, and return the name.
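A minimal sketch of that idea, buffering generated code in memory and only materializing a temporary file on demand. The class and method names here (BufferedCodeGenerator, write, file) are illustrative, not the library's actual API:

```python
import io
import tempfile

class BufferedCodeGenerator:
    """Illustrative variant: accumulate code in memory, write it out once."""
    def __init__(self):
        self._buf = io.StringIO()

    def write(self, line):
        # Append one line of generated C++ to the in-memory buffer.
        self._buf.write(line + "\n")

    def file(self):
        # Open, fill, and close the temp file in one step, so at most one
        # descriptor is live per generator at any moment.
        with tempfile.NamedTemporaryFile(
                prefix="compiledtrees_", suffix=".cpp",
                mode="w", delete=False) as f:
            f.write(self._buf.getvalue())
        return f.name
```

As noted in the next comment, the .o files produced by the compiler would still exist, so this would roughly halve the descriptor requirement rather than eliminate it.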

mwojcikowski commented 7 years ago

On Linux and macOS you simply have to issue ulimit -n 2048. By design, compiling trees consumes 2 * n_trees + 2 open files.
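That estimate also explains why the script above dies at exactly 140 trees: 2 * 140 + 2 = 282 descriptors, which exceeds the default soft limit of 256 that macOS typically ships with. On Unix-like systems the limit can also be raised from inside the script; a minimal sketch using the standard resource module:

```python
import resource

# Query the current soft and hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

n_trees = 140
needed = 2 * n_trees + 2  # files opened while compiling, per the estimate above

# Raise the soft limit if compiling would exceed it, never above the hard limit.
if soft < needed:
    target = needed if hard == resource.RLIM_INFINITY else min(needed, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```

Note that an unprivileged process can only raise its soft limit up to the hard limit, so ulimit (or the Windows snippet below) may still be required.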

On Windows there is no way to raise the limit globally, but there is an internal solution, which you have to include in your script:

import platform

if platform.system() == 'Windows':
    import win32file
    win32file._setmaxstdio(2048)

I used to write one cpp file, but it didn't work for large forests - especially if you have lots of data and allow for full growth. For my example this translates to 500 .cpp files of over 100MB each (50GB+ of RAM). Keeping all those files in StringIOs would probably work, although the .o files would also still be there, so we would only go down to n_trees + 2 open files (assuming we successfully close/delete each .cpp after compiling it to .o).

To sum up - I regard this as a non-issue, and working around it would probably cost a lot of RAM in return, which is ultimately a deal-breaker (at least for me).

ollieglass commented 7 years ago

I see what you mean. I've fixed the problem for myself; as you say, it isn't hard.

I am concerned that users could be put off by this. How about an informative error for them, like this?

class CodeGenerator(object):
    def __init__(self):
        try:
            self._file = tempfile.NamedTemporaryFile(
                prefix='compiledtrees_', suffix='.cpp', delete=True)
        except OSError as e:
            if e.errno == 24:
                print("Too many open files. Increase the limit to 2 * n_trees + 2 "
                      "(unix / mac: ulimit -n [limit], windows: http://bit.ly/2fAKnz0)",
                      file=sys.stderr)
            raise

        self._indent = 0
edit: added if

mwojcikowski commented 7 years ago

That might be a good solution if e.errno == 24 across platforms. If I remember correctly, on Windows I got some kind of "Permission Denied" errors, which were terrible to debug...

Although I fear we will catch some false positives.

Also, a unit test for that would be useful (see the hints above on changing limits on all platforms).
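A sketch of such a test for Unix-like platforms, lowering the soft RLIMIT_NOFILE and checking that exhausting descriptors raises errno 24 (EMFILE); a Windows variant would need win32file._setmaxstdio instead, and (per the caveat above) might surface a different errno:

```python
import errno
import resource
import tempfile
import unittest

class TestOpenFileLimit(unittest.TestCase):
    def test_emfile_raised_when_limit_exceeded(self):
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        # Lower the soft limit for this process only; restore it afterwards.
        resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))
        handles = []
        try:
            with self.assertRaises(OSError) as ctx:
                # Opening more files than the lowered limit must fail.
                for _ in range(128):
                    handles.append(tempfile.NamedTemporaryFile())
            self.assertEqual(ctx.exception.errno, errno.EMFILE)
        finally:
            for h in handles:
                h.close()
            resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

if __name__ == "__main__":
    unittest.main()
```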