astooke / Synkhronos

Extension to Theano for multi-GPU data parallelism
MIT License

Too many open files #6

Closed mharradon closed 7 years ago

mharradon commented 7 years ago

After running for a while:

...
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/function_module.py", line 537, in build_inputs
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/data_builder.py", line 38, in data
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/data_module.py", line 211, in set_value
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/data_module.py", line 102, in _update_array
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/data_module.py", line 111, in _alloc_and_signal
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/data_module.py", line 39, in _alloc_shmem
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/shmemarray.py", line 86, in ShmemRawArray
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/synkhronos/shmemarray.py", line 58, in __init__
OSError: [Errno 24] Too many open files

There are a ton of files in /dev/shm/ - are we leaking file handles? Is some reference being retained?

$ ls /dev/shm
...
-rw-------  1 ubuntu ubuntu    1024 May  3 06:34 synk_91173_data_19_1
-rw-------  1 ubuntu ubuntu 4194304 May  3 06:34 synk_91173_data_18_1
-rw-------  1 ubuntu ubuntu    1024 May  3 06:33 synk_91173_data_17_1
-rw-------  1 ubuntu ubuntu 4194304 May  3 06:33 synk_91173_data_16_1
-rw-------  1 ubuntu ubuntu    1024 May  3 06:33 synk_91173_data_15_1
-rw-------  1 ubuntu ubuntu 4194304 May  3 06:33 synk_91173_data_14_1
-rw-------  1 ubuntu ubuntu    1024 May  3 06:33 synk_91173_data_13_1
-rw-------  1 ubuntu ubuntu 4194304 May  3 06:33 synk_91173_data_12_1
-rw-------  1 ubuntu ubuntu    1024 May  3 06:33 synk_91173_data_11_1
-rw-------  1 ubuntu ubuntu 4194304 May  3 06:33 synk_91173_data_10_1
-rw-------  1 ubuntu ubuntu    1024 May  3 06:33 synk_91173_data_9_1
-rw-------  1 ubuntu ubuntu 4194304 May  3 06:33 synk_91173_data_8_1
-rw-------  1 ubuntu ubuntu    1024 May  3 06:33 synk_91173_data_7_1
...

There are approximately twice as many entries as iterations of the function: I've called build_inputs() about 400 times, each with 2 arguments, and now there are a little more than 800 files in there.

It could be that my code is retaining references that need to be released; I'm not sure.
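The failure mode described above can be reproduced in miniature without synkhronos. The sketch below is a hypothetical stand-in (the name leaky_alloc is made up, not a synkhronos API): each "allocation" memory-maps a fresh file, roughly like one shared-memory array per build_inputs() argument, and nothing ever closes them, so open file descriptors accumulate toward the RLIMIT_NOFILE soft limit and eventually raise OSError: [Errno 24].

```python
import mmap
import resource
import tempfile

# Hypothetical stand-in for the leak pattern described above (NOT
# synkhronos code): every call opens a new file and maps it, and the
# returned handles are held forever, so the fds are never released.
def leaky_alloc(size=1024):
    f = tempfile.TemporaryFile()          # one open fd per "allocation"
    f.truncate(size)                      # give the file a real size
    return f, mmap.mmap(f.fileno(), size) # the mapping holds it open too

# Holding all the handles, like calling build_inputs() in a loop and
# keeping the results, piles up open files until the soft limit is hit.
held = [leaky_alloc() for _ in range(100)]
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(len(held), soft)  # exceed the soft limit and you get Errno 24
```

On a default Linux soft limit of 1024 fds, a few hundred loop iterations at two allocations each is exactly the scale where this starts failing.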

Thanks!

mharradon commented 7 years ago

I'm currently rerunning with del inputs after I call the function, to rule out the GC missing it (or some other reference being retained by me). Will report back with results.

mharradon commented 7 years ago

No such luck, still leaking :(

Will try manually closing after calling function.

astooke commented 7 years ago

Interesting! I've never seen this before... I did not realize you could hit a limit on this.

Shared memory is created through a memmap, which links to a filename. To parse the filenames listed there: "synk", followed by the process ID (91173), followed by the usage within synk (data), followed by a unique data object ID number, followed by a final tag that is unique within that data object (it is incremented whenever the object is reallocated).
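The naming scheme above can be split mechanically. This is a throwaway helper written just for illustration (parse_synk_name is hypothetical, not a synkhronos function):

```python
# Parse the /dev/shm naming scheme described above:
# synk_<pid>_<usage>_<data object id>_<reallocation tag>
def parse_synk_name(name):
    _, pid, usage, data_id, tag = name.split("_")
    return int(pid), usage, int(data_id), int(tag)

print(parse_synk_name("synk_91173_data_19_1"))  # -> (91173, 'data', 19, 1)
```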

Rather than calling build_inputs every time in the loop, build the inputs once before the loop. Then inside the loop, reuse the same synk data objects and set their values to the new data:

x = synk.data(var=x_var)
while True:
    new_dat = get_new_data()
    x.set_value(new_dat)

Alternatively, if you'd like to overwrite certain entries and the array won't change shape, you can write to it like a numpy array. This might save you a memory copy.

x = synk.data(value=first_data_array)
while True:
    x[:] = get_new_data()

If the new array will not be the same shape as the old one, use the set_value method and it will take care of the reallocation. (If it needs to allocate a bigger array, it discards the old memmap and makes a new one with the same name, but with that final tag incremented.)
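The reallocate-on-grow behavior described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the synkhronos internals (the class ShmemData and its methods here are invented; a temp directory stands in for /dev/shm):

```python
import mmap
import os
import tempfile

class ShmemData:
    """Hypothetical sketch of grow-by-reallocation: reuse the mapping
    while the new data fits, else drop the old memmap and create a new
    file under the same name with the final tag incremented."""

    def __init__(self, data_id, size, dirpath):
        self.dirpath, self.data_id = dirpath, data_id
        self.tag, self.size, self.mm = 0, 0, None
        self._alloc(size)

    def _path(self):
        # Mimics the naming scheme: synk_<pid>_data_<id>_<tag>
        return os.path.join(self.dirpath, "synk_%d_data_%d_%d"
                            % (os.getpid(), self.data_id, self.tag))

    def _alloc(self, size):
        self.free_memory()          # drop the old allocation first
        self.tag += 1               # same name, incremented final tag
        self.size = size
        fd = os.open(self._path(), os.O_CREAT | os.O_RDWR)
        os.ftruncate(fd, size)
        self.mm = mmap.mmap(fd, size)
        os.close(fd)                # the mapping keeps the file alive

    def set_value(self, data):
        if len(data) > self.size:   # too big: reallocate under new tag
            self._alloc(len(data))
        self.mm[:len(data)] = data  # small enough: write in place

    def free_memory(self):
        if self.mm is not None:
            self.mm.close()
            os.unlink(self._path())
            self.mm = None

with tempfile.TemporaryDirectory() as d:
    x = ShmemData(0, 16, d)
    x.set_value(b"small")              # fits: reuses the tag-1 mapping
    x.set_value(b"x" * 64)             # grows: reallocated as tag 2
    print(x.tag, len(os.listdir(d)))   # -> 2 1 (one file, new tag)
    x.free_memory()
```

Note the design point this illustrates: because each reallocation replaces the file rather than adding one, the file count per data object stays constant, which is exactly what the leak in this issue violates.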

mharradon commented 7 years ago

Ah, I like your way better!

Managed to fix it by calling free_memory() after calling the function:

inputs = train_func.build_inputs(*inputs)
train_func(*inputs)
for inp in inputs:
  inp.free_memory()

I fixed a few typos in data_module.py to get free_memory() to work. I'll submit a pull request, assuming those changes are correct.
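One caveat with the workaround above: if the function call raises, the loop of free_memory() calls never runs and the files leak anyway. A try/finally wrapper closes that gap. This is a hypothetical helper, not part of synkhronos, and the _Fake* classes below are stubs included only so the sketch runs on its own:

```python
from contextlib import contextmanager

@contextmanager
def synk_inputs(func, *args):
    """Build synk inputs, then free their shared memory even if the
    wrapped call raises (hypothetical helper, not a synkhronos API)."""
    inputs = func.build_inputs(*args)
    try:
        yield inputs
    finally:
        for inp in inputs:
            inp.free_memory()

# --- tiny stand-ins so this sketch is self-contained and runnable ---
class _FakeData:
    freed = False
    def free_memory(self):
        self.freed = True

class _FakeFunc:
    def build_inputs(self, *args):
        return [_FakeData() for _ in args]
    def __call__(self, *inputs):
        pass

f = _FakeFunc()
with synk_inputs(f, 1, 2) as inputs:
    f(*inputs)
print(all(inp.freed for inp in inputs))  # -> True: freed on exit
```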

astooke commented 7 years ago

Were you keeping references around to all the separate synk data objects?
I may have left a leak where the master process dereferences a synk data object but the workers are still holding onto it... hmm. Edit: or it may be held onto even in the master after you drop it... hmm.

mharradon commented 7 years ago

I was not leaving references around that I could find - I was also calling del inputs immediately after the function was called. So I think you may be right.