dmlc / MXNet.jl

MXNet Julia Package - flexible and efficient deep learning in Julia

Reuse executor for repeated prediction with different checkpoints #84

Open vchuravy opened 8 years ago

vchuravy commented 8 years ago

Testing the different checkpoints of a training run requires loading checkpoints and running predictions in a tight loop.

using MXNet

# load the network architecture once; only the weights change per checkpoint
net = mx.load(archfile, mx.SymbolicNode)
arch = mx.FeedForward(net, context=mx.gpu())

data = ...

for wfile in weights
  # checkpoint keys are prefixed "arg:" or "aux:", so split them back
  # into the two parameter dictionaries the model expects
  saved_dict = mx.load(wfile, mx.NDArray)
  arg_params = Dict{Base.Symbol, mx.NDArray}()
  aux_params = Dict{Base.Symbol, mx.NDArray}()
  for (k, v) in saved_dict
    tp, name = split(string(k), ':')
    name = symbol(name)
    if tp == "arg"
      arg_params[name] = v
    else
      aux_params[name] = v
    end
  end

  arch.arg_params = arg_params
  arch.aux_params = aux_params

  pred = mx.predict(arch, data)

  # For memory reclaim, eagerly finalize pred_exec.handle: the executor would
  # otherwise hold its GPU memory until the GC happens to collect it
  finalize(arch.pred_exec.handle)
  arch.pred_exec = nothing
end

Without those final two lines this easily runs out of memory for large models or batch sizes: the previous executor has not been garbage-collected yet, and we are creating a new one that asks for more memory.

If we could reuse the previous executor, that problem would be alleviated.

pluskid commented 8 years ago

Yes, the GC problem is generally painful here. I am wondering if there are (or are going to be) any good ways in Julia to manage external resources.
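
For reference, the general mechanism Julia does offer is finalizer registration. A minimal sketch (plain Julia, illustrative names, not MXNet.jl API) shows the idea, and also why it is not enough here: the release only happens whenever the GC eventually collects the object.

type ExternalBuffer            # hypothetical wrapper around a raw handle
  ptr :: Ptr{Void}
end

function make_buffer(nbytes)
  buf = ExternalBuffer(Libc.malloc(nbytes))   # stand-in for a device allocation
  finalizer(buf, b -> Libc.free(b.ptr))       # runs at some future GC pass
  return buf
end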

vchuravy commented 8 years ago

I like the way files are handled with do...end blocks, but I don't quite know how that would pan out in terms of memory management, and normally one shouldn't create and destroy executors in a loop anyway.
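
As a rough illustration, a do-block API for executors might look like the sketch below (entirely hypothetical: neither with_executor nor _init_predictor exists in MXNet.jl). The point is that the executor would be released deterministically when the block exits, instead of whenever the GC runs.

function with_executor(f, arch, data)
  exec = _init_predictor(arch, size(data))   # hypothetical setup helper
  try
    return f(exec)
  finally
    finalize(exec.handle)                    # eager release, as in the loop above
  end
end

# usage:
# with_executor(arch, data) do exec
#   ...compute predictions while exec is alive...
# end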

Ideally we would either reuse the current executor or free the previous one before creating a new one here: https://github.com/dmlc/MXNet.jl/blob/a2164ae43ab70d8be7708b7dc9974a5a6a360a8e/src/model.jl#L131

pluskid commented 8 years ago

do ... end is very limited and not easy to use in many cases.

For this particular case, I think it goes to the else branch and re-uses the executor unless overwrite is true: https://github.com/dmlc/MXNet.jl/blob/a2164ae43ab70d8be7708b7dc9974a5a6a360a8e/src/model.jl#L135

vchuravy commented 8 years ago

Which it is: https://github.com/dmlc/MXNet.jl/blob/a2164ae43ab70d8be7708b7dc9974a5a6a360a8e/src/model.jl#L186

I was wondering: would it be possible to use the same executor and just update the weights of the model?

pluskid commented 8 years ago

Oh, I see. Yes, it would be a good idea to add a function like sync_params to copy the parameters over. An even better approach is the module system recently introduced on the Python side: essentially the same executors are used for both training and prediction, and data parallelism could then be supported in prediction as well. But that definitely requires a fair amount of refactoring and porting.
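
A rough sketch of what such a helper could look like (hypothetical, not an existing MXNet.jl function; it assumes the Executor's arg_dict and aux_dict fields, which map parameter names to the NDArrays bound to the executor):

function sync_params!(exec :: mx.Executor, arg_params, aux_params)
  # copy new checkpoint weights into the already-allocated device arrays,
  # so no new executor (and no new GPU memory) is needed
  for (name, arr) in arg_params
    copy!(exec.arg_dict[name], arr)
  end
  for (name, arr) in aux_params
    copy!(exec.aux_dict[name], arr)
  end
end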