apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Can't pickle MXNet Modules #8955

Open fedorzh opened 7 years ago

fedorzh commented 7 years ago

Description

Can't pickle mxnet Modules

Environment info (Required)

>>> print pickle.__version__
$Revision: 72223 $
>>> print mx.__version__
0.12.0

Error Message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-024323561f98> in <module>()
      1 import pickle
----> 2 pickle.dumps(mlp_model)

/home/ubuntu/anaconda2/lib/python2.7/pickle.pyc in dumps(obj, protocol)
   1378 def dumps(obj, protocol=None):
   1379     file = StringIO()
-> 1380     Pickler(file, protocol).dump(obj)
   1381     return file.getvalue()
   1382 

/home/ubuntu/anaconda2/lib/python2.7/pickle.pyc in dump(self, obj)
    222         if self.proto >= 2:
    223             self.write(PROTO + chr(self.proto))
--> 224         self.save(obj)
    225         self.write(STOP)
    226 

/home/ubuntu/anaconda2/lib/python2.7/pickle.pyc in save(self, obj)
    329 
    330         # Save the reduce() output and finally memoize the object
--> 331         self.save_reduce(obj=obj, *rv)
    332 
    333     def persistent_id(self, obj):

/home/ubuntu/anaconda2/lib/python2.7/pickle.pyc in save_reduce(self, func, args, state, listitems, dictitems, obj)
    423 
    424         if state is not None:
--> 425             save(state)
    426             write(BUILD)
    427 

/home/ubuntu/anaconda2/lib/python2.7/pickle.pyc in save(self, obj)
    284         f = self.dispatch.get(t)
    285         if f:
--> 286             f(self, obj) # Call unbound method with explicit self
    287             return
    288 

/home/ubuntu/anaconda2/lib/python2.7/pickle.pyc in save_dict(self, obj)
    653 
    654         self.memoize(obj)
--> 655         self._batch_setitems(obj.iteritems())
    656 
    657     dispatch[DictionaryType] = save_dict

/home/ubuntu/anaconda2/lib/python2.7/pickle.pyc in _batch_setitems(self, items)
    667             for k, v in items:
    668                 save(k)
--> 669                 save(v)
    670                 write(SETITEM)
    671             return

/home/ubuntu/anaconda2/lib/python2.7/pickle.pyc in save(self, obj)
    304             reduce = getattr(obj, "__reduce_ex__", None)
    305             if reduce:
--> 306                 rv = reduce(self.proto)
    307             else:
    308                 reduce = getattr(obj, "__reduce__", None)

/home/ubuntu/anaconda2/lib/python2.7/copy_reg.pyc in _reduce_ex(self, proto)
     68     else:
     69         if base is self.__class__:
---> 70             raise TypeError, "can't pickle %s objects" % base.__name__
     71         state = base(self)
     72     args = (self.__class__, base, state)

TypeError: can't pickle module objects

Minimum reproducible example

import mxnet as mx
import pickle

net = mx.sym.Variable('data')
net = mx.sym.flatten(net)
net = mx.sym.FullyConnected(net, num_hidden=128)
net = mx.sym.Activation(net, act_type="relu")
net = mx.sym.FullyConnected(net, num_hidden=64)
net = mx.sym.Activation(net, act_type="relu")
net = mx.sym.FullyConnected(net, num_hidden=10)
net = mx.sym.SoftmaxOutput(net, name='softmax')
mlp_model = mx.mod.Module(symbol=net, context=mx.gpu())
pickle.dumps(mlp_model)  # raises TypeError: can't pickle module objects

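
The error itself is a plain Python limitation rather than anything MXNet-specific: pickle refuses to serialize any object whose state contains a module object, and the traceback shows the failure happening while saving the Module's attribute dict. A minimal sketch without MXNet (Holder and its backend attribute are made up for illustration):

```python
import pickle
import math  # stands in for any module object kept as an attribute

class Holder:
    """Hypothetical object that stores a module in its state,
    roughly what the traceback suggests mx.mod.Module does internally."""
    def __init__(self):
        self.backend = math  # a module object in __dict__

err = None
try:
    pickle.dumps(Holder())
except TypeError as exc:  # pickling the module reference fails
    err = exc

print(err)
```

Any single module-valued attribute anywhere in the object graph is enough to make the whole dump fail.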
edmBernard commented 7 years ago

Why do you want to pickle the whole Module class? You can save your symbol with: https://mxnet.incubator.apache.org/api/python/symbol.html#mxnet.symbol.Symbol.save

fedorzh commented 7 years ago

For example, I use parallel processing to distribute my training jobs, and joblib uses pickle for multiprocessing.

edmBernard commented 7 years ago

Honestly, I don't think Module can be pickled; MXNet has a lot of C++ inside. For multi-core CPU processing (if you don't use a GPU), MXNet supports configuration through environment variables: https://mxnet.incubator.apache.org/how_to/env_var.html You can also use NNPACK to parallelize training operations on the CPU.
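
For example (the variable name is from the env_var page linked above; the value is illustrative):

```shell
# Use 4 CPU worker threads for operator execution (illustrative value,
# set before the Python process that imports mxnet starts)
export MXNET_CPU_WORKER_NTHREADS=4
echo "$MXNET_CPU_WORKER_NTHREADS"
```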

fedorzh commented 7 years ago

If you take the old interface with mxnet.model, it can be pickled. Training the model is actually not the longest part of my pipeline (sometimes it adds an insignificant amount of time); a lot of other numpy-based machinery happens outside of it, and parallelization helps immensely with that. I have to run multiple processes with different seeds.
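
A common workaround when only part of an object is unpicklable is to drop the offending attributes in __getstate__ and rebuild them in __setstate__; whether that is feasible for Module's native handles is another question. A pure-Python sketch (Holder and its attributes are hypothetical):

```python
import pickle
import math  # stands in for an unpicklable module/handle

class Holder:
    def __init__(self):
        self.seed = 42        # ordinary, picklable state
        self.backend = math   # unpicklable module reference

    def __getstate__(self):
        # Exclude the module from the pickled state.
        state = self.__dict__.copy()
        del state['backend']
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.backend = math   # rebuild the dropped reference on unpickle

clone = pickle.loads(pickle.dumps(Holder()))
print(clone.seed)  # 42
```

This is presumably what the old mxnet.model interface does in spirit: only serialize reconstructible state.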

leleamol commented 6 years ago

Proposed labels: "Feature Request", "Module", "Python"

sliawatimena commented 6 years ago

I am using Windows 7, Anaconda Navigator 1.9.2, Python 3.6.6, and Jupyter Notebook 5.7.0, trying to learn the code from Gluon crash course chapter 5.

I already added: import pickle

I got stuck at

for data, label in train_data:
    print(data.shape, label.shape)
    break
---
AttributeError                            Traceback (most recent call last)
<ipython-input-8-91a66f98d1d2> in <module>()
----> 1 for data, label in train_data:
      2     print(data.shape, label.shape)
      3     break

E:\Anaconda\envs\mxnet\lib\site-packages\mxnet\gluon\data\dataloader.py in __iter__(self)
    282         # multi-worker
    283         return _MultiWorkerIter(self._num_workers, self._dataset,
--> 284                                 self._batchify_fn, self._batch_sampler)
    285 
    286     def __len__(self):

E:\Anaconda\envs\mxnet\lib\site-packages\mxnet\gluon\data\dataloader.py in __init__(self, num_workers, dataset, batchify_fn, batch_sampler)
    142                 args=(self._dataset, self._key_queue, self._data_queue, self._batchify_fn))
    143             worker.daemon = True
--> 144             worker.start()
    145             workers.append(worker)
    146 

E:\Anaconda\envs\mxnet\lib\multiprocessing\process.py in start(self)
    103                'daemonic processes are not allowed to have children'
    104         _cleanup()
--> 105         self._popen = self._Popen(self)
    106         self._sentinel = self._popen.sentinel
    107         # Avoid a refcycle if the target function holds an indirect

E:\Anaconda\envs\mxnet\lib\multiprocessing\context.py in _Popen(process_obj)
    221     @staticmethod
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224 
    225 class DefaultContext(BaseContext):

E:\Anaconda\envs\mxnet\lib\multiprocessing\context.py in _Popen(process_obj)
    320         def _Popen(process_obj):
    321             from .popen_spawn_win32 import Popen
--> 322             return Popen(process_obj)
    323 
    324     class SpawnContext(BaseContext):

E:\Anaconda\envs\mxnet\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
     63             try:
     64                 reduction.dump(prep_data, to_child)
---> 65                 reduction.dump(process_obj, to_child)
     66             finally:
     67                 set_spawning_popen(None)

E:\Anaconda\envs\mxnet\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61 
     62 #

AttributeError: Can't pickle local object 'Dataset.transform_first.<locals>.base_fn'

Please help. Thanks.

piyushghai commented 6 years ago

@sliawatimena Can you file a separate issue on this repository, and can you also provide a minimal reproducible example to help debug this?

From the stack trace you've posted, it's unclear where you are using pickle. Also, are you using pickle.dump or pickle.load?

sliawatimena commented 6 years ago

Dear @piyushghai,

I just copied the code from "5. Train the neural network"; steps 1-5 are okay. In step 6, the error message is as in my previous post.

From Googling, this looks like a Windows-specific problem with Python multiprocessing and Jupyter Notebook. Please help.
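
That matches the traceback: on Windows, multiprocessing uses spawn, so everything handed to a DataLoader worker (including the transform that transform_first wraps) must be picklable, and functions defined inside another function or notebook cell are not. A minimal illustration without MXNet (top_level and make_local are made-up names):

```python
import pickle

def top_level(x):          # defined at module level: picklable by reference
    return x + 1

def make_local():
    def local_fn(x):       # defined inside a function: not picklable
        return x + 1
    return local_fn

# The module-level function round-trips fine.
roundtripped = pickle.loads(pickle.dumps(top_level))

# The local function fails, mirroring
# "Can't pickle local object 'Dataset.transform_first.<locals>.base_fn'".
err = None
try:
    pickle.dumps(make_local())
except (AttributeError, pickle.PicklingError) as exc:
    err = exc
print(err)
```

So the usual fixes are to define the transform function at module level (in a .py file, not a notebook cell) or to set num_workers=0 so no worker processes are spawned.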

Thanks.

Suryadi

egeaydin commented 6 years ago

This issue helped me: https://github.com/apache/incubator-mxnet/issues/10562

train_data = gluon.data.DataLoader(
    mnist_train, batch_size=batch_size, shuffle=True, num_workers=0)

Change num_workers=4 to num_workers=0.

Also, do the same for validation data.

Hope this helps.