clab / dynet

DyNet: The Dynamic Neural Network Toolkit
Apache License 2.0

Does the decoding process need to keep the whole computation graph in memory? #582

Open baoy-nlp opened 7 years ago

baoy-nlp commented 7 years ago

Since the decoding process does not need backpropagation, can I set some parameter to reduce the memory that DyNet allocates?

neubig commented 7 years ago

Conceptually, no, this should not be necessary. In the forward pass of neural networks we can re-use memory for variables that are not going to be referenced anymore. The following page from MXNet has a good explanation of this: http://mxnet.io/architecture/note_memory.html
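The idea in that note can be sketched in a few lines of plain Python (a toy illustration, not DyNet code): count how many consumers each node's output still has, and recycle its buffer once the last consumer has run.

```python
# Toy illustration of forward-pass memory reuse: a node's output buffer
# can be recycled as soon as every consumer of that node has executed.

def forward_with_reuse(graph):
    """graph: dict mapping node -> list of argument nodes, in topological order."""
    # Count how many later nodes still need each node's output.
    consumers = {node: 0 for node in graph}
    for args in graph.values():
        for a in args:
            consumers[a] += 1

    free_buffers = []   # recycled buffer ids
    buffer_of = {}      # node -> buffer id
    next_buffer = 0     # total buffers ever allocated
    peak = 0            # maximum number of simultaneously live buffers

    for node, args in graph.items():
        # Allocate a fresh buffer only if no recycled one is available.
        buf = free_buffers.pop() if free_buffers else next_buffer
        if buf == next_buffer:
            next_buffer += 1
        buffer_of[node] = buf
        peak = max(peak, next_buffer - len(free_buffers))
        # After computing `node`, each argument has one fewer pending consumer;
        # an argument with no remaining consumers donates its buffer back.
        for a in args:
            consumers[a] -= 1
            if consumers[a] == 0:
                free_buffers.append(buffer_of[a])
    return peak, next_buffer

# A chain x -> h1 -> h2 -> h3 has 4 nodes but needs only 2 live buffers.
chain = {"x": [], "h1": ["x"], "h2": ["h1"], "h3": ["h2"]}
```

During training this reuse is unsafe, because backpropagation needs the intermediate values; it only applies to a pure forward pass such as decoding.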

However, in the current implementation in DyNet this is not supported. I'll put this as an enhancement, as it would be a very nice feature to have.

baoy-nlp commented 7 years ago

OK. Thanks a lot. I'm really looking forward to it.

davidweichiang commented 6 years ago

We're interested in helping with this. What would it entail? Since the memory pools can't free individual blocks, it seems non-trivial.

davidweichiang commented 6 years ago

@aargueta2

neubig commented 6 years ago

Yes, this is non-trivial. Creating a new DyNet memory allocator that just wraps a system call to malloc is easy. It would result in a speed reduction, but if it is an option rather than the default, I think this is reasonable.

The second thing to do would be ensuring that garbage collection happens when no remaining pointers to a memory block exist. This could basically be done by giving memory blocks an overridable delete operation: the current "no-op" behavior would be maintained by default, but the new memory-efficient allocator would do the appropriate reference counting and delete the block when the reference count reaches zero.
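A minimal sketch of that design (hypothetical class and method names, not DyNet's actual C++ interface) might look like this:

```python
# Hypothetical sketch of the overridable-delete scheme described above:
# the default allocator keeps the current "no-op" behavior, while the
# memory-efficient variant frees a block once its refcount hits zero.

class Block:
    def __init__(self, allocator, block_id):
        self.allocator = allocator
        self.block_id = block_id
        self.refcount = 0

    def retain(self):
        self.refcount += 1

    def release(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.allocator.on_free(self)  # overridable delete hook

class NoOpAllocator:
    """Mimics the current behavior: blocks are never individually freed."""
    def __init__(self):
        self.freed = []
    def on_free(self, block):
        pass  # memory stays in the pool until the whole graph is cleared

class RefCountingAllocator(NoOpAllocator):
    """Memory-efficient variant: reclaims a block when its last user is done."""
    def on_free(self, block):
        self.freed.append(block.block_id)
```

Each node in the graph would `retain()` the blocks of its arguments and `release()` them once its own forward value is computed, so a block's lifetime ends exactly when its last consumer finishes.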

One other concern is complexity: one of the nice things about DyNet is that it is not overly complex and bloated. We'd have to make sure that the changes can be made without making too many compromises with respect to simplicity.

So yes, this is non-trivial, but if it seems to be workable I'm happy to help you guys with the design @davidweichiang @aargueta2 (@redpony might also be interested)

As workarounds that you could do with the existing code (neither is pretty):

  1. use multiple GPUs for search
  2. search for a while, save the search state on the CPU, clear the graph, and restart the search
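The second workaround can be sketched in plain Python (a stand-in for the real pattern; in DyNet the copy-out would be something like `Expression.npvalue()` and the clear would be `renew_cg()`):

```python
# Illustration of workaround 2: run the search for a few steps, copy the
# live state out of the graph's memory, clear the graph, and resume from
# the copy. Only the checkpointed state survives each clear.

def search(init_state, total_steps, chunk=4):
    graph = []              # stands in for the computation graph's nodes
    state = init_state
    for step in range(total_steps):
        state = state + 1   # stand-in for one decoding step
        graph.append(state) # every step normally stays in the graph
        if (step + 1) % chunk == 0:
            checkpoint = state  # copy the state out (to CPU in DyNet)
            graph.clear()       # clear the graph: frees all intermediate nodes
            state = checkpoint  # re-inject the saved state and keep searching
    return state, len(graph)    # graph never holds more than `chunk` nodes
```

The final result is unchanged, but the graph's memory footprint is bounded by the chunk size instead of the full search length.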
davidweichiang commented 6 years ago

How about multiple memory pools? Is there support for that already? That would allow a kind of manual management (which is sort of what we're doing already):

  1. encoder and attention create nodes in pool E
  2. decoder initial state in pool D
  3. decoder adds one input word, putting intermediate nodes in pool T and the new hidden state/cell in pool D'
  4. free pools D and T
  5. decoder adds one input word, putting intermediate nodes in pool T and the new hidden state/cell in pool D
  6. free pools D' and T, then go to step 3
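The scheme above can be simulated in a few lines of plain Python (a sketch of the proposed scheme, not an existing DyNet API): the encoder lives in a long-lived pool E, while the decoder ping-pongs its state between pools D and D', freeing the per-step scratch pool T after every word.

```python
# Simulation of the manual multi-pool scheme: E is long-lived, the
# decoder state alternates between D and D', and scratch pool T is
# freed after every decoding step.

class Pool:
    def __init__(self, name):
        self.name = name
        self.nodes = []
    def add(self, node):
        self.nodes.append(node)
    def free(self):
        self.nodes.clear()

E, T = Pool("E"), Pool("T")
D, D_alt = Pool("D"), Pool("D'")

E.add("encoder/attention")       # step 1: long-lived encoder nodes
D.add("state0")                  # step 2: initial decoder state
for step in range(1, 4):         # steps 3-6, repeated per word
    T.add(f"scratch{step}")      # intermediate nodes for this word
    D_alt.add(f"state{step}")    # new hidden state/cell in the other pool
    D.free(); T.free()           # old state and scratch are reclaimed
    D, D_alt = D_alt, D          # swap pools for the next word

# At any point, only E, one state pool, and the scratch pool are live.
```

With this double-buffering, peak memory is independent of the output length: it is bounded by the encoder pool plus two decoder states plus one step's worth of scratch.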
davidweichiang commented 6 years ago

Do you have an estimate of how much slowdown there would be from using malloc?

And, do you envision any complications arising from nodes having their argument node pointers invalidated?

neubig commented 6 years ago

Note: the following PR will allow for this https://github.com/clab/dynet/pull/1064