deeplearning4j / deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch; a modular and tiny C++ library for running math code; and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...
http://deeplearning4j.konduit.ai
Apache License 2.0

DL4J: Move word vectors off-heap for WordVectorSerializer.loadStaticModel #6721

Open AlexDBlack opened 6 years ago

AlexDBlack commented 6 years ago

I thought there was already an issue for this, but I can't find it. Here's what the heap looks like when loading the Google News vectors via loadStaticModel, with nothing else in the JVM: [heap dump screenshot] That's a total of 67.4 million objects for a vocab of 3 million words.

One option here: https://github.com/OpenHFT/Chronicle-Map. If we used Chronicle Map, it would store the vectors off-heap. I was thinking we have a few options here for the internal representation.

There's also the idea of using a trie instead of a Map<String,X>, as that would reduce the memory needed to store the strings: https://en.wikipedia.org/wiki/Trie
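For illustration, here's a minimal sketch of what a vocab trie could look like; VocabTrie and its methods are hypothetical names, not existing DL4J classes. The trie maps each word to a row index in one big [vocabSize, vectorLength] matrix, so no per-word String or INDArray objects are retained. Note that a naive HashMap-per-node trie like this only shows the lookup structure; real memory savings would need a compact encoding such as a double-array trie or a DAWG.

```java
import java.util.HashMap;
import java.util.Map;

public class VocabTrie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        int rowIndex = -1; // -1 means "no word ends at this node"
    }

    private final Node root = new Node();

    /** Associate a word with a row index in the shared vector matrix. */
    public void put(String word, int rowIndex) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.rowIndex = rowIndex;
    }

    /** Return the row index for a word, or -1 if the word is absent. */
    public int get(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) return -1;
        }
        return node.rowIndex;
    }
}
```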

Edit: another thought here is that we often want to do 'batch' lookup. It probably makes sense to do the batch lookup and create the INDArray in a single op, rather than doing N lookups and combining manually in Java. So storage.get("word1", "word2", "word3") would return an INDArray with shape [3,300], rather than 3 get ops (each returning [1,300]) plus Nd4j.vstack.
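A rough sketch of what that batch API could look like; VectorStorage, copyRow, and vectorLength are hypothetical names introduced here for illustration. The point is a single [n, vectorLength] allocation instead of n temporary [1, vectorLength] arrays plus a vstack:

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public interface VectorStorage {
    int vectorLength();

    /** Copy the vector for 'word' into row 'row' of 'out' (implementation-specific). */
    void copyRow(String word, INDArray out, int row);

    /** Batch lookup: one allocation, one result, no intermediate [1,300] arrays. */
    default INDArray get(String... words) {
        INDArray out = Nd4j.create(words.length, vectorLength());
        for (int i = 0; i < words.length; i++) {
            copyRow(words[i], out, i);
        }
        return out;
    }
}

// usage: INDArray batch = storage.get("word1", "word2", "word3"); // shape [3, 300]
```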

Aha! Link: https://skymindai.aha.io/features/DL4J-48

agibsonccc commented 6 years ago

Maybe we could look at memory-mapped workspaces?

AlexDBlack commented 6 years ago

Memory-mapped files might have a role to play for the underlying storage. The main thing here is to get rid of all the on-heap objects (Strings, INDArrays, etc.) while still providing O(1) lookup by String and by index.
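For reference, a minimal sketch of how the backing matrix could live in an ND4J memory-mapped workspace, assuming the workspace API of the 1.0.0-beta line (WorkspaceConfiguration with LocationPolicy.MMAP); the class name and sizes are illustrative only:

```java
import org.nd4j.linalg.api.memory.MemoryWorkspace;
import org.nd4j.linalg.api.memory.conf.WorkspaceConfiguration;
import org.nd4j.linalg.api.memory.enums.LocationPolicy;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class MmapVectorStorage {
    public static void main(String[] args) {
        // ~3.6 GB sized up front: 3M words x 300 floats x 4 bytes
        WorkspaceConfiguration mmapConfig = WorkspaceConfiguration.builder()
                .initialSize(3_000_000L * 300L * 4L)
                .policyLocation(LocationPolicy.MMAP) // back the workspace with a mapped file
                .build();

        try (MemoryWorkspace ws = Nd4j.getWorkspaceManager()
                .getAndActivateWorkspace(mmapConfig, "vectors")) {
            // Allocated inside the mapped file, not on the Java heap
            INDArray vectors = Nd4j.create(3_000_000, 300);
            // ... populate rows and look them up while the workspace is open
        }
    }
}
```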

AlexDBlack commented 6 years ago

Quick experiment with ChronicleMap: I was hoping this could be an easy drop-in replacement for the current map-based storage, and it looks promising. https://gist.github.com/AlexDBlack/c64b5b874fe2668563d04d3fc971ddc9

Insertion speed: INDArrays/sec
115000 - 10739.633918565558/sec (overall) - 10752.68817204301/sec (last 100)
116000 - 10740.74074074074/sec (overall) - 10869.565217391304/sec (last 100)
...
2991000 - 17413.427726415313/sec (overall) - 19607.843137254902/sec (last 100)
2992000 - 17413.977743632724/sec (overall) - 19230.76923076923/sec (last 100)

It took about 172 seconds to do the conversion on my system... the limiting factor is probably the serialization cost. Note it's critical to disable GC here; otherwise GC overhead is a major performance issue.

Edit: 8 threads gets it down to 78.5 seconds to convert... https://gist.github.com/AlexDBlack/0751e94a7c5947c37a69277bed49ea38
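For anyone following along, a minimal sketch along these lines (not the gist itself) using Chronicle Map's builder API, with plain float[] values so both keys and values are serialized off-heap; the class name and the average-key/vector sizes are illustrative assumptions. Disabling ND4J's periodic GC, as noted above, is shown via togglePeriodicGc:

```java
import net.openhft.chronicle.map.ChronicleMap;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class OffHeapVectors {
    public static void main(String[] args) {
        int vocabSize = 3_000_000;
        int vectorLength = 300;

        // As noted above, periodic GC is worth disabling during the bulk
        // conversion, otherwise GC overhead dominates.
        Nd4j.getMemoryManager().togglePeriodicGc(false);

        // Keys and values are serialized into off-heap memory; no per-word
        // String or INDArray objects remain on the heap after insertion.
        ChronicleMap<String, float[]> vectors = ChronicleMap
                .of(String.class, float[].class)
                .name("word-vectors")
                .entries(vocabSize)
                .averageKey("averageLengthWord")
                .averageValue(new float[vectorLength])
                .create();

        vectors.put("example", new float[vectorLength]);
        INDArray row = Nd4j.create(vectors.get("example")); // rebuild an INDArray on demand
    }
}
```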

AlexDBlack commented 6 years ago

Another interesting observation (cc @raver119): after the word vectors have been cleared from memory (all references removed and GC'd), here's what things look like: [heap dump screenshot]

That's over 500,000 CUDA pointers in a hash map (a cache)... maybe we should reduce the default cache settings to avoid this? 500k objects is still enough to cause noticeable garbage collector pressure.

Davidixxus commented 4 years ago

@AlexDBlack Did you make any progress on this? I'm quite interested.

raver119 commented 4 years ago

Yes, we've implemented all the required prerequisites for this. The next step will be the actual vocab re-implementation, with the native bits underneath.