deeplearning4j / deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch; a modular and tiny C++ library for running math code; and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...
http://deeplearning4j.konduit.ai
Apache License 2.0

DL4J: Move word vectors off-heap for WordVectorSerializer.loadStaticModel #6721

Open AlexDBlack opened 6 years ago

AlexDBlack commented 6 years ago

I thought there was already an issue for this, but I can't find it. Here's what the heap looks like when loading the Google News vectors via loadStaticModel, with nothing else in the JVM: [heap dump screenshot] That's a total of 67.4 million objects for a vocab of 3 million words.

One option here: https://github.com/OpenHFT/Chronicle-Map. If we used Chronicle Map, it would store the vectors off-heap. I was thinking we have a few options here for the internal representation.

There's also the idea of using a trie instead of a Map<String,X>, as that would reduce the memory needed to store the strings: https://en.wikipedia.org/wiki/Trie
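For illustration, here's a minimal sketch of what a vocab trie could look like; VocabTrie and its methods are hypothetical names, not existing DL4J classes. The trie maps each word to a row index in one big [vocabSize, vectorLength] matrix, so no per-word String or INDArray objects are retained. Note that a naive HashMap-per-node trie like this only shows the lookup structure; real memory savings would need a compact encoding such as a double-array trie or a DAWG.

```java
import java.util.HashMap;
import java.util.Map;

public class VocabTrie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        int rowIndex = -1; // -1 means "no word ends at this node"
    }

    private final Node root = new Node();

    /** Associate a word with a row index in the shared vector matrix. */
    public void put(String word, int rowIndex) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.rowIndex = rowIndex;
    }

    /** Return the row index for a word, or -1 if the word is absent. */
    public int get(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) return -1;
        }
        return node.rowIndex;
    }
}
```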

Edit: another thought here is that we often want to do 'batch' lookup. It probably makes sense to do the batch lookup and create the INDArray in a single op, rather than doing N lookups and combining manually in Java. So storage.get("word1", "word2", "word3") would return an INDArray with shape [3,300], rather than 3 get ops (each returning [1,300]) plus Nd4j.vstack.
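A rough sketch of what that batch API could look like; VectorStorage, copyRow, and vectorLength are hypothetical names introduced here for illustration. The point is a single [n, vectorLength] allocation instead of n temporary [1, vectorLength] arrays plus a vstack:

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public interface VectorStorage {
    int vectorLength();

    /** Copy the vector for 'word' into row 'row' of 'out' (implementation-specific). */
    void copyRow(String word, INDArray out, int row);

    /** Batch lookup: one allocation, one result, no intermediate [1,300] arrays. */
    default INDArray get(String... words) {
        INDArray out = Nd4j.create(words.length, vectorLength());
        for (int i = 0; i < words.length; i++) {
            copyRow(words[i], out, i);
        }
        return out;
    }
}

// usage: INDArray batch = storage.get("word1", "word2", "word3"); // shape [3, 300]
```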

Aha! Link: https://skymindai.aha.io/features/DL4J-48

agibsonccc commented 6 years ago

Maybe we could look at memory-mapped workspaces?

AlexDBlack commented 6 years ago

Memory-mapped files might have a role to play for the underlying storage. The main thing here is to get rid of all the on-heap objects (Strings, INDArrays, etc.) while still providing O(1) lookup by String and by index.
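For reference, a minimal sketch of how the backing matrix could live in an ND4J memory-mapped workspace, assuming the workspace API of the 1.0.0-beta line (WorkspaceConfiguration with LocationPolicy.MMAP); the class name and sizes are illustrative only:

```java
import org.nd4j.linalg.api.memory.MemoryWorkspace;
import org.nd4j.linalg.api.memory.conf.WorkspaceConfiguration;
import org.nd4j.linalg.api.memory.enums.LocationPolicy;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class MmapVectorStorage {
    public static void main(String[] args) {
        // ~3.6 GB sized up front: 3M words x 300 floats x 4 bytes
        WorkspaceConfiguration mmapConfig = WorkspaceConfiguration.builder()
                .initialSize(3_000_000L * 300L * 4L)
                .policyLocation(LocationPolicy.MMAP) // back the workspace with a mapped file
                .build();

        try (MemoryWorkspace ws = Nd4j.getWorkspaceManager()
                .getAndActivateWorkspace(mmapConfig, "vectors")) {
            // Allocated inside the mapped file, not on the Java heap
            INDArray vectors = Nd4j.create(3_000_000, 300);
            // ... populate rows and look them up while the workspace is open
        }
    }
}
```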

AlexDBlack commented 6 years ago

Quick experiment with ChronicleMap: I was hoping this could be an easy drop-in replacement for the current map-based storage, and it looks promising. https://gist.github.com/AlexDBlack/c64b5b874fe2668563d04d3fc971ddc9

Insertion speed: INDArrays/sec
115000 - 10739.633918565558/sec (overall) - 10752.68817204301/sec (last 100)
116000 - 10740.74074074074/sec (overall) - 10869.565217391304/sec (last 100)
...
2991000 - 17413.427726415313/sec (overall) - 19607.843137254902/sec (last 100)
2992000 - 17413.977743632724/sec (overall) - 19230.76923076923/sec (last 100)

It took about 172 seconds to do the conversion on my system... the limiting factor is probably the serialization cost. Note it's critical to disable GC here; otherwise GC overhead is a major performance issue.

Edit: 8 threads gets it down to 78.5 seconds to convert... https://gist.github.com/AlexDBlack/0751e94a7c5947c37a69277bed49ea38
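For anyone following along, a minimal sketch along these lines (not the gist itself) using Chronicle Map's builder API, with plain float[] values so both keys and values are serialized off-heap; the class name and the average-key/vector sizes are illustrative assumptions. Disabling ND4J's periodic GC, as noted above, is shown via togglePeriodicGc:

```java
import net.openhft.chronicle.map.ChronicleMap;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class OffHeapVectors {
    public static void main(String[] args) {
        int vocabSize = 3_000_000;
        int vectorLength = 300;

        // As noted above, periodic GC is worth disabling during the bulk
        // conversion, otherwise GC overhead dominates.
        Nd4j.getMemoryManager().togglePeriodicGc(false);

        // Keys and values are serialized into off-heap memory; no per-word
        // String or INDArray objects remain on the heap after insertion.
        ChronicleMap<String, float[]> vectors = ChronicleMap
                .of(String.class, float[].class)
                .name("word-vectors")
                .entries(vocabSize)
                .averageKey("averageLengthWord")
                .averageValue(new float[vectorLength])
                .create();

        vectors.put("example", new float[vectorLength]);
        INDArray row = Nd4j.create(vectors.get("example")); // rebuild an INDArray on demand
    }
}
```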

AlexDBlack commented 6 years ago

Another interesting observation (cc @raver119): after the word vectors have been cleared from memory (all references removed and GC'd), here's what things look like: [heap dump screenshot]

That's over 500,000 CUDA pointers in a hash map (a cache)... maybe we should reduce the default cache settings to avoid this? 500k objects is still enough to cause noticeable garbage collector pressure.

Davidixxus commented 4 years ago

@AlexDBlack Did you make any progress on this? I'm quite interested.

raver119 commented 4 years ago

Yes, we've implemented all the required prerequisites for this. The next step will be the actual vocab re-implementation, with the native bits underneath.