Open AlexDBlack opened 6 years ago
Maybe we could look at mem mapped workspaces?
Memory mapped files might have a role to play for the underlying storage. The main thing here is to get rid of all the on-heap objects (Strings and INDArrays etc) while still providing for O(1) lookup by String and by index.
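To make the memory-mapped idea concrete, here's a minimal hypothetical sketch (not DL4J code; all names are made up) of fixed-width float vectors stored in a memory-mapped file via `java.nio`. Lookup by index is O(1) offset arithmetic; a separate `String -> index` map would provide the lookup by word.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: fixed-width vectors in a memory-mapped file.
// No on-heap INDArray/String objects are needed for the vector data itself.
public class MmapVectorStore {
    private final MappedByteBuffer buffer;
    private final int dim;

    public MmapVectorStore(Path file, int dim, int numVectors) throws IOException {
        this.dim = dim;
        long bytes = (long) numVectors * dim * Float.BYTES;
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            this.buffer = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes);
        }
    }

    public void put(int index, float[] vector) {
        int base = index * dim * Float.BYTES;           // fixed-width record offset
        for (int i = 0; i < dim; i++) buffer.putFloat(base + i * Float.BYTES, vector[i]);
    }

    public float[] get(int index) {
        float[] out = new float[dim];
        int base = index * dim * Float.BYTES;
        for (int i = 0; i < dim; i++) out[i] = buffer.getFloat(base + i * Float.BYTES);
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("vectors", ".bin");
        MmapVectorStore store = new MmapVectorStore(f, 3, 10);
        store.put(4, new float[]{1.5f, -2.0f, 0.25f});
        float[] v = store.get(4);
        System.out.println(v[0] + " " + v[1] + " " + v[2]); // 1.5 -2.0 0.25
        Files.deleteIfExists(f); // note: the mapping may pin the file on Windows
    }
}
```

A real implementation would still need the on-heap (or off-heap) word-to-index map, which is where something like Chronicle Map or a trie comes in.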
Quick experiment with ChronicleMap: I was hoping this could be an easy drop-in replacement for the current map-based storage, and it looks promising. https://gist.github.com/AlexDBlack/c64b5b874fe2668563d04d3fc971ddc9
Insertion speed: INDArrays/sec
115000 - 10739.633918565558/sec (overall) - 10752.68817204301/sec (last 100)
116000 - 10740.74074074074/sec (overall) - 10869.565217391304/sec (last 100)
...
2991000 - 17413.427726415313/sec (overall) - 19607.843137254902/sec (last 100)
2992000 - 17413.977743632724/sec (overall) - 19230.76923076923/sec (last 100)
It took about 172 sec to do the conversion on my system... the limiting factor is probably the serialization cost. Note that it's critical to disable GC here; otherwise GC overhead is a major performance issue.
Edit: 8 threads gets it down to 78.5 seconds to convert... https://gist.github.com/AlexDBlack/0751e94a7c5947c37a69277bed49ea38
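The gist has the real conversion code; a minimal sketch of the partitioning idea (not the gist itself; a `ConcurrentHashMap` stands in here for the ChronicleMap instance, which also supports concurrent puts) might look like:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: split the vocab into contiguous slices, one thread per slice,
// so serialization cost is paid in parallel.
public class ParallelInsert {
    public static Map<Integer, float[]> convert(int vocabSize, int dim, int threads)
            throws InterruptedException {
        Map<Integer, float[]> store = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = (vocabSize + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int start = t * chunk;
            final int end = Math.min(vocabSize, start + chunk);
            pool.submit(() -> {
                for (int i = start; i < end; i++) {
                    float[] vec = new float[dim]; // stand-in for the real word vector
                    vec[0] = i;                   // marker value for verification
                    store.put(i, vec);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return store;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<Integer, float[]> m = convert(100_000, 300, 8);
        System.out.println(m.size()); // 100000
    }
}
```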
Another interesting observation (cc @raver119) - after the word vectors have been cleared from memory (all references removed + GC'd), here's what things look like:
That's over 500,000 CUDA pointers in a hash map (cache)... maybe we should reduce the default cache settings to avoid this? 500k objects is still enough to cause noticeable garbage collector pressure.
@AlexDBlack Did you make any progress on this? I'm quite interested.
Yes, we've implemented all the required prerequisites for this. The next step will be the actual vocab re-implementation, with native bits inside.
I thought there was already an issue for this, but I can't find it. Here's what the heap looks like after loading the Google News vectors via loadStaticModel, with nothing else in the JVM: a total of 67.4 million objects for a vocab of 3 million words.

One option here: https://github.com/OpenHFT/Chronicle-Map If we used Chronicle Map, it would store the vectors off-heap. I was thinking we have a few options for the internal representation:

- `Map<String,Integer>` plus `Map<Integer,INDArray>` (the integers being the index)
- `Map<String,Integer>` plus `Map<String,INDArray>`
- `Map<String,Pair<Long,Long>>` plus a float pointer (the long values being offsets for each word). Might be faster to load due to less preprocessing.

There's also the idea of using a trie instead of a `Map<String,X>`, as that would give us reduced memory for storing the strings: https://en.wikipedia.org/wiki/Trie

Edit: another thought here is that we often want to do 'batch' lookup. It probably makes sense to do batch lookup and create an INDArray in a single op, rather than do N lookups and combine manually in Java. So `storage.get("word1", "word2", "word3")` would return an INDArray with shape [3,300], rather than 3x get ops ([1,300]) plus Nd4j.vstack.
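The batch-lookup idea can be sketched roughly as follows (hypothetical names throughout; a flat row-major `float[]` stands in for the `[n, 300]` INDArray a real implementation would return): resolve every word, then copy each vector into one pre-allocated buffer, rather than N separate gets followed by a vstack.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of single-op batch lookup: one allocation, one pass,
// instead of N lookups plus a vstack in Java.
public class BatchLookup {
    private final Map<String, float[]> table = new HashMap<>();
    private final int dim;

    public BatchLookup(int dim) { this.dim = dim; }

    public void put(String word, float[] vec) { table.put(word, vec); }

    // Returns a row-major buffer of shape [words.length, dim].
    public float[] get(String... words) {
        float[] out = new float[words.length * dim];
        for (int row = 0; row < words.length; row++) {
            float[] vec = table.get(words[row]);
            if (vec == null) throw new IllegalArgumentException("unknown word: " + words[row]);
            System.arraycopy(vec, 0, out, row * dim, dim);
        }
        return out;
    }

    public static void main(String[] args) {
        BatchLookup store = new BatchLookup(2);
        store.put("word1", new float[]{1f, 2f});
        store.put("word2", new float[]{3f, 4f});
        float[] batch = store.get("word1", "word2"); // shape [2, 2], row-major
        System.out.println(batch[0] + " " + batch[3]); // 1.0 4.0
    }
}
```

The same shape of API would apply whether the backing store is a map, Chronicle Map, or an offset table into a memory-mapped file.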
Aha! Link: https://skymindai.aha.io/features/DL4J-48