Closed: carschno closed this issue 7 years ago.
This is actually trickier than I assumed. Any hints on what might be the most efficient (and practical) approach to store <token> <double[]> 'documents'? The <token> field/column could serve as the unique id.
This is possible in Lucene, but it is not really designed for that purpose: it requires a field for each double value in the array, and we don't need any search functionality. What might be a more suitable alternative that ideally does not require the installation of a local server? Any experience? What might be worth a look? SQLite, MongoDB, ...? Stick with Lucene?
On disk? But you still want to be able to do a lookup based on the token?
Yes. The idea is to cache the entries in a (fast) index on disk if there are too many to store in memory. Currently, they are stored in a HashMap, but this might become too large with potentially millions of entries.
There is an implementation of a token/vector store based on BerkeleyDB in DKPro Similarity (https://github.com/dkpro/dkpro-similarity/tree/master/dkpro-similarity-algorithms-vsm-asl/src/main/java/dkpro/similarity/algorithms/vsm/store/vectorindex) - but the last time I tried it to store topic models, it was actually less space-efficient than a simple text file, despite smartly switching between sparse and dense representations and despite using compression internally - no idea why.
So, you could try that - or invent your own file format, e.g. a binary format with a sorted token index followed by the vectors.
DataInputStream and DataOutputStream are good candidates for operating on binary files.
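For instance, a minimal round trip with those two classes might look roughly like this (purely illustrative; the class and method names are made up, and this is not the format that was eventually adopted):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.Map;

// Illustrative only: one token/vector entry written and read back again.
public class BinaryRoundTrip {

    static void writeEntry(DataOutputStream out, String token, double[] vector)
            throws IOException {
        byte[] bytes = token.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);   // byte length of the UTF-8 encoded token
        out.write(bytes);
        out.writeInt(vector.length);  // number of vector components
        for (double d : vector) {
            out.writeDouble(d);
        }
    }

    static Map.Entry<String, double[]> readEntry(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        String token = new String(bytes, StandardCharsets.UTF_8);
        double[] vector = new double[in.readInt()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = in.readDouble();
        }
        return new AbstractMap.SimpleEntry<>(token, vector);
    }
}
```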
Sounds promising; the only thing that worries me is the "sorted index": how to (quickly) look up a certain entry without loading the whole list into memory? A skip list might be a way.
I assume that the vectors take the bulk of the space on disk. The index would be much smaller, requiring only the tokens and their addresses within the file to be kept in memory. A trie or FST might be suitable to keep the memory requirements low. LanguageTool uses https://github.com/morfologik/morfologik-stemming to maintain their dictionaries. I don't know how flexible the library is, but I would guess that at least the token->address lookup could be efficiently handled through it.
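As a rough sketch of that idea (a plain TreeMap standing in here for a trie/FST, and a hypothetical file layout in which each vector's byte offset is already known):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.TreeMap;

// Sketch: only the token -> file offset mapping is kept in memory;
// the (much larger) vectors are read from disk on demand.
public class TokenOffsetIndex {
    private final TreeMap<String, Long> offsets = new TreeMap<>();
    private final RandomAccessFile data;
    private final int vectorSize;

    public TokenOffsetIndex(RandomAccessFile data, int vectorSize) {
        this.data = data;
        this.vectorSize = vectorSize;
    }

    public void put(String token, long byteOffset) {
        offsets.put(token, byteOffset);
    }

    public double[] get(String token) throws IOException {
        Long offset = offsets.get(token);
        if (offset == null) {
            return null; // unknown token
        }
        data.seek(offset);
        double[] vector = new double[vectorSize];
        for (int i = 0; i < vectorSize; i++) {
            vector[i] = data.readDouble();
        }
        return vector;
    }
}
```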
A solution based on loading the file using mmap looks most promising.
I actually have such an implementation in the meantime... it's based on an initial implementation by @treo. I can commit it to DKPro Core if you want. @treo licensed the original sources under ASL and made quite a few changes, so it should not be a problem to continue evolving it here for the moment.
Please do, sounds promising!
A quick rundown on the implementation I made for storing word embeddings:
It is a binary format that can be directly mmap'ed. The files use a header in the following format:
[Magic Bytes: "dl4jw2v" ASCII encoded][Format Version, 1 byte integer][Token Number, 4 byte Integer][Embedding Size, 4 byte Integer][token section length, 4 byte integer][Uses Doubles?, 1 byte boolean]
The header is followed by all tokens, which are written as [Length, 4 byte integer][Token, dynamic width, encoded as UTF-8]
and then all embeddings simply as [Embedding, given width in floats times 4 bytes]
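A sketch of a writer following the layout described above (field order and types as listed; anything beyond that, such as the float-only handling and the class name, is an assumption and may differ from the actual implementation):

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;

// Sketch of a writer following the described layout; the real implementation
// may differ in details.
public class BinaryVectorWriter {
    public static void write(SortedMap<String, float[]> embeddings, String path,
            int embeddingSize) throws IOException {
        // Pre-compute the token section length: 4 length bytes + UTF-8 bytes per token.
        int tokenSectionLength = 0;
        for (String token : embeddings.keySet()) {
            tokenSectionLength += 4 + token.getBytes(StandardCharsets.UTF_8).length;
        }
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            out.write("dl4jw2v".getBytes(StandardCharsets.US_ASCII)); // magic bytes
            out.writeByte(1);                     // format version
            out.writeInt(embeddings.size());      // number of tokens
            out.writeInt(embeddingSize);          // embedding size
            out.writeInt(tokenSectionLength);     // token section length
            out.writeBoolean(false);              // uses doubles? (floats in this sketch)
            // Token section, in sorted order.
            for (String token : embeddings.keySet()) {
                byte[] bytes = token.getBytes(StandardCharsets.UTF_8);
                out.writeInt(bytes.length);
                out.write(bytes);
            }
            // Embedding section, in the same order as the tokens.
            for (float[] vector : embeddings.values()) {
                for (float value : vector) {
                    out.writeFloat(value);
                }
            }
        }
    }
}
```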
The tokens are sorted alphabetically, and the vectors are written in the same order. When the file is loaded, only the tokens themselves need to be held in memory for efficient position calculation (via binary search). This position is then used to directly address the corresponding vector.
Java has a limitation that it can't directly address mmap'ed files that are larger than 2 GB (due to its integer size limit), but I worked around that by chunking the access.
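Roughly, the lookup and the chunked access could be sketched like this (illustrative names; the chunk-size handling is an assumption, and the real implementation may differ):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Sketch only: tokens are kept in memory in sorted order; the vector section
// of the file is mapped in chunks so files larger than 2 GB can be addressed.
public class ChunkedVectorReader {
    private final String[] sortedTokens;     // sorted alphabetically, as in the file
    private final MappedByteBuffer[] chunks;
    private final int vectorSize;
    private final long chunkSize;            // a whole number of vectors per chunk

    public ChunkedVectorReader(String[] sortedTokens, String path, long vectorSectionStart,
            int vectorSize) throws IOException {
        this.sortedTokens = sortedTokens;
        this.vectorSize = vectorSize;
        int vectorBytes = vectorSize * Float.BYTES;
        // Round the chunk size down to a multiple of the vector byte length so
        // that a single vector never straddles two mapped chunks.
        this.chunkSize = (Integer.MAX_VALUE / vectorBytes) * (long) vectorBytes;
        try (FileChannel channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            long sectionLength = channel.size() - vectorSectionStart;
            int chunkCount = (int) ((sectionLength + chunkSize - 1) / chunkSize);
            chunks = new MappedByteBuffer[chunkCount];
            for (int i = 0; i < chunkCount; i++) {
                long offset = vectorSectionStart + i * chunkSize;
                long size = Math.min(chunkSize, channel.size() - offset);
                chunks[i] = channel.map(FileChannel.MapMode.READ_ONLY, offset, size);
            }
        }
    }

    public float[] vector(String token) {
        int position = Arrays.binarySearch(sortedTokens, token);
        if (position < 0) {
            return null; // unknown token
        }
        long byteOffset = position * (long) vectorSize * Float.BYTES;
        MappedByteBuffer chunk = chunks[(int) (byteOffset / chunkSize)];
        int inChunk = (int) (byteOffset % chunkSize);
        float[] vector = new float[vectorSize];
        for (int i = 0; i < vectorSize; i++) {
            vector[i] = chunk.getFloat(inChunk + i * Float.BYTES);
        }
        return vector;
    }
}
```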
Note that a limitation of this file format and Java is that you can't have more than 2^31-1 (i.e. about 2.15 billion) tokens. Also, all of your tokens together along with their lengths (i.e. the token section) may not be larger than about 2 GB. (You also cannot have embeddings with more than 2^31-1 elements, but that would be impractical anyway.)
While those limitations are real, in practice they may not be relevant. The file for my German Wikipedia based corpus is about 5.5 GB and contains 4.3 million embeddings with a vector size of 300. On a very fast SSD (300k IOPS read), it takes on average only slightly longer to access that data than it takes to access the same data from memory.
So it looks like it may be appropriate for your use case. I'm not sure how @reckart has evolved it since then, but I guess adding some more information about the original version should help, especially since I originally gave him just the code.
As an empty or non-existent file isn't a properly written word embedding file, I think this should be either an assertion error or some kind of argument exception.
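Something along these lines, for example (hypothetical sketch, not the actual code under review):

```java
import java.io.File;

class EmbeddingsFileCheck {
    // Hypothetical sketch: reject empty or missing embedding files up front
    // instead of only logging a warning.
    static void assertValidEmbeddingsFile(File file) {
        if (!file.exists() || file.length() == 0) {
            throw new IllegalArgumentException("Not a valid word embeddings file: " + file);
        }
    }
}
```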
@treo pointed out in chat that writing regular UTF-8 (as in his original implementation) would also be more sensible with respect to portability to other languages than using Java's writeUTF, which produces modified UTF-8.
@treo, @reckart I'm taking care of both your suggestions, thanks! In fact, I have already changed the log level from warning to error for the case of empty input maps, but I agree that this should raise an exception.
@treo Also, generally feel free to commit pull requests if you like.
With regard to encoding: I am replacing all calls to writeUTF(s) with write(s.getBytes(charset)), where charset points to Charset.forName("UTF-8").
Question is: what is the best method in Java to read unmodified UTF-8?
I am replacing all calls to writeUTF(s) with write(s.getBytes(charset)) where charset points to Charset.forName("UTF-8")
That isn't exactly equivalent. The equivalent version would be along the lines of:
byte[] bytes = s.getBytes(charset);
out.writeInt(bytes.length); // 4-byte length prefix; a plain write(int) would only emit a single byte
out.write(bytes);
When reading it, you would read the length first, and then read as many bytes as needed and create a new String with the UTF-8 Charset.
You can see an example of that in the original gist https://gist.github.com/treo/f5a346d53f89566b51bf88a9a42c67c7#file-binarywordvectorserializer-java-L90
Just that it reads from a memory mapped channel there.
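The read side with a plain DataInputStream (the gist uses a memory mapped channel instead) could look like this:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class StringReadHelper {
    // Read the length prefix first, then exactly that many UTF-8 bytes.
    static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```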
@treo Thanks, that does make a lot of sense. I wonder though: is your suggestion UTF-8? I am not familiar with encoding details, but writing the length as an integer looks different from the UTF-8 description. Am I right? If so, it seems to me like we could as well stick with modified UTF-8 with regard to standards compatibility.
I'm writing the number of bytes there, not the string length. So if there are any multi-byte characters, it will account for that.
Edit: @carschno UTF-8 only describes the encoding of the individual characters; we have to write a whole string here, so writing how many bytes the following string has is outside of its scope. The problem with modified UTF-8, as written by writeUTF, is that it encodes characters differently from normal UTF-8.
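To illustrate the difference (a small self-contained demo, not part of the serializer):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// writeUTF prepends a 2-byte length and uses "modified UTF-8": U+0000 becomes
// the two bytes 0xC0 0x80, and supplementary characters are written as two
// 3-byte surrogates (6 bytes) instead of 4 bytes, so other languages cannot
// read the result as plain UTF-8.
public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        String s = "a\u0000\uD83D\uDE00"; // 'a', NUL, and an emoji (supplementary character)

        byte[] standard = s.getBytes(StandardCharsets.UTF_8); // 1 + 1 + 4 = 6 bytes

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        new DataOutputStream(buffer).writeUTF(s);
        byte[] modified = buffer.toByteArray(); // 2 (length) + 1 + 2 + 6 = 11 bytes

        System.out.println(standard.length + " vs " + modified.length);
    }
}
```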
Currently, all the word embeddings read by WordEmbeddingsAnnotator (#798) are held in memory. In case memory is not big enough, there should be an option to store them in a Lucene index (or some other document store?) on disk.