Closed tgalery closed 1 year ago
This package supports a specific set of files, and none of those are gzipped, so I do not see the utility.
Text files are quite easy to modify. One might also manually download a pre-trained file and want to put it in a Docker image while reducing the footprint a bit. This is our use case: minimizing a Docker image where an uncompressed file is around 100 MB and a compressed one around 30 MB. Gensim handles this under the hood via smart_open, which can read pretty much anything.
For our use case, I think it would work if one could pass in an open function, such as GZip.open (from https://github.com/JuliaIO/GZip.jl):
function _load_embeddings(::Type{<:FastText_Text}, embedding_file, max_vocab_size, keep_words; open=open)
...
open(embedding_file,"r") do fh
...
Disclaimer: I work in the same org as @tgalery.
Ah right, I hadn't considered the use case of using this to load manually modified files.
I wonder if we should instead expose a _load_embeddings(::Type{<:FastText_Text}, embedding_filehandle::IO, max_vocab_size, keep_words) method. The normal one that takes an embedding_file path would then just do open(...) do and redispatch.
Then, if you wanted to, you could open the file with CodecZlib's GzipDecompressorStream and pass the resulting file handle to _load_embeddings yourself.
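A minimal sketch of that path-to-IO redispatch, using only Base. The name load_text_embeddings and the simplified "word v1 v2 ..." line format are assumptions made for illustration. This is not the package's actual _load_embeddings, and it ignores keep_words and any header line.

```julia
# Hypothetical stand-in for the package's text loader (not the real _load_embeddings).
# IO method: parses "word v1 v2 ..." lines from any readable stream.
function load_text_embeddings(io::IO; max_vocab_size::Int=typemax(Int))
    words = String[]
    vectors = Vector{Float32}[]
    for line in eachline(io)
        length(words) >= max_vocab_size && break
        fields = split(line)
        isempty(fields) && continue          # skip blank lines
        push!(words, String(fields[1]))
        push!(vectors, parse.(Float32, fields[2:end]))
    end
    return words, vectors
end

# Path method: open the file and redispatch to the IO method above.
function load_text_embeddings(path::AbstractString; kwargs...)
    open(path, "r") do io
        load_text_embeddings(io; kwargs...)
    end
end
```

With this split, the caller decides how the stream is produced. Any IO works, so a decompressing stream could be passed to the IO method without the loader knowing about compression.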
In principle, reading compressed files could apply to anything stored in text files, so it could be useful for data beyond FastText (say, Word2vec). We'd be happy to expose a _load_embeddings method with the signature you provided; we would probably do some refactoring to avoid code duplication.
Another option would be to infer how to open the file from the file extension, e.g. file_open = endswith(file_path, ".gz") ? GZip.open : open, but that would add an extra dependency to your repo.
Note that @mforets suggested JuliaIO/GZip.jl for the similarity of interface. But we could also use other libs.
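For comparison, here is a rough sketch of the extension-based option using CodecZlib.jl instead of GZip.jl. The helper name with_embedding_stream is hypothetical, and the sketch assumes CodecZlib is installed; it is only meant to show the shape of the idea, not a proposed API.

```julia
using CodecZlib  # third-party package, assumed to be installed

# Hypothetical helper: opens `path`, wraps the handle in a gzip decompressor
# when the name ends in ".gz", and passes the resulting stream to `f`.
function with_embedding_stream(f, path::AbstractString)
    open(path, "r") do io
        stream = endswith(path, ".gz") ? GzipDecompressorStream(io) : io
        try
            f(stream)
        finally
            close(stream)  # also closes the underlying file handle
        end
    end
end
```

A caller would use it as with_embedding_stream(path) do io ... end and hand io to whatever loader accepts an IO. The cost, as noted, is the extra dependency.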
We'd be happy to expose a _load_embeddings method with the signature you provided, we probably would do some refactor to avoid code duplication.
Sounds good
Since most word embedding data is available in text format, and text files compress well, it would be nice to be able to gzip them and read them compressed (this might be useful for reducing Docker image sizes, etc.). If you agree with this, I'm happy to send a PR.