Support for Gzipped files

JuliaText / Embeddings.jl

Functions and data dependencies for loading various word embeddings (Word2Vec, FastText, GLoVE)

MIT License

Support for Gzipped files #45

Closed tgalery closed 1 year ago

tgalery commented 1 year ago

Since most word embedding data is distributed as text files, which compress well, it would be nice to be able to gzip them and read them in compressed form (this might be useful for reducing Docker image sizes, etc.). If you agree with this, I'm happy to send a PR.

oxinabox commented 1 year ago

This package supports a specific set of files, and none of those are gzipped. So I do not see the utility.

tgalery commented 1 year ago

Text files are quite easy to modify, and one may also manually download a pre-trained file and want to put it in a Docker image while reducing the footprint a bit. This is our use case: minimizing a Docker image where an uncompressed file is around 100 MB and a compressed one is around 30 MB. Gensim handles this under the hood via smart_open, which can read pretty much anything.

mforets commented 1 year ago

For our use case I think it would work if one could pass in an open function, such as GZip.open (from https://github.com/JuliaIO/GZip.jl), as a keyword argument:

function _load_embeddings(::Type{<:FastText_Text}, embedding_file, max_vocab_size, keep_words; open=open)
...
    open(embedding_file,"r") do fh
...

Disclaimer: I work in the same org as @tgalery.
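The keyword-argument idea above can be sketched with a toy loader. This is only an illustration: load_vectors and its opener keyword are hypothetical names, not part of Embeddings.jl, and the parser assumes a FastText-style text layout (one header line, then a word followed by its numbers on each line). The keyword is called opener here, rather than open, only to avoid shadowing Base.open inside the function.

```julia
# Hypothetical toy loader, NOT the package's _load_embeddings.
# `opener` must accept the same call shape as Base.open: opener(f, path, mode).
function load_vectors(path::AbstractString; opener=open)
    vocab = String[]
    vectors = Vector{Float32}[]
    opener(path, "r") do fh
        readline(fh)  # skip the "vocab_size dimension" header line
        for line in eachline(fh)
            parts = split(line)
            isempty(parts) && continue
            push!(vocab, String(parts[1]))
            push!(vectors, parse.(Float32, parts[2:end]))
        end
    end
    return vocab, vectors
end

# Demo with an uncompressed temporary file and the default opener.
path, io = mktemp()
write(io, "2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
close(io)
vocab, vectors = load_vectors(path)
```

If GZip.jl is installed, passing opener=GZip.open for a .gz file would presumably work the same way, on the assumption that GZip.open accepts the same do-block call shape as Base.open.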

oxinabox commented 1 year ago

Ah right, I hadn't considered the use case of using this to load manually modified files.

I wonder if instead we should expose a _load_embeddings(::Type{<:FastText_Text}, embedding_filehandle::IO, max_vocab_size, keep_words) method. The normal one, which takes an embedding_file path, would then just do open(...) do and redispatch. Then, if you wanted to, you could open the file with CodecZlib's GzipDecompressorStream and pass the resulting file handle to _load_embeddings yourself.
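A minimal sketch of that redispatch pattern, using hypothetical names (read_vectors is not the package's function) and a simplified text format with one header line followed by word-and-numbers lines. The IO method does the parsing, and the path method only opens the file and forwards the handle:

```julia
# Hypothetical sketch of "IO method + path method that redispatches".
# Parsing lives in the method that takes an already-open IO.
function read_vectors(io::IO, max_vocab_size::Integer)
    vocab = String[]
    vectors = Vector{Float32}[]
    readline(io)  # skip the "vocab_size dimension" header line
    for line in eachline(io)
        length(vocab) >= max_vocab_size && break
        parts = split(line)
        isempty(parts) && continue
        push!(vocab, String(parts[1]))
        push!(vectors, parse.(Float32, parts[2:end]))
    end
    return vocab, vectors
end

# The path method opens the file and redispatches to the IO method.
read_vectors(path::AbstractString, max_vocab_size::Integer) =
    open(io -> read_vectors(io, max_vocab_size), path, "r")

# Any IO works, for example an in-memory buffer.
buf = IOBuffer("3 2\na 1 2\nb 3 4\nc 5 6\n")
vocab_io, vectors_io = read_vectors(buf, 2)

# The path method gives the same result for a file on disk.
tmp_path, tmp_io = mktemp()
write(tmp_io, "3 2\na 1 2\nb 3 4\nc 5 6\n")
close(tmp_io)
vocab_file, vectors_file = read_vectors(tmp_path, 2)
```

With this split, a caller who has CodecZlib loaded could wrap an open file in GzipDecompressorStream and pass it to the IO method, so the loader itself would not need a compression dependency.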

tgalery commented 1 year ago

In principle, reading compressed files could be supported for anything that is stored as text, so it would apply to data beyond FastText (say, Word2Vec). We'd be happy to expose a _load_embeddings method with the signature you provided, we probably would do some refactor to avoid code duplication.

Another option would be to infer how to open the file from its extension, e.g. file_open = endswith(file_path, ".gz") ? GZip.open : open, but that would add an extra dependency to your repo.

Note that @mforets suggested JuliaIO/GZip.jl because of its similar interface, but we could also use other libraries.
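The extension-based option can be sketched without committing to a particular library by having the caller supply the gzip-aware opener. The helper name choose_opener and the gz_open keyword are hypothetical. In practice gz_open would be something like GZip.open; here a stand-in function is used so the example runs without extra packages.

```julia
# Hypothetical helper: pick an opener from the file extension.
# `gz_open` is supplied by the caller (e.g. GZip.open), so this code
# itself carries no compression dependency.
choose_opener(file_path::AbstractString; gz_open) =
    endswith(file_path, ".gz") ? gz_open : open

# Stand-in for a real gzip opener, used only for this demonstration.
fake_gz_open(args...) = error("gzip support is not loaded in this sketch")

plain_choice = choose_opener("vectors.vec"; gz_open=fake_gz_open)
gz_choice = choose_opener("vectors.vec.gz"; gz_open=fake_gz_open)
```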

oxinabox commented 1 year ago

> We'd be happy to expose a _load_embeddings method with the signature you provided, we probably would do some refactor to avoid code duplication.

Sounds good