JuliaText / Embeddings.jl

Functions and data dependencies for loading various word embeddings (Word2Vec, FastText, GloVe)
MIT License

A more featureful API #16

Open oxinabox opened 6 years ago

oxinabox commented 6 years ago

Shifting discussion from #14

dellison commented 6 years ago

Here are a couple of my initial thoughts.

I think one feature that any reasonable API will require is a fast way to go from word (to index) to vector. Currently this mapping is implicit, but walking through the word list for every vector lookup would be far too slow. The simplest way to do this would be to just add a Dict{String, Int} field to each of the EmbeddingSystem subtypes.
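A minimal sketch of that idea, assuming a toy `IndexedTable` type (hypothetical, not the actual Embeddings.jl struct) that stores one embedding per column and builds the word-to-column `Dict` once at construction:

```julia
# Hypothetical stand-in for an embedding table; not the Embeddings.jl type.
struct IndexedTable
    embeddings::Matrix{Float64}   # one column per word
    vocab::Vector{String}
    index::Dict{String,Int}       # word => column number
end

# Build the word => column lookup once, when the table is created.
IndexedTable(embeddings, vocab) =
    IndexedTable(embeddings, vocab, Dict(w => i for (i, w) in enumerate(vocab)))

table = IndexedTable([0.1 0.3; 0.2 0.4], ["the", "cat"])
col = table.index["cat"]          # hash lookup instead of scanning `vocab`
v = table.embeddings[:, col]
```

The cost is one extra `Dict` per loaded table, in exchange for not walking the word list on every lookup.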

More generally, I think that it would be best to keep the interface minimal and "unopinionated." Embeddings.jl already seems to embrace this. I think that a Base.get-like interface would be best for looking up embeddings with a word key. This isn't real code, but I'm imagining something along the lines of:

julia> w2v = load_embeddings(Word2Vec{:en}, args...)

julia> word_vector(w2v, "the")
300-element Array{Float64,1}:
 0.1  
 0.2
...

julia> word_vector(w2v, "notfound")
Error: "notfound" not in vocabulary

julia> oov_vector = zeros(size(w2v.embedding_matrix, 1))
300-element Array{Float64,1}:
 0.0  
 0.0

julia> word_vector(w2v, "notfound", oov_vector)
300-element Array{Float64,1}:
 0.0  
 0.0
...

julia> word_vector(() -> randn(300), w2v, "notfound")
300-element Array{Float64,1}:
 -1.784924003687886   
  1.3056642179917306  

# implementing that method for each type would also permit this syntax:
julia> vec = word_vector(w2v, "notfound") do
          # do some fancy calculation
          return vector
      end

I believe an approach like this would work well alongside system-specific needs such as OOV interpolation. Maybe the default behavior of word_vector(w2v, "notfound") is to throw an error for Word2Vec, while word_vector(ft::FastText, "notfound") computes a vector and returns it. A one-function interface like this would allow this kind of flexibility. In general, I think we should try to have a common interface for looking up vectors that makes as few assumptions as possible. After all, embeddings are pretty much always a very small component of a larger model, so we should aim to keep Embeddings.jl small and composable.
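The three lookup forms above could be sketched roughly as follows. `ToyEmbeddings` and its fields are assumptions made for illustration. Only the shape of the methods, which mirrors `get(dict, key, default)` and `get(f, dict, key)` from Base, is the point:

```julia
# Hypothetical container: one embedding per column plus a word => column lookup.
struct ToyEmbeddings
    embeddings::Matrix{Float64}
    index::Dict{String,Int}
end

# Strict lookup: throws a KeyError for out-of-vocabulary words.
word_vector(e::ToyEmbeddings, word::AbstractString) = e.embeddings[:, e.index[word]]

# Lookup with a fallback value, mirroring get(dict, key, default).
function word_vector(e::ToyEmbeddings, word::AbstractString, default)
    i = get(e.index, word, 0)
    return i == 0 ? default : e.embeddings[:, i]
end

# Lookup with a fallback function, mirroring get(f, dict, key).
# Putting the function first is what enables the do-block syntax.
function word_vector(f::Function, e::ToyEmbeddings, word::AbstractString)
    i = get(e.index, word, 0)
    return i == 0 ? f() : e.embeddings[:, i]
end

toy = ToyEmbeddings([0.1 0.3; 0.2 0.4], Dict("the" => 1, "cat" => 2))
found = word_vector(toy, "cat")
with_default = word_vector(toy, "notfound", zeros(2))
with_block = word_vector(toy, "notfound") do
    fill(-1.0, 2)   # stand-in for "some fancy calculation"
end
```

A FastText-like type could then define only the strict two-argument method differently (computing a vector instead of throwing) without touching the other two.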

oxinabox commented 6 years ago

Some very scattered thoughts:

What I currently do, when using this in the wild, is to use MLLabelUtils https://github.com/JuliaML/MLLabelUtils.jl

using Embeddings
using MLDataUtils

o_embedding_table = load_embeddings(Word2Vec) 

# Add OOV zero vector
const embedding_table = Embeddings.EmbeddingTable(
    [o_embedding_table.embeddings zeros(300)],
    [o_embedding_table.vocab; "<OOV>"]
)

const enc = LabelEnc.NativeLabels("<OOV>", embedding_table.vocab)
to_ind(lbl, enc=enc) = convertlabel(LabelEnc.Indices, lbl, enc)

See https://github.com/oxinabox/AuthorGenderStudy.jl/blob/master/src/proto.ipynb

When working with this in practice, it is often good to convert all your words into integers, because that takes up less space. So MLLabelUtils is good for that. And the encoder I showed above converts words to indexes, and handles OOV as well (if an OOV label wasn't provided, it would error).

Two advantages. Point 1): indexing with Ints can be performed trivially in all systems. Point 2): it saves memory, since it removes duplicates, and it lets you do integer comparisons rather than string comparisons (so it is faster).
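For readers without MLLabelUtils at hand, the same word-to-index encoding with an OOV fallback can be sketched in plain Julia. The vocabulary and the `to_index` helper here are made up for illustration and are not part of any package:

```julia
# Toy vocabulary whose last entry is the OOV label.
vocab = ["the", "cat", "sat", "<OOV>"]
word_to_index = Dict(w => i for (i, w) in enumerate(vocab))
oov_index = word_to_index["<OOV>"]

# Unknown words map to the OOV index instead of raising an error.
to_index(word) = get(word_to_index, word, oov_index)

sentence = split("the cat sat on the mat")
indices = [to_index(w) for w in sentence]
```

Each token is now a single `Int`, and repeated words share one index rather than each carrying its own string.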

E.g. on point 1): when working with, say, TensorFlow.jl, you really do want to just stick the embedding matrix into a Tensor (so it can move to the GPU) and index it with Ints. E.g. https://github.com/oxinabox/ColoringNames.jl/blob/abff47ea4db3a3137cb95ae2ee92d11d4a68d028/src/networks/networks_common.jl#L136-L137

But in, e.g., Flux, we might want to be able to use Strings, since it doesn't have any real limitations. The Matrix would have to be converted to a Tracked structure to be able to fine-tune it in training, though. But that could be a Dict{String, TrackedVector}.

On point 2. InternedStrings.jl solves this also. (https://github.com/JuliaString/InternedStrings.jl)


More thoughts

We can do dispatches for both AbstractString and index (which would be anything else. Mostly Int / Colon but potentially Vector{Bool} etc).

And it is probably fine to, if given a string, automatically convert it to a vocab index.

We also have the ability to overload both call syntax and indexing syntax. Not entirely sure if it is useful, but it might be? Like embs("foo") could be different from embs["foo"].
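A rough sketch of both ideas, using a hypothetical `ToyTable` type. What call syntax should actually mean is left open above, so returning the vocabulary index here is purely an assumption for illustration:

```julia
# Hypothetical table type; field names are illustrative only.
struct ToyTable
    embeddings::Matrix{Float64}   # one column per word
    vocab::Vector{String}
    index::Dict{String,Int}
end

# Anything that is not a string is treated as a column index
# (Int, Colon, ranges, Vector{Bool} masks, ...).
Base.getindex(t::ToyTable, i) = t.embeddings[:, i]

# A string is first converted to its vocabulary index, then looked up.
Base.getindex(t::ToyTable, word::AbstractString) = t[t.index[word]]

# Call syntax can carry a different meaning; here it returns the word's index.
(t::ToyTable)(word::AbstractString) = t.index[word]

embs = ToyTable([0.1 0.3; 0.2 0.4], ["the", "cat"], Dict("the" => 1, "cat" => 2))
by_word = embs["cat"]
by_int = embs[2]
everything = embs[:]
as_index = embs("cat")
```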

zgornel commented 6 years ago

Any thoughts on OOV interpolation? I was thinking of StringDistances.jl but it may be too slow.

Regarding the embeddings format, the simplest possible data structure is preferable in my case (i.e. Dict{AbstractString, AbstractVector} or the current EmbeddingTable), with conversion methods to support the formats required by other packages, e.g. to_tensorflow_embeddings(embeddings), and with overloaded indexing syntax (rather than call syntax).

oxinabox commented 6 years ago

When I say OOV interpolation, I mean for something like FastText, which is a factored model and can determine word embeddings using its n-gram embeddings.

I don't mean just finding the nearest word according to Levenshtein Distance, and returning that.

OOV interpolation is a problem that needs to be solved in the numerical embedding space, not in the natural language space.

Preprocessing to correct misspellings is beyond the scope of this package.
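To make the "factored model" idea concrete, here is a deliberately simplified sketch: split the word into character n-grams and average the vectors of the n-grams that are known. The `ngram_vectors` lookup, the fixed n-gram length, and the plain averaging are all assumptions. Real FastText models use a range of n-gram lengths and a hashed n-gram table, which this does not attempt to reproduce:

```julia
# Character n-grams of a word, padded with '<' and '>' boundary markers.
function char_ngrams(word::AbstractString, n::Int)
    chars = collect("<" * word * ">")
    return [String(chars[i:i+n-1]) for i in 1:length(chars)-n+1]
end

# Average the vectors of the known n-grams.
# `ngram_vectors` is a hypothetical n-gram => vector lookup.
function interpolate_oov(ngram_vectors::Dict{String,Vector{Float64}}, word; n=3)
    known = [ngram_vectors[g] for g in char_ngrams(word, n) if haskey(ngram_vectors, g)]
    isempty(known) && error("no known n-grams for \"$word\"")
    return sum(known) / length(known)
end

ngram_vectors = Dict("<ca" => [1.0, 0.0], "at>" => [0.0, 1.0])
grams = char_ngrams("cat", 3)
oov = interpolate_oov(ngram_vectors, "cat")
```

The interpolation happens entirely in the embedding space: no other vocabulary word is consulted, only sub-word vectors.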

zgornel commented 6 years ago

Makes sense, thank you. Any plans for including that in Embeddings.jl ?

oxinabox commented 6 years ago

FastText interpolation? Probably, I think it is basically free once #1 is done.

dellison commented 6 years ago

> When working with this in practice, it is often good to convert all your words into integers, because that takes up less space. So MLLabelUtils is good for that. And the encoder I showed above converts words to indexes, and handles OOV as well (if an OOV label wasn't provided, it would error).

> Two advantages. Point 1): indexing with Ints can be performed trivially in all systems. Point 2): it saves memory, since it removes duplicates, and it lets you do integer comparisons rather than string comparisons (so it is faster).

Yes, completely agreed on all points. MLDataUtils is definitely great, and does this well. My feeling is that because this word -> int conversion is so fundamental to using word embeddings, it probably makes sense for Embeddings.jl to provide a lightweight way to do it. If this functionality does become part of Embeddings.jl, I'd expect it to work like that.

Regarding using Embeddings.jl alongside TensorFlow.jl or Flux, yes, I think it makes sense to have any API be generic and flexible enough to work well with both.

> But in, e.g., Flux, we might want to be able to use Strings, since it doesn't have any real limitations. The Matrix would have to be converted to a Tracked structure to be able to fine-tune it in training, though. But that could be a Dict{String, TrackedVector}.

Maybe I miss your point, but I think Flux's param function does exactly that. Flux.param(w2v.embedding_table) would create an embedding matrix that can easily be indexed with integers (or with Flux's OneHotVector, which is a little smarter about this sort of thing). But you're right that you'd have to keep around two structures to go from string to id to (tracked) vector.

I guess to put it another way, I'd like to try to come up with a minimal, flexible set of methods that can work easily with e.g. TensorFlow.jl, Flux.jl, and/or other packages like Distances.jl without actually needing to know anything at all about their implementation details. Probably a good way to figure this out would be to actually write a few models that need to do these things, see which parts are easy and which parts are hard, and then look at how Embeddings.jl could make the hard parts easier.

dellison commented 6 years ago

> We can do dispatches for both AbstractString and index (which would be anything else. Mostly Int / Colon but potentially Vector{Bool} etc).
>
> And it is probably fine to, if given a string, automatically convert it to a vocab index.
>
> We also have the ability to overload both call syntax and indexing syntax. Not entirely sure if it is useful, but it might be? Like embs("foo") could be different from embs["foo"].

I've actually had these ideas too! I'm also not totally sure about it, but I do think I like it.

oxinabox commented 6 years ago

> Maybe I miss your point, but I think Flux's param function does exactly that.

I didn't have much of a point, just putting it out there.

I think for flexibility, we should expose functions, rather than fields, for getting a Matrix view and a vocab list. (Possibly convert(::Type{<:AbstractMatrix}, embtable) for the former? That might be too clever.)

Then if we have different types, they can internally use different representations.
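As a sketch of what that could look like, here are two hypothetical types with different internal storage that answer the same two accessor functions. The names `vocabulary` and `embedding_matrix` are placeholders, not a proposal for the final API:

```julia
abstract type AbstractEmbeddings end

# Stores a dense matrix plus a parallel word list.
struct MatrixBacked <: AbstractEmbeddings
    embeddings::Matrix{Float64}
    vocab::Vector{String}
end

# Stores one vector per word in a Dict.
struct DictBacked <: AbstractEmbeddings
    vectors::Dict{String,Vector{Float64}}
end

# Callers use these functions and never touch the fields directly.
vocabulary(e::MatrixBacked) = e.vocab
embedding_matrix(e::MatrixBacked) = e.embeddings

# Sorting gives a deterministic column order, since Dict iteration order is unspecified.
vocabulary(e::DictBacked) = sort!(collect(keys(e.vectors)))
embedding_matrix(e::DictBacked) = reduce(hcat, [e.vectors[w] for w in vocabulary(e)])

m = MatrixBacked([0.3 0.1; 0.4 0.2], ["cat", "the"])
d = DictBacked(Dict("the" => [0.1, 0.2], "cat" => [0.3, 0.4]))
same_vocab = vocabulary(m) == vocabulary(d)
same_matrix = embedding_matrix(m) == embedding_matrix(d)
```

Code written against `vocabulary` and `embedding_matrix` works with either type, so the internal representation can change without breaking callers.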