oxinabox opened 6 years ago
Here are a couple of my initial thoughts.
I think one feature that any reasonable API will require is a fast way to go from word (to index) to vector. Currently this mapping is implicit, but walking through the word list for every vector lookup would be far too slow. The simplest way to do this would be to just add a `Dict{String, Int}` field to each of the `EmbeddingSystem` subtypes.
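For example, a minimal sketch of building that lookup once at load time (assuming the `embeddings`/`vocab` fields of the current `EmbeddingTable`):

```julia
using Embeddings

embtable = load_embeddings(Word2Vec)

# build the word→index mapping once, up front
word2ind = Dict(word => i for (i, word) in enumerate(embtable.vocab))

embtable.embeddings[:, word2ind["the"]]  # O(1) lookup instead of scanning the vocab
```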
More generally, I think that it would be best to keep the interface minimal and "unopinionated." Embeddings.jl already seems to embrace this. I think that a `Base.get`-like interface would be best for looking up embeddings with a word key. This isn't real code, but I'm imagining something along the lines of:
```julia
julia> w2v = load_embeddings(Word2Vec{:en}, args...)

julia> word_vector(w2v, "the")
300-element Array{Float64,1}:
 0.1
 0.2
 ...

julia> word_vector(w2v, "notfound")
Error: "notfound" not in vocabulary

julia> oov_vector = zeros(size(w2v.embedding_matrix, 1))
300-element Array{Float64,1}:
 0.0
 0.0
 ...

julia> word_vector(w2v, "notfound", oov_vector)
300-element Array{Float64,1}:
 0.0
 0.0
 ...

julia> word_vector(() -> randn(300), w2v, "notfound")
300-element Array{Float64,1}:
 -1.784924003687886
  1.3056642179917306
 ...

# implementing that method for each type would also permit this syntax:
julia> vec = word_vector(w2v, "notfound") do
           # do some fancy calculation
           return vector
       end
```
I believe an approach like this would also work well alongside system-specific needs, like OOV interpolation. Maybe the default behavior for `word_vector(w2v, "notfound")` throws an error for Word2Vec, while `word_vector(ft::FastText, "notfound")` computes a vector and returns it. A one-function interface like this would allow that kind of flexibility, I think. In general, I think we should try to have a common interface for looking up vectors that makes as few assumptions as possible. After all, embeddings are pretty much always a very small component of a larger model, so making Embeddings.jl small and composable is what we should aim for, I think.
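To make the dispatch idea concrete, a rough sketch (the `word2ind` field is hypothetical, as is the `FastText` method stub):

```julia
# default: strict lookup, erroring on OOV, as proposed for Word2Vec-style systems
function word_vector(emb, word::AbstractString)
    haskey(emb.word2ind, word) ||   # `word2ind` is a hypothetical word→index Dict
        error("\"$word\" not in vocabulary")
    return emb.embeddings[:, emb.word2ind[word]]
end

# FastText could override just the OOV case, interpolating from its ngram embeddings:
# word_vector(ft::FastText, word::AbstractString) = ... compute from ngrams ...
```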
Some very scattered thoughts:
What I currently do, when using this in the wild, is to use MLLabelUtils (https://github.com/JuliaML/MLLabelUtils.jl):
```julia
using Embeddings
using MLDataUtils  # re-exports MLLabelUtils

o_embedding_table = load_embeddings(Word2Vec)

# Add an OOV zero vector as an extra column, with a matching "<OOV>" label
const embedding_table = Embeddings.EmbeddingTable(
    [o_embedding_table.embeddings zeros(300)],
    [o_embedding_table.vocab; "<OOV>"]
)

const enc = LabelEnc.NativeLabels("<OOV>", embedding_table.vocab)
to_ind(lbl, enc=enc) = convertlabel(LabelEnc.Indices, lbl, enc)
```
See it in use in https://github.com/oxinabox/AuthorGenderStudy.jl/blob/master/src/proto.ipynb
When working with this in practice, it is often good to convert all your words into integers, because that takes up less space. MLLabelUtils is good for that. And the encoder I showed above converts words to indices, and handles OOV as well (if an OOV label hadn't been provided, it would error).

Two advantages: 1) indexing with Ints can be performed trivially in all systems; 2) it saves memory, since it removes duplicates, and lets you do integer comparisons rather than string comparisons (so it is faster).
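A quick usage sketch of the encoder above, showing both points (the unseen word is made up):

```julia
inds = to_ind.(["the", "cat", "zzyzx"])     # unseen "zzyzx" maps to the "<OOV>" index
vecs = embedding_table.embeddings[:, inds]  # a 300×3 matrix, via plain Int indexing
```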
E.g. on point 1): when working with, say, TensorFlow.jl, you really do want to just stick the embedding matrix into a Tensor (so it can move to the GPU), and index it with Ints. E.g. https://github.com/oxinabox/ColoringNames.jl/blob/abff47ea4db3a3137cb95ae2ee92d11d4a68d028/src/networks/networks_common.jl#L136-L137
But in, e.g., Flux, we might want to be able to use Strings, since it doesn't have any real limitations there. The Matrix would have to be converted to a Tracked structure to be able to fine-tune it in training, but that could be a `Dict{String, TrackedVector}`.
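As a rough illustration of that last idea (a sketch only, assuming the Tracker-based Flux of the time and the `embedding_table` from above):

```julia
using Flux  # Tracker-based Flux, where `param` makes trainable arrays

# one trainable TrackedVector per word
tracked_embs = Dict(word => param(embedding_table.embeddings[:, i])
                    for (i, word) in enumerate(embedding_table.vocab))

tracked_embs["the"]  # a TrackedVector; can be fine-tuned in training
```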
On point 2): InternedStrings.jl solves this also (https://github.com/JuliaString/InternedStrings.jl).
More thoughts
We can do dispatches for both `AbstractString` and index (which would be anything else: mostly `Int`/`Colon`, but potentially `Vector{Bool}` etc.).
And it is probably fine to, if given a string, automatically convert it to a vocab index.
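A sketch of that dispatch (assuming a table type carrying a hypothetical `word2ind` Dict, as above):

```julia
# anything that isn't a string is treated as an index (Int, Colon, Vector{Bool}, ...)
Base.getindex(e::EmbeddingTable, i) = e.embeddings[:, i]

# a string is first converted to its vocab index
Base.getindex(e::EmbeddingTable, w::AbstractString) = e[e.word2ind[w]]
```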
We also have the ability to overload both call syntax and indexing syntax. Not entirely sure if that is useful, but it might be? Like `embs("foo")` could behave differently from `embs["foo"]`.
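For instance (purely a sketch of one possible distinction, building on the `getindex` methods above):

```julia
# indexing stays strict and errors on OOV; call syntax could be the lenient
# version, falling back to a zero vector for unknown words
(e::EmbeddingTable)(w::AbstractString) =
    haskey(e.word2ind, w) ? e[w] : zeros(size(e.embeddings, 1))
```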
Any thoughts on OOV interpolation? I was thinking of StringDistances.jl but it may be too slow.
Regarding the embeddings format, the simplest data structure possible is preferable in my case (i.e. `Dict{AbstractString, AbstractVector}` or the current `EmbeddingTable`), with conversion methods to support the formats required by other packages, e.g. `to_tensorflow_embeddings(embeddings)`, and overloaded indexing syntax (over call syntax).
When I say OOV interpolation, I mean for something like FastText, which is a factored model, and can determine word embeddings using its ngram embeddings.
I don't mean just finding the nearest word according to Levenshtein Distance, and returning that.
OOV interpolation is a problem that needs to be solved in the numerical embedding space, not in the natural language space.
Preprocessing to correct misspellings is beyond the scope of this package.
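Very roughly, the kind of thing meant (a sketch only; `ngram_embeddings` and `ngram_index` are hypothetical names for the model's internals, and byte indexing assumes ASCII):

```julia
using Statistics  # for mean

# FastText-style OOV: average the embeddings of the word's character ngrams
function oov_vector(ft, word::AbstractString; nmin=3, nmax=6)
    padded = "<" * word * ">"  # FastText pads words with boundary markers
    ngrams = [padded[i:i+n-1] for n in nmin:nmax for i in 1:length(padded)-n+1]
    return mean(ft.ngram_embeddings[:, ngram_index(ft, g)] for g in ngrams)
end
```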
Makes sense, thank you. Any plans for including that in Embeddings.jl?
FastText interpolation? Probably, I think it is basically free once #1 is done.
> When working with this in practice, it is often good to convert all your words into integers, because that takes up less space. MLLabelUtils is good for that. And the encoder I showed above converts words to indices, and handles OOV as well (if an OOV label hadn't been provided, it would error).
>
> Two advantages: 1) indexing with Ints can be performed trivially in all systems; 2) it saves memory, since it removes duplicates, and lets you do integer comparisons rather than string comparisons (so it is faster).
Yes, completely agreed on all points. MLDataUtils is definitely great, and does this well. My feeling is that because this word -> int conversion is so fundamental to using word embeddings, it probably makes sense for Embeddings.jl to provide a lightweight way to do it. If this functionality does become part of Embeddings.jl, I'd expect it to work like that.
Regarding using Embeddings.jl alongside TensorFlow.jl or Flux, yes, I think it makes sense to have any API be generic and flexible enough to work well with both.
> But in, e.g., Flux, we might want to be able to use Strings, since it doesn't have any real limitations there. The Matrix would have to be converted to a Tracked structure to be able to fine-tune it in training, but that could be a `Dict{String, TrackedVector}`.
Maybe I miss your point, but I think Flux's `param` function does exactly that. `Flux.param(w2v.embedding_table)` would create an embedding matrix that can easily be indexed with integers (or Flux's `OneHotVector`, which is a little smarter about this sort of thing). But you're right that you'd have to keep around two structures to go from string to id to (tracked) vector.
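For concreteness, a sketch of what that looks like (assuming Tracker-era Flux and the `embedding_table` from earlier in the thread):

```julia
using Flux

W = param(embedding_table.embeddings)           # tracked 300×V matrix, fine-tunable
oh = Flux.onehot("the", embedding_table.vocab)  # OneHotVector over the vocab
v = W * oh                                      # picks out the column for "the"
```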
I guess to put it another way, I'd like to try to come up with a minimal, flexible set of methods that can work easily with e.g. TensorFlow.jl, Flux.jl, and/or other packages like Distances.jl without actually needing to know anything at all about their implementation details. Probably a good way to figure this out would be to actually write a few models that need to do these things and see which parts are easy and which parts are hard, and then look at how Embeddings.jl could make the hard parts easier.
> We can do dispatches for both `AbstractString` and index (which would be anything else: mostly `Int`/`Colon`, but potentially `Vector{Bool}` etc.).
>
> And it is probably fine to, if given a string, automatically convert it to a vocab index.
>
> We also have the ability to overload both call syntax and indexing syntax. Not entirely sure if that is useful, but it might be? Like `embs("foo")` could behave differently from `embs["foo"]`.
I've actually had these ideas too! I'm also not totally sure about it, but I do think I like it.
> Maybe I miss your point, but I think Flux's `param` function does exactly that.
I didn't have much of a point, just putting it out there.
I think, for flexibility, we should expose functions rather than fields: one for getting a Matrix view (possibly `convert(::Type{<:AbstractMatrix}, embtable)`? That might be too clever.) and one for the vocab list. Then if we have different types, they can internally use different representations.
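Something like (the accessor names are hypothetical):

```julia
# function-based accessors, so different table types can use different internals
embedding_matrix(e::EmbeddingTable) = e.embeddings
vocabulary(e::EmbeddingTable) = e.vocab

# the `convert` spelling floated above (possibly too clever):
Base.convert(::Type{M}, e::EmbeddingTable) where {M<:AbstractMatrix} =
    convert(M, e.embeddings)
```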
Shifting discussion from #14.