Closed sebpuetz closed 5 years ago
numpy support proposed in #7
The problem with getting the full embedding matrix is that it entails a copy. We cannot easily make the Rust matrices available due to the borrows checker --- when an Embeddings
instance is not used anymore, the matrix is deallocated and any views would be dangling.
What about a to_numpy_matrix()
method and explicitly mention in the docstring that it means copying the full matrix?
When suggesting to make the vocab accessible in Python, I was under the impression that we put the token frequency into the finalfusion
vocabs, too. That's not the case, so without that being present, I don't see that much of a reason to provide access to the vocab. Perhaps, giving non-iterator access to the list of in-vocab tokens would be nice, although I'm unsure how to implement this in pyo3
without cloning.
At least, I didn't know how to make the compiler happy about that simple method (I added a .vocab()
method on EmbeddingsWrap
):
/// Embeddings metadata.
fn words(&self) -> &[String] {
self.embeddings.borrow().vocab().words()
}
What about a to_numpy_matrix() method and explicitly mention in the docstring that it means copying the full matrix?
Something like to_numpy_matrix()
sounds good, perhaps with different naming, because it suggests that other functions not support numpy arrays. Maybe copy_matrix
or matrix_copy
?
At least, I didn't know how to make the compiler happy about that simple method
That would not be safe, but you can do it in a safe manner. You should create another struct that contains an Rc
of the embeddings and then implement __get_item__
and __len__
.
Something like to_numpy_matrix() sounds good, perhaps with different naming, because it suggests that other functions not support numpy arrays. Maybe copy_matrix or matrix_copy?
Sure, also I think the to_*
convention is more of a Rust than a Python thing.
That would not be safe, but you can do it in a safe manner. You should create another struct that contains an Rc of the embeddings and then implement __get_item and len__.
I have a hard time to understand what's actually happening with the Python interface. Inside Rust I feel like I have a fairly good understanding of what's not possible for what reasons, but here that's not all that intuitive to me.
When we load Embeddings
through the Python module into memory, the Embeddings
struct persists in memory until the Python GC decides to get rid of it (no more references left?). My method from above would have provided a reference to the Vocab
owned by the Embeddings
struct, if all references to the Embeddings
within Python are gone, the Vocab
is deallocated together with the Embeddings
but the pointers to the Vocab
would remain and thus be invalid?
To circumvent this, you are suggesting a struct like this:
/// finalfusion vocabulary.
#[pyclass(name=Vocab)]
struct PyVocab {
embeddings: Rc<PyEmbeddings>
}
and implement __len__
and __get_item__
on it?
Not sure if there's more, but that's what I remember: