finalfusion / finalfusion-python

Finalfusion embeddings in Python
https://finalfusion.github.io/
Other
9 stars 4 forks source link

Add convenience methods etc. #5

Closed sebpuetz closed 5 years ago

sebpuetz commented 5 years ago

Not sure if there's more, but that's what I remember:

danieldk commented 5 years ago

numpy support proposed in #7

danieldk commented 5 years ago

The problem with getting the full embedding matrix is that it entails a copy. We cannot easily make the Rust matrices available due to the borrows checker --- when an Embeddings instance is not used anymore, the matrix is deallocated and any views would be dangling.

sebpuetz commented 5 years ago

What about a to_numpy_matrix() method and explicitly mention in the docstring that it means copying the full matrix?

When suggesting to make the vocab accessible in Python, I was under the impression that we put the token frequency into the finalfusion vocabs, too. That's not the case, so without that being present, I don't see that much of a reason to provide access to the vocab. Perhaps, giving non-iterator access to the list of in-vocab tokens would be nice, although I'm unsure how to implement this in pyo3 without cloning.

At least, I didn't know how to make the compiler happy about that simple method (I added a .vocab() method on EmbeddingsWrap):

    /// Embeddings metadata.
    fn words(&self) -> &[String] {
        self.embeddings.borrow().vocab().words()
    }
danieldk commented 5 years ago

What about a to_numpy_matrix() method and explicitly mention in the docstring that it means copying the full matrix?

Something like to_numpy_matrix() sounds good, perhaps with different naming, because it suggests that other functions not support numpy arrays. Maybe copy_matrix or matrix_copy?

At least, I didn't know how to make the compiler happy about that simple method

That would not be safe, but you can do it in a safe manner. You should create another struct that contains an Rc of the embeddings and then implement __get_item__ and __len__.

sebpuetz commented 5 years ago

Something like to_numpy_matrix() sounds good, perhaps with different naming, because it suggests that other functions not support numpy arrays. Maybe copy_matrix or matrix_copy?

Sure, also I think the to_* convention is more of a Rust than a Python thing.

That would not be safe, but you can do it in a safe manner. You should create another struct that contains an Rc of the embeddings and then implement __get_item and len__.

I have a hard time to understand what's actually happening with the Python interface. Inside Rust I feel like I have a fairly good understanding of what's not possible for what reasons, but here that's not all that intuitive to me.

When we load Embeddings through the Python module into memory, the Embeddings struct persists in memory until the Python GC decides to get rid of it (no more references left?). My method from above would have provided a reference to the Vocab owned by the Embeddings struct, if all references to the Embeddings within Python are gone, the Vocab is deallocated together with the Embeddings but the pointers to the Vocab would remain and thus be invalid?

To circumvent this, you are suggesting a struct like this:

/// finalfusion vocabulary.
#[pyclass(name=Vocab)]
struct PyVocab {
    embeddings: Rc<PyEmbeddings>
}

and implement __len__ and __get_item__ on it?

sebpuetz commented 5 years ago
sebpuetz commented 5 years ago

Wondering whether this applies to class methods, too. I think at this point the signatures are not available in the python module.