Open ghost opened 7 years ago
I'd personally love that!
I was thinking of using the Transformer trait. However, it is not appropriate because it requires the input and output to be of the same type.
I agree that this is a really great idea!
It seems an unfortunate restriction that you cannot use the Transformer trait. I think that it might be worth changing the trait to allow different input and output types. Do either of you see any reason why this might cause issues? It would be a fairly minor breaking change (for users who have implemented the trait themselves).
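To make the proposal concrete, here is a minimal sketch of what a transformer trait with separate input and output types could look like. This is not rusty-machine's actual trait definition: the `Out = In` default parameter, the `String` error type, and the `Doubler` and `LengthVectorizer` types are all assumptions made up for illustration.

```rust
// Hypothetical sketch, NOT rusty-machine's real `Transformer` definition.
// `Out` defaults to `In`, so same-type transformers can omit it.
pub trait Transformer<In, Out = In> {
    /// Transform the inputs. The error type is a plain `String` here
    /// only to keep the sketch self-contained (an assumption).
    fn transform(&mut self, inputs: In) -> Result<Out, String>;
}

/// Same-type case: relies on the default `Out = In`.
pub struct Doubler;

impl Transformer<Vec<f64>> for Doubler {
    fn transform(&mut self, inputs: Vec<f64>) -> Result<Vec<f64>, String> {
        Ok(inputs.into_iter().map(|x| x * 2.0).collect())
    }
}

/// Different-type case: text in, numbers out (character count per document).
pub struct LengthVectorizer;

impl Transformer<Vec<String>, Vec<f64>> for LengthVectorizer {
    fn transform(&mut self, inputs: Vec<String>) -> Result<Vec<f64>, String> {
        Ok(inputs.iter().map(|s| s.chars().count() as f64).collect())
    }
}
```

With a default like this, an implementation written as `impl Transformer<Vec<f64>> for Doubler` needs no second type argument. Whether existing implementations would compile unchanged depends on how the real trait is declared, so the breaking-change concern above still applies.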
I've just implemented a Vectorizer trait that is pretty similar to Transformer. It could be used as a base for non-text data, such as images or nested data. Here is a little proof of concept:
https://github.com/z1mvader/rusty-machine/blob/master/src/data/vectorizers/text.rs
But if @AtheMathmo wants, we could just modify the Transformer trait.
Besides the Transformer trait, I believe that there are two main needs for the text vectorization workflow. First, being able to set your own tokenizer. Second, support for sparse matrices/vectors. I don't know if rusty-machine supports sparse matrices right now.
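As a rough illustration of those two needs, here is a sketch of a count vectorizer that takes a user-supplied tokenizer and returns sparse rows. Everything in it is hypothetical: `CountVectorizer`, `SparseRow`, and `whitespace_tokenizer` are names made up for this example, it uses only the standard library, and it is not the linked proof of concept or any existing rusty-machine API.

```rust
use std::collections::{BTreeMap, HashMap};

/// One sparse row: (column index, count) pairs sorted by column index.
/// A stand-in for a real sparse matrix type (an assumption).
pub type SparseRow = Vec<(usize, f64)>;

/// Hypothetical count vectorizer, generic over the tokenizer.
pub struct CountVectorizer<F: Fn(&str) -> Vec<String>> {
    tokenizer: F,
    vocabulary: HashMap<String, usize>,
}

impl<F: Fn(&str) -> Vec<String>> CountVectorizer<F> {
    pub fn new(tokenizer: F) -> Self {
        CountVectorizer { tokenizer, vocabulary: HashMap::new() }
    }

    /// Give each unseen token the next free column index,
    /// in order of first appearance.
    pub fn fit(&mut self, docs: &[&str]) {
        for doc in docs {
            for token in (self.tokenizer)(*doc) {
                let next = self.vocabulary.len();
                self.vocabulary.entry(token).or_insert(next);
            }
        }
    }

    /// Count known tokens per document. Tokens that are not in the
    /// vocabulary are ignored, and only non-zero counts are stored.
    pub fn transform(&self, docs: &[&str]) -> Vec<SparseRow> {
        docs.iter()
            .map(|doc| {
                let mut counts: BTreeMap<usize, f64> = BTreeMap::new();
                for token in (self.tokenizer)(*doc) {
                    if let Some(&col) = self.vocabulary.get(&token) {
                        *counts.entry(col).or_insert(0.0) += 1.0;
                    }
                }
                counts.into_iter().collect::<SparseRow>()
            })
            .collect()
    }
}

/// Example tokenizer: lowercase, then split on whitespace.
pub fn whitespace_tokenizer(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}
```

Constructing it as `CountVectorizer::new(whitespace_tokenizer)` uses the example tokenizer, and any other function or closure of type `Fn(&str) -> Vec<String>` can be passed instead. Storing only the non-zero counts keeps each row proportional to the number of distinct known tokens in the document rather than to the vocabulary size.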
Hey guys and ladies!
I was wondering (and I'm offering to work a little bit on this myself) whether you would consider it appropriate to add some text vectorization to rusty-machine based on sklearn's current features: