Adding text vectorization

AtheMathmo / rusty-machine

Machine Learning library for Rust

https://crates.io/crates/rusty-machine/

MIT License


Adding text vectorization #177

Open ghost opened 7 years ago

ghost commented 7 years ago

Hey guys and ladies!

I was wondering (and I'm offering to work on this a bit myself) whether you'd consider it appropriate to add some text vectorization to rusty-machine, based on sklearn's current features:

Simple frequency count
TF-IDF
Hashing techniques (frequency count + hashing trick)

It'd be pretty cool to add some examples of sentiment analysis or something like that using rusty-machine only :P
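For readers unfamiliar with the three techniques listed above, here is a minimal sketch in plain Rust using only the standard library. None of these function names come from rusty-machine or sklearn; they are made up for illustration. The tokenizer is a naive whitespace split, and the TF-IDF weighting is the plain idf = ln(N / df) form, whereas sklearn's default uses a smoothed variant.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Naive tokenizer (assumption): split on whitespace and lowercase.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Map each distinct token to a column index, in order of first appearance.
fn build_vocabulary(docs: &[&str]) -> HashMap<String, usize> {
    let mut vocab = HashMap::new();
    for &doc in docs {
        for token in tokenize(doc) {
            let next = vocab.len();
            vocab.entry(token).or_insert(next);
        }
    }
    vocab
}

/// Frequency count: one row per document, one column per vocabulary term.
fn count_vectorize(docs: &[&str], vocab: &HashMap<String, usize>) -> Vec<Vec<f64>> {
    docs.iter()
        .map(|&doc| {
            let mut row = vec![0.0; vocab.len()];
            for token in tokenize(doc) {
                // Tokens outside the vocabulary are ignored.
                if let Some(&col) = vocab.get(&token) {
                    row[col] += 1.0;
                }
            }
            row
        })
        .collect()
}

/// TF-IDF: scale each count by idf = ln(N / df), with no smoothing.
fn tfidf(counts: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n_docs = counts.len() as f64;
    let n_terms = counts.first().map_or(0, |row| row.len());
    let idf: Vec<f64> = (0..n_terms)
        .map(|j| {
            // Document frequency: number of documents containing term j.
            let df = counts.iter().filter(|row| row[j] > 0.0).count() as f64;
            if df > 0.0 { (n_docs / df).ln() } else { 0.0 }
        })
        .collect();
    counts
        .iter()
        .map(|row| row.iter().zip(&idf).map(|(tf, w)| tf * w).collect())
        .collect()
}

/// Hashing trick: no vocabulary; each token is hashed into one of
/// `n_features` buckets, so distinct tokens may collide.
fn hash_vectorize(doc: &str, n_features: usize) -> Vec<f64> {
    assert!(n_features > 0, "n_features must be positive");
    let mut row = vec![0.0; n_features];
    for token in tokenize(doc) {
        let mut hasher = DefaultHasher::new();
        token.hash(&mut hasher);
        row[(hasher.finish() % n_features as u64) as usize] += 1.0;
    }
    row
}
```

The hashing variant trades exactness for a fixed output width and no fitting step. `DefaultHasher` is used here only for brevity: its algorithm is not guaranteed to stay the same across Rust releases, so a real implementation would probably pick an explicitly specified hash function.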

tafia commented 7 years ago

I'd personally love that!

ghost commented 7 years ago

I was thinking of using the Transformer trait. However, it's not appropriate because it requires the input and output to be of the same type.

AtheMathmo commented 7 years ago

I agree that this is a really great idea!

It seems an unfortunate restriction that you cannot use the Transformer trait. I think that it might be worth changing the trait to allow different input and output types. Do either of you see any reason why this might cause issues? It would be a fairly minor breaking change (for users who have implemented the trait themselves).
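To make the proposed change concrete, here is a rough sketch of a transformer trait whose input and output types are independent. `GenericTransformer`, `MeanCenter`, and `TokenCounter` are hypothetical names, the `String` error type is a placeholder, and this is not rusty-machine's actual trait definition. It only illustrates that an implementor with matching types would keep working by using the same type for both parameters.

```rust
/// Hypothetical trait (not rusty-machine's actual definition):
/// `I` is the input type and `O` the output type, chosen independently.
trait GenericTransformer<I, O> {
    fn fit(&mut self, inputs: &I) -> Result<(), String>;
    fn transform(&mut self, inputs: I) -> Result<O, String>;
}

/// Same-type case: subtracts the mean learned during `fit`.
struct MeanCenter {
    mean: Option<f64>,
}

impl GenericTransformer<Vec<f64>, Vec<f64>> for MeanCenter {
    fn fit(&mut self, inputs: &Vec<f64>) -> Result<(), String> {
        if inputs.is_empty() {
            return Err("cannot fit on empty input".to_string());
        }
        self.mean = Some(inputs.iter().sum::<f64>() / inputs.len() as f64);
        Ok(())
    }

    fn transform(&mut self, inputs: Vec<f64>) -> Result<Vec<f64>, String> {
        let mean = self.mean.ok_or_else(|| "transformer not fitted".to_string())?;
        Ok(inputs.into_iter().map(|x| x - mean).collect())
    }
}

/// Different-type case: documents in, one token count per document out.
struct TokenCounter;

impl GenericTransformer<Vec<String>, Vec<usize>> for TokenCounter {
    fn fit(&mut self, _inputs: &Vec<String>) -> Result<(), String> {
        Ok(()) // Nothing to learn for a plain token count.
    }

    fn transform(&mut self, inputs: Vec<String>) -> Result<Vec<usize>, String> {
        Ok(inputs
            .iter()
            .map(|doc| doc.split_whitespace().count())
            .collect())
    }
}
```

Under this shape, the breaking change for existing implementors would amount to writing the type twice, e.g. `GenericTransformer<Vec<f64>, Vec<f64>>`. An associated output type would be another way to express the same idea.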

ghost commented 7 years ago

I've just implemented a Vectorizer trait that is pretty similar to Transformer; it could be used as a base for non-text stuff, like images or nested data, for example. Here is a little proof of concept:

https://github.com/z1mvader/rusty-machine/blob/master/src/data/vectorizers/text.rs

But if @AtheMathmo wants, we could just modify the Transformer trait.

ghost commented 7 years ago

Besides the Transformer trait, I believe there are two main needs for the text vectorization workflow. First, being able to set your own tokenizer. Second, support for sparse matrices/vectors. I don't know if rusty-machine supports sparse matrices right now.
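A small sketch of how those two needs could fit together, using only the standard library. `SparseCountVectorizer` and `SparseRow` are hypothetical names, and the sparse row here is just a hand-rolled list of (column, count) pairs. It makes no assumption about what rusty-machine supports.

```rust
use std::collections::{BTreeMap, HashMap};

/// A sparse row: (column index, count) pairs sorted by column, zeros omitted.
type SparseRow = Vec<(usize, f64)>;

/// Hypothetical count vectorizer that is generic over the tokenizer `F`.
struct SparseCountVectorizer<F>
where
    F: Fn(&str) -> Vec<String>,
{
    tokenizer: F,
    vocabulary: HashMap<String, usize>,
}

impl<F> SparseCountVectorizer<F>
where
    F: Fn(&str) -> Vec<String>,
{
    /// The caller supplies the tokenizer as any function or closure.
    fn new(tokenizer: F) -> Self {
        SparseCountVectorizer {
            tokenizer,
            vocabulary: HashMap::new(),
        }
    }

    /// Learn the vocabulary: each new token gets the next column index.
    fn fit(&mut self, docs: &[&str]) {
        for &doc in docs {
            for token in (self.tokenizer)(doc) {
                let next = self.vocabulary.len();
                self.vocabulary.entry(token).or_insert(next);
            }
        }
    }

    /// Count known tokens; tokens not seen during `fit` are ignored.
    fn transform(&self, doc: &str) -> SparseRow {
        // BTreeMap keeps the columns sorted, so the row comes out ordered.
        let mut counts: BTreeMap<usize, f64> = BTreeMap::new();
        for token in (self.tokenizer)(doc) {
            if let Some(&col) = self.vocabulary.get(&token) {
                *counts.entry(col).or_insert(0.0) += 1.0;
            }
        }
        counts.into_iter().collect()
    }
}
```

A comma-separated tokenizer, for example, could be plugged in as `SparseCountVectorizer::new(|s: &str| s.split(',').map(|t| t.trim().to_lowercase()).collect::<Vec<String>>())`. Storing only the non-zero entries matters because text vocabularies are large and each document touches only a small fraction of the columns.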