AtheMathmo / rusty-machine

Machine Learning library for Rust
https://crates.io/crates/rusty-machine/
MIT License

Contributing + LDA PR (maybe) #15

Closed (EntilZha closed this issue 8 years ago)

EntilZha commented 8 years ago

I saw your post on Reddit and have been browsing around crates.io for different packages.

  1. Is there a reason why the linear algebra library isn't split out? Just for simplicity? It seems like there isn't a standard Rust linalg library, so splitting it out might be helpful (nalgebra seems focused on low-dimensional problems).
  2. How does this compare to https://github.com/maciejkula/rustlearn?
  3. What kinds of algorithms are missing that you're looking to add?

Related to the last one, I do/did some research and implementation work on topic modeling with Latent Dirichlet Allocation (LDA). I might be interested in getting to know rusty-machine better by implementing it and contributing. What are good places to get started with the code?

AtheMathmo commented 8 years ago

Hey!

  1. I had initially grouped them together for simplicity when I started out. Now there is no (good) reason why they shouldn't be separated - I just haven't got round to it yet.
  2. Good question! I haven't tried rustlearn myself, but it seems a little more polished in terms of performance - it makes use of sparse matrices and (I believe) some bindings to BLAS and LAPACK. It also adds some support for cross-validation, which is awesome. With rusty-machine I wanted to provide a wide range of out-of-the-box algorithms that are easy for users to tweak and customise. I'm still working on achieving that well, but I'm getting there. I've also tried to provide the tools for people to make their own models and plug them into existing library components (e.g. reuse the existing gradient descent algorithms - there's a rough sketch of that idea after this list).
  3. Anything! I've hit a point now where my focus for a while is going to be cleaning up the existing algorithms. Specifically GLMs need some love (managing convergence conditions and failure cases), and SVMs could use some attention (they are currently using a very simple sub-gradient method). After that I'll turn my attention to playing around with the neural networks a bit more.
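
To illustrate the "plug into existing components" idea from point 2: the general pattern is that a model exposes its cost and gradient through a trait, and any optimizer written against that trait can train it. The sketch below uses made-up names (`CostModel`, `GradDescent`) purely for illustration - these are not rusty-machine's actual trait signatures.

```rust
/// Illustrative trait (not rusty-machine's API): a model reports its cost
/// and gradient at a given parameter vector.
trait CostModel {
    fn cost_grad(&self, params: &[f64]) -> (f64, Vec<f64>);
}

/// Illustrative gradient descent that can train anything implementing `CostModel`.
struct GradDescent {
    step_size: f64,
    iters: usize,
}

impl GradDescent {
    fn optimize<M: CostModel>(&self, model: &M, mut params: Vec<f64>) -> Vec<f64> {
        for _ in 0..self.iters {
            let (_cost, grad) = model.cost_grad(&params);
            // Step each parameter against its gradient.
            for (p, g) in params.iter_mut().zip(&grad) {
                *p -= self.step_size * g;
            }
        }
        params
    }
}
```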

It would be awesome to see LDA added and I'm happy to talk through specifically what you need and where to find it. I'm not super familiar with LDA but I think you should be able to use the UnSupModel trait as a framework for the model. You can check out the k-means model for an example.
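
To make that concrete, here is a very rough sketch of how an LDA model might slot into that trait. The `Lda` struct, its fields, and the inference details are hypothetical placeholders, and the `UnSupModel` shape (a `train` and a `predict` method over matrices) is only assumed to match what k-means implements - worth double-checking against the current source.

```rust
use rusty_machine::linalg::Matrix;
use rusty_machine::learning::UnSupModel;

/// Hypothetical LDA model; fields and inference details are placeholders.
pub struct Lda {
    topic_count: usize,
    /// Learned topic-word distributions, filled in by `train`.
    phi: Option<Matrix<f64>>,
}

impl UnSupModel<Matrix<f64>, Matrix<f64>> for Lda {
    /// Fit topic-word distributions from a document-term count matrix
    /// (rows = documents, columns = vocabulary terms).
    fn train(&mut self, inputs: &Matrix<f64>) {
        // Real implementation: collapsed Gibbs sampling or variational inference.
        self.phi = Some(Matrix::zeros(self.topic_count, inputs.cols()));
    }

    /// Return per-document topic proportions.
    fn predict(&self, inputs: &Matrix<f64>) -> Matrix<f64> {
        // Placeholder: infer document-topic mixtures using the trained `phi`.
        Matrix::zeros(inputs.rows(), self.topic_count)
    }
}
```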

EntilZha commented 8 years ago

Thanks for the answers! After looking at the code, I was also curious whether there is a particular API you are drawing inspiration from (e.g. SKLearn)? Similar question: are there any/many text pre-processing methods, or is that another good area for contribution? Also related, how easy would it be to create infrastructure for ML pipelines (like SKLearn's Pipeline or ML Pipelines in Spark)?

AtheMathmo commented 8 years ago

I didn't directly base the API on any other tools I'd used, although a lot of them definitely provided inspiration. I guess I wanted to provide the simplicity of R (when fitting linear models, for example) with the expressiveness of sklearn. I find sklearn a little hard to use at times because of the overwhelming number of ways to use a model. Of course this isn't necessarily a downside, but it motivated my initial idea of trying to achieve this without a bloated API. As time goes on maybe these goals will change.

I started playing around with text preprocessing but didn't get very far. My plan was to integrate this into rusty-machine (I took a lot of inspiration from pandas for this). I'll probably go back to it at some point. I believe there are some other existing libraries like Cuticula from AutumnAI - I haven't really looked at those properly.

I'm not sure how hard the pipeline stuff would be. I've thought about it a little but before I could consider it seriously I'd have to think about how I'm handling data inputs and outputs to the models. I'm sure it is very possible but it hasn't been a priority for me so far.
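
For what it's worth, the core requirement would be that every pipeline stage agrees on an input/output type so stages can be chained. The following is a purely hypothetical sketch of that shape (the `Transform` and `Pipeline` names are invented here, not part of rusty-machine):

```rust
use rusty_machine::linalg::Matrix;

/// Hypothetical pipeline stage: consumes one matrix and produces another.
trait Transform {
    fn transform(&self, input: Matrix<f64>) -> Matrix<f64>;
}

/// Hypothetical pipeline: applies its stages in order.
struct Pipeline {
    stages: Vec<Box<dyn Transform>>,
}

impl Pipeline {
    fn run(&self, mut data: Matrix<f64>) -> Matrix<f64> {
        for stage in &self.stages {
            data = stage.transform(data);
        }
        data
    }
}
```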

AtheMathmo commented 8 years ago

Hey @EntilZha. Are you still interested in contributing? It's no problem if not, just want to know if I should keep this issue open.

EntilZha commented 8 years ago

I am still interested in contributing, but I have been busy with PhD schoolwork/research for the past couple of weeks. I am fine with keeping this open or closing it; when I have something ready I will submit a PR.

AtheMathmo commented 8 years ago

Awesome! I'll keep the ticket open but I won't keep pestering you - you sound busy. Take your time :)

AtheMathmo commented 8 years ago

Hey @EntilZha. I'm going to close off this ticket, as otherwise I'm worried I'll forget to later! It would be great if you do contribute, but no pressure. If I can help at all, let me know!