AtheMathmo / rusty-machine

Machine Learning library for Rust
https://crates.io/crates/rusty-machine/
MIT License

Latent Dirichlet Allocation #172

Open dyule opened 7 years ago

dyule commented 7 years ago

I've created a basic implementation of linear dirichlet allocation. Unlike the other learning algorithms, it's unsupervised, so the training method is just empty. It doesn't support non-symmetric parameters, but that could fairly easily be added in. I've tried to document everything clearly.

However, I built this with v4 of rulinalg, so this pull request includes #167.

I'm happy to take any feedback on the approach here, and update as necessary.

dyule commented 7 years ago

Just realized I called it linear dirichlet allocation in the title. It's Latent Dirichlet Allocation. I constantly get that backwards, but it's right in the code.

AtheMathmo commented 7 years ago

Thank you for this! I've been hoping to make LDA a part of rusty-machine for a while now :).

Sadly I'm no expert on LDA and I am particularly busy at the moment so I'll need a little time to digest this PR. Can you poke me if I still haven't reviewed this by the end of the week?

dyule commented 7 years ago

If it helps, the only two files that are from LDA and not the rulinalg bump are examples/lda_gen.rs and src/learning/lda.rs. I don't believe I changed anything else of importance.

dyule commented 7 years ago

I've updated the code to be a little more efficient and better documented.

AtheMathmo commented 7 years ago

Thanks for your patience. I finally have some time to look through this and leave some comments. I'm still not too familiar with LDA, but hopefully I can provide some meaningful feedback anyway.

dyule commented 7 years ago

Thanks for taking the time to review this. I very much appreciate all your comments. I've responded to each of them with my thought process, but I will incorporate them into my changes.

For the example: I suppose I missed the listing of examples in the README; I will make sure to include it. I've documented the file pretty extensively since, as you say, it's fairly involved.

The algorithm you've shown does seem to be an improvement, and one that I'd like to incorporate down the road. My reasoning for using Gibbs sampling was that a) it was the original approach, and b) I actually implemented it for my own research, where I'm obligated to use that particular approach. As I say, I'm happy to include it, but doing so will take some time that I may not have at the moment.
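For readers unfamiliar with the approach, the per-token update in collapsed Gibbs sampling for LDA with symmetric priors looks roughly like the sketch below. This is an illustrative outline only, with made-up names, and does not mirror the code in src/learning/lda.rs.

```rust
// Illustrative sketch only: one collapsed Gibbs sweep for LDA with symmetric
// priors. All names (GibbsState, n_dk, etc.) are hypothetical and do not
// correspond to the PR's implementation.

struct GibbsState {
    alpha: f64,                   // symmetric document-topic prior
    beta: f64,                    // symmetric topic-word prior
    n_dk: Vec<Vec<usize>>,        // [doc][topic] token counts
    n_kw: Vec<Vec<usize>>,        // [topic][word] token counts
    n_k: Vec<usize>,              // total tokens assigned to each topic
    assignments: Vec<Vec<usize>>, // current topic of each token: [doc][token]
    docs: Vec<Vec<usize>>,        // word id of each token: [doc][token]
    vocab_size: usize,
}

impl GibbsState {
    /// Resample the topic of every token, conditioned on all other
    /// assignments. `draw_uniform` should return values in [0, 1);
    /// a real implementation would use a proper RNG here.
    fn sweep(&mut self, mut draw_uniform: impl FnMut() -> f64) {
        let num_topics = self.n_k.len();
        for d in 0..self.docs.len() {
            for i in 0..self.docs[d].len() {
                let w = self.docs[d][i];
                let old = self.assignments[d][i];

                // Remove this token's current assignment from the counts.
                self.n_dk[d][old] -= 1;
                self.n_kw[old][w] -= 1;
                self.n_k[old] -= 1;

                // p(z = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                let mut weights = Vec::with_capacity(num_topics);
                let mut total = 0.0;
                for k in 0..num_topics {
                    let p = (self.n_dk[d][k] as f64 + self.alpha)
                        * (self.n_kw[k][w] as f64 + self.beta)
                        / (self.n_k[k] as f64 + self.vocab_size as f64 * self.beta);
                    total += p;
                    weights.push(p);
                }

                // Draw the new topic from the unnormalised weights.
                let mut u = draw_uniform() * total;
                let mut new = num_topics - 1;
                for (k, p) in weights.iter().enumerate() {
                    if u < *p {
                        new = k;
                        break;
                    }
                    u -= p;
                }

                // Add the token back under its new topic.
                self.n_dk[d][new] += 1;
                self.n_kw[new][w] += 1;
                self.n_k[new] += 1;
                self.assignments[d][i] = new;
            }
        }
    }
}
```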

As I mentioned in my inline comments, I misunderstood a few things about the organization of the crate. I plan to move the main code from predict to train and make predict a no-op. However, looking at the paper you linked and the scikit implementation, it seems like predict is very similar to scikit's transform. That is, passing input data into transform results in a distribution of categories over documents. I believe I could do this with the current algorithm, but it will be better once we're using online VB.
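For reference, the per-document distribution that such a transform could return is usually estimated from the Gibbs counts as theta[d][k] = (n_dk + alpha) / (n_d + K * alpha). A minimal sketch with hypothetical names, not the PR's code:

```rust
// Hypothetical sketch: estimate each document's topic distribution from
// collapsed Gibbs counts, with a symmetric prior `alpha`. `n_dk[d][k]` is
// the number of tokens in document d currently assigned to topic k.
fn document_topic_distributions(n_dk: &[Vec<usize>], alpha: f64) -> Vec<Vec<f64>> {
    n_dk.iter()
        .map(|doc_counts| {
            let num_topics = doc_counts.len() as f64;
            let n_d: usize = doc_counts.iter().sum();
            let denom = n_d as f64 + num_topics * alpha;
            doc_counts
                .iter()
                .map(|&count| (count as f64 + alpha) / denom)
                .collect()
        })
        .collect()
}
```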

So, I will take some time to make the suggested changes, leaving the central algorithm intact, and then let you know. When I have time, I'll look into switching to online VB for a faster and more flexible approach.

AtheMathmo commented 7 years ago

I haven't read all your comments, but from your summary I think your approach sounds great. Keep the core algorithm as it is, and we can write up an issue for future work with VB once this has been merged.

it seems like predict is very similar to scikit's transform

Scikit uses both a predict and a transform function depending on the model. See their GMM docs. We also have both in rusty-machine (though note that there have been some changes to this trait on the current master branch). If you think this will provide a better fit you could switch to this trait instead. Right now these traits are a little disorganized but I'm hoping to address that in a future release.

dyule commented 7 years ago

The Transform trait does in fact seem like an excellent fit. I have only one thought. Currently, the parameterization of TransformFitter uses T to indicate a Transformer whereas the parameterization of Transformer uses T to indicate some kind of input type. It's a small thing, but it did cause me some confusion when first reading the code.

Also, Transformer currently requires the input and output of the transform to be of the same type, but that may not necessarily be the case. For example, my input is integers (word counts), and it'll be transformed into f64 (probabilities).
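To make the shape of the request concrete, here is one hypothetical way a Transformer-like trait could decouple input and output types. This is a sketch only, not rusty-machine's actual trait or the PR's implementation:

```rust
// Hypothetical sketch of a Transformer-like trait whose input and output
// types differ; names are illustrative and do not correspond to
// rusty-machine's real Transformer trait.

trait TransformerSketch<Input> {
    type Output;

    fn transform(&mut self, inputs: Input) -> Self::Output;
}

// For LDA the natural fit would be word counts in, probabilities out.
struct LdaSketch {
    topic_count: usize,
}

impl TransformerSketch<Vec<Vec<usize>>> for LdaSketch {
    // Per-document topic distributions.
    type Output = Vec<Vec<f64>>;

    fn transform(&mut self, inputs: Vec<Vec<usize>>) -> Self::Output {
        // Placeholder: a real implementation would run LDA inference here.
        inputs
            .iter()
            .map(|_| vec![1.0 / self.topic_count as f64; self.topic_count])
            .collect()
    }
}
```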

AtheMathmo commented 7 years ago

Also, Transformer currently requires the input and output of the transform to be of the same type

Ah yes, that is true. I'm not sure whether we want to lift this restriction, as future changes to the UnSupModel should give the behaviour you want. I'll have to think about things a little more...

dyule commented 7 years ago

So, I've made the changes promised. I've implemented LDA as a Transformer, which involved allowing Transformers to have different input and output types. I'm happy to switch it back to being a learning model if that's where you think it fits best.

What it comes down to is the semantics of a Transformer, and where LDA fits into that idea. Currently, LDA has much more in common with the other learning models than with the other transformers, which all have to do with pre-processing the input data, but I'm not sure what your long-term intention is there.

patrickmesana commented 7 years ago

Don't you mean Latent Dirichlet Allocation?

rohitjoshi commented 6 years ago

👍 waiting for merge

zackmdavis commented 6 years ago

waiting for merge

(Deputy acting co-maintainer here, but time-crunched at the moment; have made a note to take a look early next week)

zackmdavis commented 6 years ago

(or maybe this week)