Model trait APIs - Githubissues

theotherphil commented 8 years ago

Continuing the discussion on model traits started in the comments for https://github.com/AtheMathmo/rusty-machine/pull/120.

Suggestions for possible changes:

Separate the notions of modelling method and model

(Or model and fit, or trainer and model, or some other names to be decided.)

The new functions would have signatures along the lines of:

Trainer::train : training data -> Model Trainer::predict : (data, Model) -> Prediction

Advantages:

Removes the possibility of calling predict on an untrained model.
Clearly delineates the model fit data from the training algorithm. This will be useful when implementing serialisation.

Drawbacks:

API could be less intuitive for people familiar with popular current machine learning libraries.

Further questions:

How should online training be dealt with? What about updating models to take account of new data without completely retraining (e.g. updating class distributions in the leaves of a random forest)?

Associated types

Currently SupModel<T, U> takes its input and output types as type parameters. Should they be associated types instead?

Advantages:

Fits more closely with the typical meanings of type parameters vs associated types in Rust (I think!). In general, the users of a model don't get to choose the types it acts on - these are determined by the model itself (some models do give the user some say over the types used, but in this case these should be parameters to the specific model rather than the trait).
Shrinks signatures of functions which are generic over models. f<M, T, U> where M: SupModel<T, U> becomes f<M: SupModel>.

Drawbacks:

I can't think of any off the top of my head, but I may well be missing something.

Other bits and pieces

Algorithms using randomness should always let the user provide a seed. Otherwise regression testing becomes impossible.
Should we consider the algebraic traits used in HLearn? This might give us efficiency wins in some cases, but might also scare away potential users.

Caveat: I have very little rust experience, and so don't really know what I'm talking about!

AtheMathmo commented 8 years ago

Thanks for opening this - it will definitely need some discussion!

I'm (finally!) writing some code using this library

Awesome!

As for the signatures, there is no good reason. I hadn't really imagined a situation in which code would be generic over models when I first created these traits. Now I can see it being more necessary with things like cross validation.

This library was my first ever piece of rust code and you can tell from the API :). I'm going to put some work into trying to bring it up to date soon (hopefully the next couple of days). I'd like to try and support the trait system you brought up in #120 . And additionally I'll introduce associative traits too. Once those are in place I think I'll find it a little easier to visualize and discuss other things we can change.

Could you edit in the trait changes you brought up in #120 into the body of this issue?

theotherphil commented 8 years ago

Sure, I've paraphrased the discussion there and added a few more comments.

AtheMathmo commented 8 years ago

I agree with pretty much everything you've written - it's a good summary of why I'd like to explore these things. One thing I'd add is that the separation of training and model also lets us employ some rust patterns nicely - things like the builder pattern start to feel a lot more natural. (We build a training algorithm and finally train it to get a model let model = KMeans::new(2).iters(100).init(Forgy).train(&inputs)).

You mention that users do not choose the types for the models - right now this is temporary as we would like to allow users to choose which floating point types to use (it defaults to f64 across the board right now).

As for your other bits and pieces comments - I agree with both. But I think they are probably out of scope for this work.

The random seeding is definitely relevant (and would help our own testing) but it's something I'd put off as less important while we try to figure out the best API. For example should the seed be a model parameter, part of the training traits, or captured elsewhere? (Probably the first or last).

I'd been pointed towards HLearn before and it is really cool. We seem to share quite similar goals though I think that trying to move towards the algebraic representations they have would be difficult and I'd prefer to improve performance in other ways first. (Also note that I don't know Haskell which makes it harder for me to learn from the project).

theotherphil commented 8 years ago

I imagined that to choose the floating point type for models you'd have something like:

trait SupModel {
    type Inputs;
    type Outputs;
    ...
  }

  struct GMM<T> { ... }

  impl SupModel for GMM<T> {
    type Inputs = Matrix<T>;
    type Outputs = Matrix<T>;
  ...
  }

This seems to better fit the approach described here: https://doc.rust-lang.org/book/associated-types.html.

I mentioned HLearn in this context because the typeclasses it uses are used to model properties like "can be trained in parallel" or "supports online training". Their approach might be relevant when deciding the equivalent APIs here. The website has links to papers and blog posts that cover everything relevant - the maths should hopefully be easy to follow even without knowing any Haskell.

AtheMathmo commented 8 years ago

Yeah the example there looks good. I was just referring to this statement:

the users of a model don't get to choose the types it acts on - these are determined by the model itself

It should be possible with associated types as you've done above - though I'd like to confirm with the compiler first :).

There is definitely a lot to learn from HLearn and hopefully I'll have some time to digest it a little soon.

theotherphil commented 8 years ago

I've implemented a very WIP version of cross validation, but 1. it's hard-coded to only accept models with inputs and targets of type Matrix<f64>, and 2. this means that it's impossible to write a non-copy version, as there's no way of selecting a random subset of rows and passing them to the train or predict functions of these models.

Options:

Switch models to using Iterator<Item=&[f64]> (which can be implemented for Matrix if it's not already).
Accept that cross-validation will involve copying.

If we go with 2. we can still either avoid any allocation by shuffling the input rows in place (this sounds like a bad idea), or limit ourselves to allocating a single array of size (k-1)/k * input data size. Maybe the latter option is acceptable?

Datasets also need to be split like this when training random forests (and presumably for some other learning algorithms), which might be an argument in favour of 1. Although it's possible that copying data around, and even allocating, might end up cheaper overall in some cases due to improved locality. Maybe?

AtheMathmo commented 8 years ago

Have responded in #114 just to minimize noise here for anyone else reading.

AtheMathmo commented 8 years ago

I have some proof of concepts for the ideas discussed here:

These have a few blocks before they make it in - and need some review and feedback. But at least it shows that things work! And actually the compiler plays a lot more nicely than I'd expected for the change to accept MatrixSlice.

AtheMathmo / rusty-machine

Model trait APIs #124