TomasBeuzen / pybeach

A Python package for locating the dune toe on cross-shore beach profile transects.
MIT License
18 stars 12 forks source link

Suggestion: Improve ML performance by only loading model file once #9

Closed chrisleaman closed 3 years ago

chrisleaman commented 4 years ago

I've run into a bit of a performance issue when trying to predict dune toes for 1000's of profiles using the machine learning approach. I thought that doing this should be quite fast since everything is built on numpy. Example of the code I'm trying to run is:

# List of x and y coordinates for a number of profiles
x_s = [(0,1, 2, 3, ...), (0, 1, 2, 3,...), ...]
y_s = [(0.1, 0.2, 0.3,...), (0.1, 0.2, 0.3,...), ...]

# Iterating over the profile coordinates is not as fast as it should be
for x, z in zip(x_s, ys):
  p = Profile(x, z)
  toe_ml, prob_ml = p.predict_dunetoe_ml('wave_embayed_clf')

I found the problem to be here: https://github.com/TomasBeuzen/pybeach/blob/10d9cf28134edce25adb8f65584a7b755426455c/pybeach/beach.py#L151-L155 Basically, everytime we make a new profile, we are reloading the model file which is an I/O bottleneck when everything else is numpy based. It'd be better if we only loaded the model file once and then predicted the profiles based on that. I think that making a new class for each predictor might be a better way to go? Something like this as a usage example:

# For machine learning:
ml_predictor = MLPredictor(model='mixed_clf')  # load model file once

# Now can reuse the predictor without reloading the model file
profile1_x_toe_ml = ml_predictor.predict(x=profile1_x, z=profile1_z)
profile2_x_toe_ml = ml_predictor.predict(x=profile2_x, z=profile2_z)

# You'd also have a new class for each other method
rr_predictor = RRPredictor(window_size=21)
profile1_x_toe_rr = rr_predictor.predict(x=profile1_x, z=profile1_z)

I'm happy to put in a pull request for this, but it'd involve refactoring and rearranging the code. Just wanted your thoughts @TomasBeuzen if you're happy with this, or there could be any other alternatives to improve performance?

TomasBeuzen commented 4 years ago

Hm thanks for pointing this out. I like the idea of classes, but I originally didn't go down that path because I felt it was easier and more efficient to have the different models implemented as methods of a base class. The only other solution I can think of at this point is to make the ML model an argument to the predict_dunetoe_ml method. The model can be pre-loaded using some function. In that case the code would work like this:

clf = load_model('mixed_clf')

for x, z in zip(x_s, ys):
  p = Profile(x, z)
  toe_ml, prob_ml = p.predict_dunetoe_ml(model=clf)

Making everything into classes will require a lot of refactoring, but perhaps it is the best way to go. I wanted to do some cleaning up anyway and dump all the dtype checking in the main file to a utility script anyways. Let me think on it over the next few days :)

chrisleaman commented 4 years ago

No worries! Making the model an argument also looks like it would work as well ๐Ÿ‘ Just let me know if you want help - I'm more than happy to put in a PR depending what you think'll be best ๐Ÿ˜„

TomasBeuzen commented 4 years ago

@chrisleaman

Okay so I've finally had clear headspace to think about this. Everything in pybeach is vectorized. If your profiles all share the same x-coordinate, you can pass in a m*n matrix of z-values (where m is the number of profiles and n is the same length as x). So your code above can be written like this:

# List of x and y coordinates for a number of profiles
x = [0,1, 2, 3, ...)] # all profiles share same x-coordinate
y_matrix = [(0.1, 0.2, 0.3,...), (0.1, 0.2, 0.3,...), ...]

# Create a single Profile class for handling all profiles (ML model will be loaded only once)
p = Profile(x, y_matrix)
toe_ml, prob_ml = p.predict_dunetoe_ml('wave_embayed_clf') # toe ml will have length equivalent to number of rows in  y_matrix

As mentioned, pybeach assumes that all profiles share the same x-coordinate (which in your toy example seems to be the case). However, I have not yet implemented handling of the case where you pass in a matrix of different x-coordinate vectors which correspond to a matrix of profiles. That's on the list and I'll implement in the coming weeks.

chrisleaman commented 4 years ago

Ah nice! I hadn't thought of passing a matrix but that makes sense. One thing doing it this way is that n would have to be the same for all profiles, which means you'd have to pad the matrix with np.nan's for short profiles. I wonder if this would be a problem for the ML approach.

Otherwise modifying the code to handle a matrix of x-coords as well sounds like a good idea. Feel free to assign any tasks/issues to me as well ๐Ÿ˜Š

TomasBeuzen commented 3 years ago

Doing some repo clean-up...

Closing this as there's no further action required.