FluxML / Flux.jl

Relax! Flux is the ML library that doesn't make you tensor
https://fluxml.ai/

No error from negative learning rates #1982

Closed: KronosTheLate closed this issue 1 year ago

KronosTheLate commented 2 years ago

While using Bayesian optimisation for hyperparameter tuning, the learning rate was at some point estimated to be negative. This raised no error, as demonstrated in the example below:

Example:

```julia
julia> using Flux

julia> actual(x) = 4x + 2
actual (generic function with 1 method)

julia> x_train, x_test = hcat(0:5...), hcat(6:10...);

julia> y_train, y_test = actual.(x_train), actual.(x_test);

julia> predict = Dense(1 => 1);

julia> loss(x, y) = Flux.Losses.mse(predict(x), y);

julia> using Flux: train!

julia> opt = Descent(-1)
Descent(-1.0)

julia> data = [(x_train, y_train)]
1-element Vector{Tuple{Matrix{Int64}, Matrix{Int64}}}:
 ([0 1 … 4 5], [2 6 … 18 22])

julia> parameters = Flux.params(predict)
Params([Float32[0.6683914;;], Float32[0.0]])

julia> train!(loss, parameters, data, opt)
```

This should raise an error, right? If yes, at what point does it make the most sense to check for this?

KronosTheLate commented 2 years ago

It appears as if gradient ascent is performed, so this could in fact be useful if one wishes to maximise the loss.

Gradient ascent example:

```julia
julia> loss(x_train, y_train)
199.6509f0

julia> train!(loss, parameters, data, opt)

julia> loss(x_train, y_train)
85815.984f0

julia> train!(loss, parameters, data, opt)

julia> loss(x_train, y_train)
3.6921836f7

julia> train!(loss, parameters, data, opt)

julia> loss(x_train, y_train)
1.5885496f10

julia> train!(loss, parameters, data, opt)

julia> loss(x_train, y_train)
6.8346813f12
```

But still, not checking for positive learning rates in the presence of automated hyperparameter tuning methods seems like a bit of a footgun.

ToucheSir commented 2 years ago

Doing a brief survey:

Also of note is that PyTorch optimizers validate more than the learning rate.

KronosTheLate commented 2 years ago

Throwing is not great, as it does make sense to perform gradient ascent with negative learning rates. The current outcome makes sense, and throwing an error feels a bit like assuming more than needed about what the user "actually" intends to do.

How about using `@warn` the first time it occurs in a run, like scalar indexing of CUDA arrays? This would alert the user who did it unintentionally, while not stopping the one person who wants to perform gradient ascent and figured this was a clever way to do it.
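Roughly, I am imagining something like the sketch below. This is not Flux's actual code, just an illustration of a constructor that warns once per session (the `maxlog=1` keyword is standard Julia logging):

```julia
# Hypothetical sketch, not Flux's implementation: a Descent-like rule
# that warns (at most once per session) on a non-positive learning rate.
mutable struct MyDescent
    eta::Float64
    function MyDescent(eta = 0.1)
        if eta <= 0
            # maxlog=1 asks the logging system to emit this warning only once
            @warn "Non-positive learning rate η = $eta; this will perform gradient ascent." maxlog=1
        end
        new(eta)
    end
end
```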

ToucheSir commented 2 years ago

I'm not a big fan of making users tolerate warnings to use the full range of functionality of a library (we've used them mostly for deprecations and flagging edge cases so far), but perhaps this is an exception.

KronosTheLate commented 2 years ago

Right, but I would not consider negative learning rates part of the "full range of functionality"; it feels more like a way of breaking things. But because the way it breaks makes so much sense, throwing an error feels too restrictive. Still, doing nothing, as is the case now, is a bit of a rough edge for anyone using automated hyperparameter tuning.

KronosTheLate commented 2 years ago

I do not know what the governance structure is within Flux, but currently there are only 2 participants in this discussion. Is it possible to get a few more voices to agree or disagree with adding a warning for negative learning rates, so we can reach a conclusion? Would it be too much to create a Discourse post for it?

DhairyaLGandhi commented 2 years ago

I think it's a good idea to document the behaviour of negative learning rates (i.e. performing gradient ascent) in general. @Saransh-cpp might be able to help with this.

Hyperparameter tuners usually have a range or function that decides these parameters, so knowing the "grid points" they generate would be a task for the tuner.

darsnack commented 2 years ago

I prefer not to error on something that is completely valid in an unconventional context (at least at the level of Flux.jl). There are libraries built on top of Flux that do provide sanity checks, and this seems like an appropriate check to add there. Similarly, if there is a library doing hyper-parameter optimization, then it seems like that library's prerogative to add a check for negative LRs.

If there's something that needs to be added to Flux.jl to make this check convenient, then we can do that. But this seems like a fairly simple check.

Alternatively, if we can provide a configurable switch for "safe mode" and "unsafe mode" for Flux (maybe using Preferences.jl like Brian suggested in the other thread), then I think either an error or warning is acceptable.
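To sketch the Preferences.jl idea (the preference name `strict_optimisers` and the function below are invented for illustration, not an actual Flux API), code living inside a package module could look something like:

```julia
# Hypothetical sketch of a Preferences.jl-gated check inside a package module.
# The preference name "strict_optimisers" is made up for illustration.
using Preferences

const STRICT_OPTIMISERS = @load_preference("strict_optimisers", true)

function check_learning_rate(eta)
    if eta <= 0
        if STRICT_OPTIMISERS
            throw(ArgumentError("learning rate must be positive, got η = $eta"))
        else
            @warn "Non-positive learning rate η = $eta" maxlog=1
        end
    end
    return eta
end
```

A user who really wants gradient ascent could then flip the preference from their own environment and take the warning instead of the error.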

KronosTheLate commented 2 years ago

The problem for me then is that the hyperparameter tuning package I used, Hyperopt.jl, is agnostic to what the model is. It is not a given that the model being tuned has a learning rate.

Now, that is not a problem with Flux.jl. But I do not see much harm in adding a first-time-only printed warning, as very few users would ever encounter it. Of those who do, I am willing to bet that a non-trivial proportion do not intend to use negative learning rates.

I am not sure what the ramifications of a safe mode are, but that sounds like a fine solution ^_^

darsnack commented 2 years ago

That makes sense. It looks like the "function" in your Hyperopt code will be a call to train! or some equivalent? In general, if you want train! to be "safe" then I suggest using FluxTraining.jl which can support all these safety checks. FluxTraining is smart enough to support different types of training schemes all at once. i.e. Flux doesn't know whether a user is doing standard supervised training or some unconventional training that requires gradient ascent, so we'll always be in a guessing position. In contrast, FluxTraining can have different training phases and run only the sanity checks that make sense for a given phase.

For your specific issue, I am inclined to agree that this particular warning would not be the most annoying, so I'd be fine with including it. Presumably, we'll need to add some kind of getter/setter interface for the LR so that every rule doesn't need to implement the check independently.
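Very loosely, the interface I have in mind is something like the sketch below (the names `learning_rate` and `check_positive_lr` are invented here; the fallback assumes a rule stores its step size in an `eta` field, which most built-in rules happen to do):

```julia
# Hypothetical getter plus a single shared check, so individual rules
# don't each need to validate their own learning rate.
learning_rate(opt) = hasproperty(opt, :eta) ? opt.eta : nothing

function check_positive_lr(opt)
    eta = learning_rate(opt)
    if eta !== nothing && eta <= 0
        @warn "Optimiser $(typeof(opt)) has non-positive learning rate η = $eta" maxlog=1
    end
    return opt
end
```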

KronosTheLate commented 2 years ago

I have since realized that I should have used FluxTraining xD But I predict that others will try hyperparameter tuning by applying general tuning algorithms to Flux after me, which is why I suggest making changes. I have wasted a fair amount of computational resources doing gradient ascent at this point, and simply want to help others avoid the same.

What you mention about getter/setter interface is beyond me... But I think I have said my piece, and I do not think I will be of help as for the best way to implement it. Thanks for taking the issue seriously 🚀

ToucheSir commented 2 years ago

I still lean on the side of being restrictive and asking for forgiveness rather than permission. In other words, error on invalid inputs and relax those restrictions (ideally via Preferences or some similar mechanism) only when people start complaining about them. For cases like gradient ascent, I would consider those sufficiently niche that asking someone to implement their own rule to negate incoming gradients would not be out of line. The main issue with leaning on warnings by default is a) they're more easily swallowed up than errors, and b) users get conditioned to ignore them (and when they ignore one, they tend to ignore them all).
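For reference, such a rule is only a few lines against the implicit-params `Flux.Optimise` API used in the examples above. The `Ascent` name is made up, and the exact supertype and hook may differ between Flux versions, so treat this as a sketch rather than a recipe:

```julia
using Flux
import Flux.Optimise: apply!

# Hypothetical "Ascent" rule: identical to Descent except the update is
# negated, so update!/train! move parameters up the gradient.
mutable struct Ascent <: Flux.Optimise.AbstractOptimiser
    eta::Float64
end
Ascent() = Ascent(0.1)

# apply! returns the step that update! subtracts from the parameters;
# flipping its sign turns descent into ascent.
apply!(o::Ascent, x, Δ) = Δ .*= -o.eta
```

With the definitions from the first example in scope, `train!(loss, parameters, data, Ascent(0.1))` would then increase the loss while keeping the learning rate positive.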

darsnack commented 2 years ago

True, as I think about this more, Descent clearly implies only one thing. Asking for an Ascent rule to be defined is very reasonable. Okay, so maybe I'm more accepting that this is genuinely invalid behaviour and should error.

KronosTheLate commented 2 years ago

But if it applies to Descent (erroring on a negative LR), should all optimisers error on negative LRs? It is less obvious for other optimisers.

But yes, if one wants ascent they should simply apply some transformation like negation or inversion of the loss function they provide. With that, I feel like erroring on constructing optimisers with negative LR would be reasonable.
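For completeness, with the definitions from the original example in scope, that transformation is just a one-liner (sketch only):

```julia
# Maximise the original objective by minimising its negation,
# keeping the learning rate positive.
ascent_loss(x, y) = -loss(x, y)
train!(ascent_loss, parameters, data, Descent(0.1))
```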

mcabbott commented 2 years ago

+1 for either throwing an error, always, or else accepting the input.

I don't think warnings on things which might be mistakes or might be intended is a great idea. Nor adding global state via Preferences etc. Defining your own rule should be the simple way to clearly express that you want something unusual.

It may add quite a bit of complication to add these errors. If a negative learning rate is wrong, then e.g. a momentum outside 0 < ρ <= 1 seems about equally wrong. And what β values are allowed in Adam, say?

While errors may catch occasional fat-finger typos, their value for hyperparameter tuning seems limited. Are you going to wrap everything in try-catch to constrain parameters? Better would be for the tuning machine to be aware of these bounds, to explore only a sensible space. But this space probably wants tighter bounds than errors would encode anyway, e.g. 0 < η < 0.1.
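As a concrete illustration (just a sketch reusing `data`, `x_train` and `y_train` from the example at the top of the issue; the 1e-4 to 1e-1 bounds are arbitrary), the tuning objective itself can enforce sensible bounds without any try-catch:

```julia
# Sketch of a guarded tuning objective: reject out-of-range learning
# rates up front rather than relying on the optimiser to throw.
function objective(η)
    1e-4 <= η <= 1e-1 || return Inf      # illustrative bounds on η
    model = Dense(1 => 1)                 # fresh model per trial
    trial_loss(x, y) = Flux.Losses.mse(model(x), y)
    train!(trial_loss, Flux.params(model), data, Descent(η))
    return trial_loss(x_train, y_train)
end
```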

KronosTheLate commented 2 years ago

It is true that, generally, error checking is better done on the hyperparameter-tuning side of things than the Flux side of things. However, I proposed changes based on the following premise:

A non-trivial portion of the occurrences of negative learning rates are bugs.

This could be false, but my experience suggests it to be the case.

Are we disagreeing on the premise? Because if we aren't, it seems to me like something should be done to help out users.

ToucheSir commented 2 years ago

I don't think there's any contention over the premise, but Michael's point is that we probably won't be able to provide one-size-fits-all input ranges to validate all optimiser hyperparameters against. Without letting the perfect be the enemy of the good, though, we should try to capture as many as we can, and I think this thread might be a good place to do so.

Out of curiosity, what hyperparameter tuning algorithm are you using? I don't know of one which could generate a negative value for a parameter given an input range lower-bounded at 0, but I've also not used a ton of them.

KronosTheLate commented 2 years ago

Very fair point, thanks for the input.

> Out of curiosity, what hyperparameter tuning algorithm are you using? I don't know of one which could generate a negative value for a parameter given an input range lower-bounded at 0, but I've also not used a ton of them.

I was using Bayesian optimisation as implemented in Hyperopt.jl, which uses a kernel density estimator. Currently there is no mechanism to apply bounds to the kernel density estimator, but I made an issue asking for the ability to apply bounds. If by chance anyone is able to help out over there, that would be fantastic.

KronosTheLate commented 2 years ago

A relevant passage from my bachelor thesis, which is the context in which I encountered this issue:

"... the resulting learning rate will be negative. This did surprisingly not raise any errors, and was therefore not discovered until well into the data analysis."

I do not think it is uncommon for there to be some time between computational work and data analysis that would reveal potential errors like this, which is one of the key reasons why I think it would be extremely helpful if tooling would help detect it earlier.

ToucheSir commented 2 years ago

I think there's a relatively clear path for changes everyone can agree on making; all that's left is to make them. This is now marked "help wanted" because I'm not sure anyone has the bandwidth to tackle it in the near term.