Sent a notice to each of the repos for further comments.
Hi. Thanks for raising the issue. Comparisons/benchmarks of Mish vs. Swish, GELU and SELU have been presented in this paper/repository. PAU and xUnit are not on my priority list to compare against, but I can definitely run some experiments in the coming week. Additionally, xUnit is a block and not a function, so the more sensible approach would be to replace the non-linearity inside xUnit with Mish and compare the two variants. Also, the other activations mentioned in the links you have posted are not on my to-do list currently. As I see it, those activation functions are pretty exotic; we usually compare against activation functions used in general practice. But if you'd like to benchmark against some of the activations in those lists that have not already been compared, feel free to report the results. Thanks! I will be closing the issue for now. Please re-open it at your own discretion.
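For concreteness, here is a rough PyTorch sketch of what that swap could look like. The gating block below only approximates the xUnit layout (the exact layer ordering is in the xUnit paper), and the class names are just illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish: x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

class XUnitLike(nn.Module):
    """Illustrative xUnit-style spatial gating block.
    Only an approximation of the original block; the point is that the
    inner non-linearity is a pluggable module."""
    def __init__(self, channels, kernel_size=9, act=None):
        super().__init__()
        self.act = act if act is not None else nn.ReLU()
        self.bn1 = nn.BatchNorm2d(channels)
        self.dw_conv = nn.Conv2d(channels, channels, kernel_size,
                                 padding=kernel_size // 2, groups=channels)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        g = self.act(self.bn1(x))       # inner non-linearity (ReLU or Mish)
        g = self.bn2(self.dw_conv(g))   # depthwise spatial context
        return x * torch.exp(-g * g)    # Gaussian-style gating map

relu_block = XUnitLike(64)              # baseline variant
mish_block = XUnitLike(64, act=Mish())  # Mish variant to compare against
```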
As I see it, those activation functions are pretty exotic; we usually compare against activation functions used in general practice.
It is their exotic nature that makes them interesting to compare, as they may contain hidden information about what an optimal activation function should or should not look like. I would like to look into this as well.
For reference, one of the papers, "Searching for Activation Functions", has a GitHub repository at https://github.com/Neoanarika/Searching-for-activation-functions and it might be possible to integrate that into the tests.
@DonaldTsang "Searching for Activation Functions" is the paper that introduced Swish. All of my tests have compared Mish with Swish. What exactly do you mean by tests?
@digantamisra98 the paper itself lists other "exotic forms" (not Swish itself) in Table 2 that are not in the table in the README of Mish; I would assume that is due to differences in naming schemes? If it is not just a difference in naming schemes, and there are some activation functions that could be integrated into the repo for benchmarks, that would be great.
@DonaldTsang The authors of that paper used a reinforcement learning algorithm to search the function space for the best possible non-linear function that qualifies as an activation function. Of all the candidates obtained in that search, Swish performed the best, and hence I used Swish as the comparison benchmark against Mish rather than the other activations the algorithm found in that paper.
@digantamisra98 So the other candidate functions in https://github.com/Neoanarika/Searching-for-activation-functions/blob/master/src/rnn_controller.py#L22 might not be as useful or as common, but still worth exploring, I would assume? Or are you saying that the activation functions listed in the paper itself are "filler"?
@DonaldTsang The other activations found by the search in that paper were not as efficient as Swish. Quoting from the paper's abstract itself:
Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, f(x) = x · sigmoid(βx), which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets.
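For reference, that formula as a small PyTorch module (a minimal sketch; β can be fixed, e.g. β = 1 which gives SiLU, or learned, as described in the paper):

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    """Swish: x * sigmoid(beta * x), with beta either fixed or trainable."""
    def __init__(self, beta=1.0, trainable=False):
        super().__init__()
        if trainable:
            self.beta = nn.Parameter(torch.tensor(float(beta)))
        else:
            self.register_buffer("beta", torch.tensor(float(beta)))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)
```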
But would they at least have some historical significance, as some are "close calls", e.g. (atan(x))**2 − x and cos(x) − x?
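If anyone wants to drop them into a benchmark, here is a minimal PyTorch sketch of those two candidates (function names are just illustrative):

```python
import torch

def atan_sq_minus_x(x):
    # (arctan(x))^2 - x
    return torch.atan(x) ** 2 - x

def cos_minus_x(x):
    # cos(x) - x
    return torch.cos(x) - x

# quick sanity check on a small grid
x = torch.linspace(-3, 3, 7)
print(atan_sq_minus_x(x))
print(cos_minus_x(x))
```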
For notes though, max(x, tanh(x)) is basically ISRLU with tanh instead of ISRU, which can also be switched out for atan or softsign.
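A rough sketch of that comparison, assuming the standard ISRLU definition (identity for x >= 0, x / sqrt(1 + a·x^2) for x < 0); helper names are illustrative:

```python
import torch

def max_x_tanh(x):
    # identity for x >= 0, tanh(x) for x < 0 (tanh(x) > x there)
    return torch.maximum(x, torch.tanh(x))

def isrlu(x, alpha=1.0):
    # ISRLU: identity for x >= 0, inverse square root unit for x < 0
    return torch.where(x >= 0, x, x / torch.sqrt(1 + alpha * x * x))

def isrlu_variant(x, neg_fn=torch.atan):
    # same split, with the negative branch swapped out,
    # e.g. torch.atan or torch.nn.functional.softsign
    return torch.where(x >= 0, x, neg_fn(x))
```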
@DonaldTsang My current work on Mish is focused more on Mean Field Theory, specifically finding the Edge of Chaos and the Rate of Convergence for Mish. These are more relevant since they help in understanding what an ideal activation function looks like.
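For context, here is a rough numerical sketch of the kind of mean-field calculation this refers to: iterate the variance map to its fixed point q* and compute the susceptibility chi1 = sw2 * E[phi'(sqrt(q*) z)^2]; the edge of chaos is where chi1 crosses 1 as sw2 is varied. This is only a generic Monte Carlo sketch of the standard recursion, not the author's actual analysis, and the helper names are illustrative:

```python
import torch
import torch.nn.functional as F

def mish(x):
    # Mish: x * tanh(softplus(x))
    return x * torch.tanh(F.softplus(x))

def q_fixed_point(phi, sw2, sb2, iters=100, n=100_000, q0=1.0):
    # iterate the mean-field variance map  q <- sw2 * E_z[phi(sqrt(q) * z)^2] + sb2
    z = torch.randn(n)
    q = torch.tensor(q0)
    for _ in range(iters):
        q = sw2 * phi(q.sqrt() * z).pow(2).mean() + sb2
    return q

def chi1(phi, sw2, qstar, n=100_000):
    # chi1 = sw2 * E_z[phi'(sqrt(q*) * z)^2]; the derivative comes from autograd
    x = (qstar.sqrt() * torch.randn(n)).requires_grad_(True)
    (dphi,) = torch.autograd.grad(phi(x).sum(), x)
    return sw2 * dphi.pow(2).mean()

# scan sigma_w^2 at a fixed sigma_b^2; look for where chi1 crosses 1
sb2 = 0.1
for sw2 in (1.0, 1.5, 2.0, 2.5):
    q = q_fixed_point(mish, sw2, sb2)
    print(f"sw2={sw2:.2f}  q*={q.item():.3f}  chi1={chi1(mish, sw2, q).item():.3f}")
```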
Just wondering if all of the activation functions have been addressed in the README.