-
I think the correct implementation should look like this:
```python
noise_scheduler = DDPMScheduler(
    beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000, c…
```
-
I'm observing sensitivity with respect to LR restarts in a typical SGDR schedule with cosine annealing, as in Loshchilov & Hutter. RAdam still seems to be doing better than AdamW so far, but the jumps imply possibl…
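For reference, the schedule in question can be sketched in a few lines. This is a minimal stand-alone implementation of SGDR's cosine-annealed LR with warm restarts; the parameter values (`eta_max`, `t0`, `t_mult`) are illustrative, not anyone's actual config:

```python
import math

def sgdr_lr(step, eta_max=0.1, eta_min=0.0, t0=10, t_mult=2):
    """Cosine-annealed learning rate with warm restarts (SGDR).

    step counts epochs (or iterations). The i-th cycle lasts t0 * t_mult**i;
    at each restart the LR jumps back to eta_max, which is exactly the jump
    that training can be sensitive to.
    """
    t_i, t_cur = t0, step
    while t_cur >= t_i:      # locate which restart cycle `step` falls in
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))
```

Plotting `sgdr_lr` over steps makes the restart discontinuities easy to line up against the loss spikes.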
-
Hi,
this is under the H2O Flow web interface, for deep learning grid search and single-model search. The "overwrite_with_best_model" option is sometimes missing, or appears as true/false switches.
If the button is clicked i…
-
Hello @ruotianluo and thanks for your code.
I've seen that some papers report results for the pure transformer applied to image captioning of around 1.285, and some 1.29. For example, in the paper **…
-
### Team Name:
zer0dynamics
### Project Description:
Gradient ascent in function space (GRAFS) [1] is an algorithm for optimal control synthesis that leverages functional expansions of cont…
-
According to this blog post: http://www.fast.ai/2018/07/02/adam-weight-decay/ and the article it mentions, https://arxiv.org/abs/1711.05101, Adam has problems when used with L2 regularization. If I understand…
-
**Aim**
Find out what layer norm actually does (i.e., its benefits and limitations) and why/how it's applied to transformers.
**Plan**
- [ ] [Understanding the Difficulty of Training Transformers](https:/…
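As a starting point for the aim above, the operation itself is small enough to write out. A minimal numeric sketch in plain Python (no framework; `gamma`/`beta` stand for the learned scale/shift):

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """LayerNorm over one feature vector: normalize to zero mean and unit
    variance across features, then apply a learned scale/shift.

    Unlike batch norm, the statistics come from a single example, so the op
    is independent of batch size and sequence position -- one reason it
    suits transformers. gamma/beta default to the identity here.
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / n
    y = [(xi - mean) / math.sqrt(var + eps) for xi in x]
    if gamma is not None:
        y = [g * yi + b for g, yi, b in zip(gamma, y, beta)]
    return y
```

The open question from the reading list is then where this sits relative to the residual stream (pre-LN vs. post-LN), not what the op computes.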
-
### Abstract
This proposal aims to transition the Course About, Catalog, and Index pages of the Open edX platform from the legacy architecture to an MFE. Historically, the catalog and about pages have been provided s…
-
The Theme for the Project: “An AI Solution for Communities”.
According to John McCarthy, the father of Artificial Intelligence (AI), it is "the science and engineering of making intelligent machine…
-
Using Dropout in child_model works well for preventing overfitting; however, it also causes the final performance of the model to vary significantly across training runs with the same hyper-params. It is too …
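The run-to-run variance is expected whenever the dropout masks are sampled from an unseeded RNG: each run sees a different mask sequence, so the "same hyper-params" do not mean the same training trajectory. A toy stand-in (purely illustrative, not the child_model code) shows that fixing the seed makes runs reproducible while different seeds diverge:

```python
import random

def train_with_dropout(seed, steps=1000, p=0.5):
    """Toy 'training run' whose outcome depends on sampled dropout masks.

    Accumulates a score through Bernoulli masks; stands in for a full run
    whose final metric depends on which units were dropped when.
    """
    rng = random.Random(seed)   # seeding the mask RNG pins the trajectory
    score = 0.0
    for _ in range(steps):
        mask = 1.0 if rng.random() > p else 0.0  # Bernoulli(1 - p) keep mask
        score += mask * 0.01
    return score
```

So to compare hyper-parameter settings fairly under dropout, either fix all RNG seeds or report mean and variance over several runs rather than a single number.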