I just pushed a new commit - the data is now generated in a slightly different manner than before.
I believe this is closer to the implementation from the paper. I only trained the model for 4 epochs, but the loss does seem to improve, even if very slowly. The timing also seems consistent with the paper's implementation: in private correspondence, the author stated the model took roughly a week to train on a single GPU over 500 epochs, and at our current ~20 minutes per epoch, 500 epochs would likewise take about a week.
I think we are on the right track here. The next step is to apply some training techniques to help the model train (data normalization, and possibly a more involved way of sampling batches). I am also continuing to reread the paper (and related papers) to see where we might have diverged from its implementation.
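As a starting point for the data normalization step, here is a minimal per-feature z-score sketch. It assumes the data is a 2-D NumPy array of shape (samples, features); the function names and the random stand-in data are illustrative, not from our codebase. The key point is that the statistics are fit on the training split only, so nothing leaks from held-out data.

```python
import numpy as np

def zscore_fit(train):
    # Per-feature mean and std, computed on the training split only.
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-8  # guard against zero-variance features
    return mean, std

def zscore_apply(x, mean, std):
    # Apply the training-set statistics to any split (train, val, test).
    return (x - mean) / std

# Illustrative usage with random data standing in for the real dataset.
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
mean, std = zscore_fit(train)
train_n = zscore_apply(train, mean, std)
```

The same `mean` and `std` would then be reused to transform validation data, rather than refitting per split.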