Hi! Very glad to hear you are using our repo and experimenting with SPT. With SPT you should see an increase in performance very early in training if your hyperparameters are tuned; you can check out the ones we used in the paper (Appendix C.1).
As for the issue you mention, it's hard to diagnose from the config file alone; it would help if you could share the run command as well.
If I understand correctly, you are pretraining with a next-token-prediction objective. If that is the case, please note that you are allowing the model to attend to both backward and forward windows (i.e., future tokens), which we didn't try; my guess is that this makes the pretraining far less effective, since the token to be predicted is already visible in the forward window (see the mask sketch below).
For PathX the default config uses chunked attention due to the long sequence length. For causal training you need to specify the attention mode explicitly, along with the windows each position may attend to:

```
model.layer.causal=true model.layer.look_forward=0 model.layer.look_backward=<num_windows>
```
In the experiments listed in the paper I always used `<num_windows>=1`.
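To make these flags concrete, here is a minimal sketch of how `causal`, `look_forward`, and `look_backward` could translate into a chunked attention mask. This is an illustration under assumed semantics, not the repo's actual implementation; the function name, chunk size, and tensor layout are all hypothetical:

```python
import torch

def chunked_attention_mask(seq_len, chunk_size, look_backward, look_forward, causal):
    """Boolean mask: entry [q, k] is True when query position q may attend key k.

    Each position sees its own chunk, `look_backward` preceding chunks and
    `look_forward` following chunks; causal=True additionally masks out
    keys that lie in the future of the query.
    """
    pos = torch.arange(seq_len)
    chunk = pos // chunk_size                 # chunk index of every position
    dq = chunk[:, None] - chunk[None, :]      # query chunk minus key chunk
    mask = (dq <= look_backward) & (dq >= -look_forward)
    if causal:
        mask &= pos[None, :] <= pos[:, None]  # forbid attending to future tokens
    return mask

# The causal setting used in the paper: look_forward=0, look_backward=1.
m = chunked_attention_mask(seq_len=8, chunk_size=2,
                           look_backward=1, look_forward=0, causal=True)
print(m.int())  # token t attends only tokens <= t in its own and previous chunk
```

With `causal=true`, `look_forward=0`, `look_backward=1`, token t never sees its prediction target; with `look_forward>0` or `causal=false`, future tokens leak into the context, which would make next-token prediction trivially easy.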
I hope this solves the issue; please let me know otherwise.
Hi, we tried to reproduce the results of the Transformer model on PathX as reported in the paper (Transformer + Causal SPT), but could not achieve the corresponding results.
The result reported in the paper is 88.05.
In our experiments, with or without SPT, the Transformer model only achieves random-guess accuracy on PathX. We wondered if there was something wrong with the way we launched the experiments.
Here are some details.
First, we started SPT with the following configuration:

[configuration screenshot]

The curves during training are as follows:

[plots: train loss; performance on the val set]
After training for 6 epochs, the next-token-prediction accuracy reached 99% and we considered it converged.
Then, we fine-tuned the pretrained model above on PathX. The configuration is as follows:

[configuration screenshot]

The curves during training are as follows:

[plots: train loss; performance on the val set]
In the FT phase, the loss did not decrease and the model stayed at random-guess accuracy.
We would like to know whether these training curves are as expected. Could you provide the detailed parameters that reproduce the results of this experiment in the paper, including the number of training epochs in the SPT and FT phases?