Closed weiaiF closed 7 months ago
Hi, thanks for the question! In our work we struck a balance between performance and simplicity: for better convergence of the backbone, we trained the diffusion decoder and the backbone separately, and for simplicity we did not extend this to a three-stage training process (that is, we did not add a final fine-tuning stage that trains the diffusion decoder and transformer backbone together). However, we have been running experiments that cover two cases which should answer the question in detail: 1) training the diffusion decoder and transformer backbone together from scratch, and 2) adding a joint fine-tuning stage after the two-stage training process. The results will be available in a week, and we will post a follow-up comment.
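To make the staged schedule concrete, here is a minimal, framework-free sketch of the training flow described above. All names (`set_trainable`, `two_stage_schedule`, the `"backbone"` / `"diffusion_decoder"` parameter groups) are placeholders for illustration, not the repo's actual API; a real implementation would toggle `requires_grad` on the corresponding modules instead of boolean flags.

```python
# Toy illustration of the staged training schedule discussed above.
# "Parameter groups" are just named flags here; in practice you would
# freeze/unfreeze the corresponding PyTorch modules.

def set_trainable(params, names, trainable):
    """Mark the listed parameter groups as trainable (True) or frozen (False)."""
    for name in names:
        params[name] = trainable
    return params

def two_stage_schedule(joint_finetune=False):
    """Return (stage_name, trainable_groups) pairs for the whole run."""
    params = {"backbone": False, "diffusion_decoder": False}
    stages = []

    # Stage 1: train the transformer backbone alone for better convergence.
    set_trainable(params, ["backbone"], True)
    set_trainable(params, ["diffusion_decoder"], False)
    stages.append(("backbone", {k for k, v in params.items() if v}))

    # Stage 2: freeze the backbone and train the diffusion decoder on top of it.
    set_trainable(params, ["backbone"], False)
    set_trainable(params, ["diffusion_decoder"], True)
    stages.append(("decoder", {k for k, v in params.items() if v}))

    # Optional stage 3: joint fine-tuning of both components -- the extra
    # stage deliberately omitted in the two-stage setup for simplicity.
    if joint_finetune:
        set_trainable(params, ["backbone", "diffusion_decoder"], True)
        stages.append(("joint", {k for k, v in params.items() if v}))

    return stages
```

Calling `two_stage_schedule()` yields the two-stage setup used in the paper's released models, while `two_stage_schedule(joint_finetune=True)` appends the third, joint fine-tuning stage under discussion.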
We have uploaded the checkpoint for STR(CKS)-16M and added its performance numbers to the new performance section. Closing this issue; feel free to reopen if you have more questions.
I have a question about the mentioned two-stage training process. What is the final model performance if you train the diffusion decoder and transformer backbone together?