Open zhixingheyi1102 opened 2 months ago
Hello, I would like to know what computational scale is required to train this model?
Hello. In practice, you can use a single 40GB A100 card to run most of the experiments in this paper, but we used 2-4 A100s to speed up training.
You can always use fewer parameters (fewer layers/heads). How large a model you need depends on the data you want to train on: the more data you have, the more a larger model helps, following scaling laws.
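For a rough feel of how reducing layers/heads changes the footprint, here is a minimal back-of-the-envelope sketch. It is not taken from the paper's code: it assumes a generic transformer whose width is `n_heads * head_dim` (with a hypothetical fixed `head_dim` of 64), uses the common approximation of about 12 * n_layers * d_model^2 parameters for the transformer blocks (embeddings excluded), and counts 16 bytes per parameter for fp32 training with Adam (weights, gradients, and two optimizer moments). Activation memory, which depends on batch size and sequence length, is not included.

```python
def transformer_block_params(n_layers: int, n_heads: int, head_dim: int = 64) -> int:
    """Approximate parameter count of the transformer blocks only.

    Assumption: d_model = n_heads * head_dim, MLP hidden size = 4 * d_model,
    so each layer holds roughly 12 * d_model^2 weights (biases and
    embeddings are ignored).
    """
    d_model = n_heads * head_dim
    return 12 * n_layers * d_model ** 2


def adam_fp32_state_gib(n_params: int) -> float:
    """Memory in GiB for weights + gradients + Adam moments in fp32.

    4 bytes (weights) + 4 bytes (gradients) + 8 bytes (two Adam moments)
    = 16 bytes per parameter. Activations are NOT included.
    """
    return 16 * n_params / 2**30


if __name__ == "__main__":
    # Hypothetical configurations, chosen only for illustration.
    for n_layers, n_heads in [(12, 12), (6, 8)]:
        n_params = transformer_block_params(n_layers, n_heads)
        print(
            f"layers={n_layers:2d} heads={n_heads:2d} "
            f"params~{n_params / 1e6:.1f}M "
            f"train-state~{adam_fp32_state_gib(n_params):.2f} GiB"
        )
```

Treat the numbers as an order-of-magnitude guide only. The actual architecture in the paper may count parameters differently, and activations often dominate memory at large batch sizes.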