Closed Landers125 closed 2 years ago
On the free Colab K80, with a batch size of 8, training runs for 3-4 hours and then breaks off. How can we continue?
@Landers125 Great question!!! I was thinking about implementing state save so that training can be continued after the failure. It would be quite useful indeed.
Unfortunately, it is not very straightforward to do, and at the moment it is not possible AFAIK. Sorry :(
I will definitely post an update to the implementation if I ever do it...
For now, you can try paperspace.com (they give something like 6-hour free runtimes), use a smaller dataset/model size, or find an inexpensive GPU plan or something like that...
Hope this answers your questions.
Alex
number_of_batches = 14. Thanks! The training session has started on Kaggle.
You are doing a very useful thing!
@Landers125 Thank you. I am happy that you enjoy my work. It means a lot to me :)
Yes, Kaggle and some other companies like Paperspace offer GPU plans/free GPUs that are better than Google's. I am happy you found a good solution for your needs.
Alex
@Landers125 Btw, you can technically restart the training after failure by loading the last checkpoint and the original dataset.
You can even set the final learning rate in the training section of the code.
The problem is that it will start training from the beginning of the dataset, which will be kinda redundant and not very effective.
I will look into it some more soon, I hope, and I will add it to the implementation if it is possible.
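
A minimal sketch of one way to avoid restarting from the beginning of the dataset, not this repository's actual code: store the step counter next to the checkpoint and skip the batches that were already consumed when resuming. The names `save_state`, `load_state`, `train`, `state_path`, and `save_every` are hypothetical, the real training step is replaced by a placeholder, and the sketch assumes the batches come in the same order on every run (no reshuffling).

```python
import itertools
import json
import os


def save_state(path, step):
    """Write the step counter atomically so a crash cannot leave a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def load_state(path):
    """Return the saved step counter, or 0 when no state file exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def train(batches, state_path, save_every=4000, stop_after=None):
    """Run over `batches`, resuming after the last saved step.

    `stop_after` only simulates an interrupted session for demonstration.
    Returns the batches processed in this session.
    """
    start = load_state(state_path)
    processed = []
    # islice skips the batches that were consumed before the last save.
    remaining = itertools.islice(batches, start, None)
    for step, batch in enumerate(remaining, start=start + 1):
        processed.append(batch)  # placeholder for the real training step
        if step % save_every == 0:
            # In a real run, the model and optimizer would be saved here too.
            save_state(state_path, step)
        if stop_after is not None and step >= stop_after:
            break
    return processed
```

Work done after the last save is still repeated on resume, so a smaller `save_every` trades disk writes for less lost progress.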
It is great that it saves every 4000 points, but what about continuing after a failure?