Checkpoint and Continuation of training

asigalov61 / Morpheus

[DEPRECATED] Symbolic MIDI Music AI implementation

Apache License 2.0

Checkpoint and Continuation of training #1

Closed: Landers125 closed this issue 2 years ago

Landers125 commented 2 years ago

It is great that it saves every 4000 steps, but what about continuing training after a failure?

Landers125 commented 2 years ago

On the free Colab K80 I set a batch size of 8. It trains for 3-4 hours, then the session breaks off. How would we continue?

asigalov61 commented 2 years ago

@Landers125 Great question!!! I was thinking about implementing state save so that training can be continued after the failure. It would be quite useful indeed.

Unfortunately, it is not very straightforward to do, so at the moment it is not possible AFAIK. Sorry :(

I will definitely post an update to the implementation if I ever do it...

For now, you can try paperspace.com (they offer free runtimes of about 6 hours), use a smaller dataset/model size, or find an inexpensive GPU plan or something like that...

Hope this answers your questions.

Alex

Landers125 commented 2 years ago

number_of_batches = 14
2022-01-12_20-47-10

Thanks! The training session has launched on Kaggle.

You are doing a very useful thing!

asigalov61 commented 2 years ago

@Landers125 Thank you. I am happy that you enjoy my work. It means a lot to me :)

Yes, Kaggle and some other companies like paperspace offer GPU plans/free GPUs that are better than Google's. I am happy you found a good solution for your needs.

Alex

asigalov61 commented 2 years ago

@Landers125 Btw, you can technically restart the training after failure by loading the last checkpoint and the original dataset.

You can even set the final learning rate in the training section of the code.

The problem is that it will start training from the beginning of the dataset, which will be kinda redundant and not very effective.

I will look into it some more soon, I hope, and I will add it to the implementation if it is possible.
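
One possible way around the redundancy, as a minimal sketch and not something Morpheus implements: store the number of batches already trained on next to each checkpoint, then skip that many batches when restarting. The helper names (`save_progress`, `load_progress`, `resume_position`, `remaining_batches`) and the JSON progress file are hypothetical, and the sketch assumes the batches come back in the same order on every run (no reshuffling, or a fixed shuffle seed).

```python
import json
from itertools import islice
from pathlib import Path


def save_progress(path, step):
    """Record how many batches have been trained on so far (hypothetical helper)."""
    Path(path).write_text(json.dumps({"step": step}))


def load_progress(path):
    """Return the saved step count, or 0 if no progress file exists yet."""
    p = Path(path)
    return json.loads(p.read_text())["step"] if p.exists() else 0


def resume_position(step, number_of_batches):
    """Split a global step count into (completed epochs, batches done in the current epoch)."""
    return divmod(step, number_of_batches)


def remaining_batches(batches, step):
    """Return an iterator over the batches of the current epoch not yet trained on.

    Assumes `batches` is a list in the same order as in the interrupted run.
    """
    _, done = resume_position(step, len(batches))
    return islice(batches, done, None)
```

On restart, the idea would be to load the last model checkpoint as usual, call `load_progress` to get the step count, and loop over `remaining_batches(batches, step)` for the first epoch instead of the full list. Progress made after the last saved checkpoint is still lost, so this only avoids re-reading data the checkpoint has already seen.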