You can check all steps in the README.md: https://github.com/aikupoker/deeper-stacker#creating-your-own-models
To summarize all steps:
You are going to save the model every N epochs:
params.save_epoch = 1
With this setting, the model is saved after every epoch.
How many epochs will there be?
params.epoch_count = 1000
How many samples will be used per batch for training and validation?
params.train_batch_size = 10000
You are going to have three files per epoch (model, info, info.txt).
At the end, if you finish all the training, you will have 3 x 1,000 = 3,000 files. Pick one of them and your river neural network is ready.
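Putting those steps together, the relevant slice of the config looks roughly like the sketch below (only the three parameter names and values come from this thread):

```lua
local params = {}  -- stand-in for DeeperStacker's params table

-- Checkpointing and training-length settings for the river network.
params.save_epoch = 1            -- save a checkpoint every epoch
params.epoch_count = 1000        -- total number of training epochs
params.train_batch_size = 10000  -- samples per train/validation batch

-- With save_epoch = 1 and three files per checkpoint (model, info,
-- info.txt), a finished run leaves 3 * 1000 = 3,000 files on disk;
-- one of them becomes the river network.
```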
You're the boss, thank you! :)
I am using several computers and instances to generate the training data, so it is spread out. But for training the model, I downloaded it all to one computer.
When I go back to generating data for the turn, I will distribute the model to all the computers. Do I also need to send the full river data (inputs and targets) to all the computers in order to generate the turn data? I ask because I have a slow internet connection; if I only need to send the model, it'd be faster.
No, you don't have to send the river training data, just the river network.
Thx
So now it's been running for approx 60 hours.
```
Training loss : 0.047943 min: 0.043298 max: 0.057413 learningRate: 0.000100
Validation loss: 0.071183 min: 0.061998 max: 0.074972
Validation progress: 0.101000 Last minimum found: 0 epoch back
Epoch took: 3915.648145 Timestamp: 08:27 +2h next time: 09:32
54 / 1000
SAVING MODEL
SAVED
```
But here's the thing (from the readme)...
| Network | # samples | # poker situations | Validation huber loss | Epoch |
| --- | --- | --- | --- | --- |
| River network | 100,000 | 1,000,000 | 0.0415 | 54 |
So my training loss is similar to the validation Huber loss in the readme, but my validation loss is way higher. Is training loss the same thing as validation Huber loss? Is it possible the readme is wrong?
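For what it's worth, the "Huber loss" here is the smooth-L1 loss; DeepStack-style trainers typically apply it (often with a mask over impossible hands) to both the training and the validation batches, so the readme's column should correspond to the "Validation loss" line in the log. A minimal Torch sketch, assuming the plain nn.SmoothL1Criterion rather than the repo's exact masked variant:

```lua
require 'nn'

-- Huber (smooth-L1) loss on a toy prediction/target pair.
-- nn.SmoothL1Criterion is the Huber loss with delta = 1 and, by
-- default, averages over the elements.
local criterion  = nn.SmoothL1Criterion()
local prediction = torch.Tensor{0.10, -0.25, 0.40}
local target     = torch.Tensor{0.12, -0.20, 0.35}
print(criterion:forward(prediction, target))  -- mean Huber loss
```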
Here's another possibility/question, from the PDF: "Training used a mini-batch size of 1,000, and a learning rate of 0.001, which was decreased to 0.0001 after the first 200 epochs. Networks were trained for approximately 350 epochs over two days on a single GPU, and the epoch with the lowest validation loss was chosen."
I am running the version of deeper-stacker that had a batch size of 10,000; I noticed it was changed to 1,000 today or last night. Does this explain the difference? The PDF reports over 5x as many epochs in less time, on a single GPU.
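For reference, the paper's schedule can be expressed with the parameter that shows up later in this thread; the helper below is only an illustration, not the repo's actual trainer code:

```lua
local params = {}  -- stand-in for DeeperStacker's params table

-- Schedule from the paper: learning rate 0.001, dropped to 0.0001
-- after the first 200 epochs.
params.learning_rate = 0.001
params.decrease_learning_at_epoch = 200

local function learning_rate_for(epoch)
  if epoch > params.decrease_learning_at_epoch then
    return 0.0001
  end
  return params.learning_rate
end

print(learning_rate_for(1))    -- 0.001
print(learning_rate_for(201))  -- 0.0001
```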
Can I stop it, update the code, and resume training where I left off?
To add to this, I stopped training because it was only getting worse with each additional epoch. The training loss is 0.048 at 54 epochs. That is close to the readme's figure, but again, that's the training loss, not the validation loss...
```
{ gpu : true valid_loss : 0.071182658773192 epoch : 54 learningRate : 0.0001 }
Training loss : 0.047717 min: 0.043629 max: 0.052792 learningRate: 0.000100
Validation loss: 0.071317 min: 0.062232 max: 0.075300
Validation progress: -0.189000 Last minimum found: 1 epoch back
Epoch took: 4137.798152 Timestamp: 09:36 +2h next time: 10:45
55 / 1000
SAVING MODEL
SAVED
Training loss : 0.047573 min: 0.043364 max: 0.053310 learningRate: 0.000100
Validation loss: 0.071277 min: 0.062033 max: 0.075256
Validation progress: -0.133000 Last minimum found: 2 epoch back
Epoch took: 4431.330257 Timestamp: 10:50 +2h next time: 12:04
56 / 1000
SAVING MODEL
SAVED
Training loss : 0.047463 min: 0.043327 max: 0.053187 learningRate: 0.000100
Validation loss: 0.071387 min: 0.062051 max: 0.075542
Validation progress: -0.287000 Last minimum found: 3 epoch back
Epoch took: 4008.790846 Timestamp: 11:57 +2h next time: 13:03
57 / 1000
SAVING MODEL
SAVED
```
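As an aside, the "Last minimum found: N epoch back" counter in these logs is standard best-so-far bookkeeping; a sketch of what the trainer is presumably doing (illustrative, not the repo's code):

```lua
-- Track the lowest validation loss seen so far, matching the
-- "Last minimum found: N epoch back" lines above. Illustrative only.
local best = { loss = math.huge, epoch = 0 }

local function after_epoch(epoch, valid_loss)
  if valid_loss < best.loss then
    best.loss, best.epoch = valid_loss, epoch
  end
  print(string.format('Last minimum found: %d epoch back',
                      epoch - best.epoch))
end

after_epoch(54, 0.071183)  -- Last minimum found: 0 epoch back
after_epoch(55, 0.071317)  -- Last minimum found: 1 epoch back
```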
Looks like I'll be retraining the model. Look at the difference between the first epoch of the previous version and the current one.
Old:
```
Loading Net Builder
101412 all good files
Training loss : 0.330624 min: 0.272472 max: 0.582653 learningRate: 0.001000
Validation loss: 0.282153 min: 0.271671 max: 0.293777
Validation progress: 254.417000 Last minimum found: 0 epoch back
Epoch took: 7873.572798 Timestamp: 22:56 +2h next time: 01:07
1 / 1000
SAVING MODEL
SAVED
```
Current:
```
Loading Net Builder
101412 all good files
Training loss : 0.192641 min: 0.089994 max: 0.607257 learningRate: 0.001000
Validation loss: 0.126292 min: 0.097691 max: 0.174276
Validation progress: 691.813000 Last minimum found: 0 epoch back
Epoch took: 6216.193645 Timestamp: 15:04 +2h next time: 16:48
1 / 500
SAVING MODEL
SAVED
```
Also, don't you think params.decrease_learning_at_epoch = 200 should be params.decrease_learning_at_epoch = 50, or no? I see why it's 200, to stay consistent with the research, but I'll let you know what I get with 50 once it starts spinning its wheels.
I retrained the river with params.decrease_learning_at_epoch = 50 and params.train_batch_size = 1000:
```
Training loss : 0.029432 min: 0.018800 max: 0.053388 learningRate: 0.000100
Validation loss: 0.055863 min: 0.030285 max: 0.085242
Validation progress: 0.136000 Last minimum found: 0 epoch back
Epoch took: 4143.692887 Timestamp: 13:01 +2h next time: 14:10
59 / 500
SAVING MODEL
SAVED
```
Much better, but it's strange that in the readme the validation huber loss at epoch 54 was supposedly 0.0415, whereas at epoch 60 I was at 0.0558.
I don't know what would account for that, given that I used the same settings and the same number of poker situations.
Update 1: After I hit the submit button, I went back and checked, and there was an update to the log file:
```
Training loss : 0.330624 min: 0.272472 max: 0.582653 learningRate: 0.001000
```
So it's not dead. I find it strange that it doesn't show any CPU or GPU usage, but at least it's not dead.
Update 2: Now I have the per-epoch files (model, info, info.txt).
Is all that normal?
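As a final aside, since the paper picks the epoch with the lowest validation loss, those per-epoch info files can be scanned for the best checkpoint. A sketch, assuming an epoch_%d.info naming pattern and the valid_loss field from the info dump quoted earlier:

```lua
require 'torch'

-- Scan the saved .info files and report the checkpoint with the
-- lowest validation loss. The file name pattern and the valid_loss /
-- epoch fields are assumptions based on the info table quoted above.
local best_loss, best_epoch = math.huge, nil
for epoch = 1, 59 do  -- epochs trained so far
  local ok, info = pcall(torch.load, string.format('epoch_%d.info', epoch))
  if ok and info.valid_loss < best_loss then
    best_loss, best_epoch = info.valid_loss, epoch
  end
end
print(best_epoch, best_loss)
```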