aikupoker / deeper-stacker

DeeperStacker: DeepHoldem Evil Brother

Training Model Not Doing Anything? #6

Closed herrefirh closed 5 years ago

herrefirh commented 5 years ago

Update 1: After I hit the submit button I went back and checked and there was an update to the logfile:

Training loss : 0.330624 min: 0.272472 max: 0.582653 learningRate: 0.001000

So it's not dead. I find it strange that it doesn't seem to use any CPU or GPU, but at least it's running.

Update 2: Now I have epoch files (model, info, info.txt).

Validation loss: 0.282153  min: 0.271671  max: 0.293777 
Validation progress: 254.417000     Last minimum found: 0 epoch back    
Epoch took: 7873.572798  Timestamp: 22:56 +2h   next time: 01:07        
1 / 1000   

Is all that normal?

Original message deleted to save space

aikupoker commented 5 years ago

You can check all steps in the README.md: https://github.com/aikupoker/deeper-stacker#creating-your-own-models

To summarize all steps:

You are going to save the model every N epochs:

params.save_epoch = 1

By default, the model will be saved after every epoch.

How many epochs are there going to be?

params.epoch_count = 1000

How many samples are going to be used for training and validation in each batch?

params.train_batch_size = 10000

Each saved epoch produces three files (model, info, info.txt).

At the end, if you finish the whole training run, you will have 3 × 1,000 = 3,000 files. You should pick one of them, and your river neural network will be ready.
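Putting those together, a minimal sketch of the relevant settings (the parameter names are the ones quoted above; the surrounding params table layout is an assumption):

```lua
-- Sketch of the training settings discussed above; the params table layout
-- is assumed, only the parameter names come from the repo.
params = params or {}

params.save_epoch = 1            -- save a checkpoint every epoch
params.epoch_count = 1000        -- total number of epochs
params.train_batch_size = 10000  -- samples per training/validation batch

-- Each saved epoch writes three files (model, info, info.txt), so a full
-- run saves params.epoch_count * 3 = 3,000 files in total.
```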

herrefirh commented 5 years ago

You're the boss, thank you! :)

herrefirh commented 5 years ago

I am using several computers and instances to generate the training data, so it is spread out. But for training the model, I downloaded the data to one computer.

When I go back to generating data for the turn, I will distribute the model to all the computers. Do I need to send the full river data (inputs and targets) to all the computers as well to generate data for the turn? I ask because I have a slow internet connection; if I only need to send the model, it'd be faster.

aikupoker commented 5 years ago

No, you don't have to send the river training data, just the river network.

herrefirh commented 5 years ago

Thx

herrefirh commented 5 years ago

So now it's been running for approx 60 hours.

Training loss  : 0.047943  min: 0.043298  max: 0.057413  learningRate: 0.000100 
Validation loss: 0.071183  min: 0.061998  max: 0.074972 
Validation progress: 0.101000     Last minimum found: 0 epoch back      
Epoch took: 3915.648145  Timestamp: 08:27 +2h   next time: 09:32        
54 / 1000       
SAVING MODEL    
SAVED 

But here's the thing (from the readme)...

| Network | # samples | # poker situations | Validation huber loss | Epoch |
| --- | --- | --- | --- | --- |
| River network | 100,000 | 1,000,000 | 0.0415 | 54 |

So my training loss is similar to the validation huber loss, but my validation loss is way higher. Is training loss the same thing as validation huber loss? Is it possible that the readme is wrong?
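For reference, the training and validation losses in the log are presumably the same criterion evaluated on different data. A minimal Torch sketch, assuming the stock nn.SmoothL1Criterion (a Huber loss with delta = 1, which may differ from the criterion deeper-stacker actually uses):

```lua
-- Minimal sketch: evaluating a Huber-style loss in Torch. Assumes the stock
-- nn package; nn.SmoothL1Criterion is a Huber loss with delta = 1, which may
-- differ from deeper-stacker's actual criterion.
require 'nn'

local criterion = nn.SmoothL1Criterion()

local outputs = torch.Tensor({0.10, 0.30, 0.50})  -- network predictions
local targets = torch.Tensor({0.12, 0.25, 0.55})  -- training targets

-- "Training loss" is this value on training batches; "validation (huber)
-- loss" is the same value on held-out batches, so they are comparable.
print(criterion:forward(outputs, targets))
```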

Here's another possibility/question, from the PDF: "Training used a mini-batch size of 1,000, and a learning rate of 0.001, which was decreased to 0.0001 after the first 200 epochs. Networks were trained for approximately 350 epochs over two days on a single GPU, and the epoch with the lowest validation loss was chosen."

I am running the version of deeper-stacker that had a batch size of 10,000. I noticed it was changed to 1,000 today or last night. Does this explain the difference? The PDF gets over 5x as many epochs in less time, on a single GPU.

Can I stop it, update the code, and resume training where I left off?
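I don't know whether deeper-stacker has a built-in resume flag, but the per-epoch checkpoints make a manual restart possible in principle; a minimal Torch sketch, with a hypothetical checkpoint file name:

```lua
-- Minimal sketch of reloading a saved checkpoint in Torch. The file names are
-- hypothetical; use whatever the model/info files in your save directory are
-- actually called. Whether the repo's training script can pick this up and
-- continue is a separate question.
require 'torch'

local model = torch.load('epoch_54.model')  -- hypothetical checkpoint name
local info  = torch.load('epoch_54.info')   -- matching metadata (epoch, loss, ...)
```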

herrefirh commented 5 years ago

To add to this, I stopped training because it was only getting worse with each additional epoch. The training loss is 0.048 at 54 epochs. That is close to the readme, but again, that's training loss, not validation loss...

{ gpu : true valid_loss : 0.071182658773192 epoch : 54 learningRate : 0.0001 }

Training loss  : 0.047717  min: 0.043629  max: 0.052792  learningRate: 0.000100 
Validation loss: 0.071317  min: 0.062232  max: 0.075300 
Validation progress: -0.189000     Last minimum found: 1 epoch back     
Epoch took: 4137.798152  Timestamp: 09:36 +2h   next time: 10:45        
55 / 1000       
SAVING MODEL    
SAVED   
Training loss  : 0.047573  min: 0.043364  max: 0.053310  learningRate: 0.000100 
Validation loss: 0.071277  min: 0.062033  max: 0.075256 
Validation progress: -0.133000     Last minimum found: 2 epoch back     
Epoch took: 4431.330257  Timestamp: 10:50 +2h   next time: 12:04        
56 / 1000       
SAVING MODEL    
SAVED   
Training loss  : 0.047463  min: 0.043327  max: 0.053187  learningRate: 0.000100 
Validation loss: 0.071387  min: 0.062051  max: 0.075542 
Validation progress: -0.287000     Last minimum found: 3 epoch back     
Epoch took: 4008.790846  Timestamp: 11:57 +2h   next time: 13:03        
57 / 1000       
SAVING MODEL    
SAVED

herrefirh commented 5 years ago

Looks like I'll be retraining the model. Look at the difference between the first epoch of the previous version and the current one.

Old:

Loading Net Builder     
101412 all good files   
Training loss  : 0.330624  min: 0.272472  max: 0.582653  learningRate: 0.001000 
Validation loss: 0.282153  min: 0.271671  max: 0.293777 
Validation progress: 254.417000     Last minimum found: 0 epoch back    
Epoch took: 7873.572798  Timestamp: 22:56 +2h   next time: 01:07        
1 / 1000        
SAVING MODEL    
SAVED 

Current:

Loading Net Builder     
101412 all good files   
Training loss  : 0.192641  min: 0.089994  max: 0.607257  learningRate: 0.001000 
Validation loss: 0.126292  min: 0.097691  max: 0.174276 
Validation progress: 691.813000     Last minimum found: 0 epoch back    
Epoch took: 6216.193645  Timestamp: 15:04 +2h   next time: 16:48        
1 / 500 
SAVING MODEL    
SAVED

herrefirh commented 5 years ago

Also, don't you think params.decrease_learning_at_epoch = 200 should be params.decrease_learning_at_epoch = 50, or no? I see why it's 200, to stay consistent with the research, but I'll let you know what I get with 50 when it starts spinning its wheels.
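For concreteness, a sketch of the schedule this parameter implies (the loop and the params.learning_rate name are assumptions for illustration; only params.decrease_learning_at_epoch and params.epoch_count appear in the snippets above):

```lua
-- Sketch of the learning-rate schedule implied by the parameter above.
-- The loop and params.learning_rate are assumptions for illustration.
params.learning_rate = 0.001
params.decrease_learning_at_epoch = 50  -- proposed; readme/paper use 200

for epoch = 1, params.epoch_count do
  local lr = params.learning_rate
  if epoch > params.decrease_learning_at_epoch then
    lr = lr / 10  -- 0.001 -> 0.0001, matching the paper's schedule
  end
  -- train_one_epoch(lr)  -- hypothetical stand-in for the repo's training step
end
```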

herrefirh commented 5 years ago

I retrained the river with params.decrease_learning_at_epoch = 50 and params.train_batch_size = 1000:

Training loss  : 0.029432  min: 0.018800  max: 0.053388  learningRate: 0.000100 
Validation loss: 0.055863  min: 0.030285  max: 0.085242 
Validation progress: 0.136000     Last minimum found: 0 epoch back      
Epoch took: 4143.692887  Timestamp: 13:01 +2h   next time: 14:10        
59 / 500        
SAVING MODEL    
SAVED  

Much better, but it's strange that in the readme the validation huber loss at epoch 54 was supposedly 0.0415, whereas at epoch 59 I was at 0.0558.

I don't know what would account for that, given that I used the same settings and the same number of poker situations.