HuwCampbell / grenade

Deep Learning in Haskell

Bug: network for XOR doesn't train correctly #66

Closed by CSVdB 6 years ago

CSVdB commented 6 years ago

I'm experiencing a bug and can't trace where it's coming from: my networks aren't actually training; they just seem to change the network parameters randomly. I've simplified my code to the point where all I'm using are the 'randomNetwork' and 'train' functions, yet the bug persists.

Details: you can find the code for the bug at https://github.com/Nickske666/grenade-examples/tree/bug in app/main.hs. This executable depends only on your grenade (master branch, latest version). I'm trying to train a two-layer fully connected NN to approximate XOR. Here is the output before and after 100,000 training iterations:

Before training:
S1D (-0.6153713089038197 :: R 1)
S1D (-0.6227542569188731 :: R 1)
S1D (-0.6152355742354048 :: R 1)
S1D (-0.6286478521211926 :: R 1)
After training:
S1D (0.3985667983943745 :: R 1)
S1D (0.4880564046752094 :: R 1)
S1D (0.5148131666098358 :: R 1)
S1D (0.5420005167827723 :: R 1)

This shows the network's predictions on the vectors [0, 0], [0, 1], [1, 0] and [1, 1], which should be 0, 1, 1, 0. However, as you can see, it's not even close. I turned off the momentum and the regulariser for this, and optimised the learning rate.
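
For reference, the setup presumably looks something like the sketch below (the exact code is in the linked repo; the hidden width of 2 and the final Tanh activation are assumptions based on the discussion that follows):

    type Net
        = Network '[ FullyConnected 2 2, Tanh, FullyConnected 2 1, Tanh ]
                  '[ 'D1 2, 'D1 2, 'D1 2, 'D1 1, 'D1 1 ]

    -- Assumed training loop: fold 'train' over the stream of samples.
    trainLoop :: LearningParameters -> Net -> [(S ('D1 2), S ('D1 1))] -> Net
    trainLoop params = foldl' (\n (i, o) -> train params n i o)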

Is this a bug, or did I make a mistake here?

HuwCampbell commented 6 years ago

The range of tanh is (-1,1), and it's quite unstable around 0.

I would certainly first replace Tanh with a sigmoid activation (which I foolishly called Logit).
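
Concretely, that substitution would change the network type along these lines (a sketch, assuming the two-layer XOR setup above):

    -- Sigmoid (Logit) activations in place of Tanh.
    type Net
        = Network '[ FullyConnected 2 2, Logit, FullyConnected 2 1, Logit ]
                  '[ 'D1 2, 'D1 2, 'D1 2, 'D1 1, 'D1 1 ]

Since Logit maps into (0, 1), the 0/1 targets then lie inside the output's range.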

CSVdB commented 6 years ago

With Logit layers instead of Tanh, and learning rate 5e-4, this is the output.

Before training:
S1D (0.5209612495165322 :: R 1)
S1D (0.5354862396464221 :: R 1)
S1D (0.5147895724365512 :: R 1)
S1D (0.5286280277690646 :: R 1)
After training:
S1D (0.49562255798747007 :: R 1)
S1D (0.5116790380511227 :: R 1)
S1D (0.48814828705706076 :: R 1)
S1D (0.5032045153404426 :: R 1)

Again all the outcomes simply went down, each by approximately the same amount (0.02 to 0.03), instead of the network training properly.

Any other ideas?

HuwCampbell commented 6 years ago

I changed a few things: I used Tanh followed by Logit, turned the regularisation back on, and increased the number of passes.

    {-# LANGUAGE DataKinds #-}

    import Data.List (foldl')
    import Grenade
    import Numeric.LinearAlgebra.Static (vec2)

    type Net
        = Network '[ FullyConnected 2 2, Tanh, FullyConnected 2 1, Logit ]
                  '[ 'D1 2, 'D1 2, 'D1 2, 'D1 1, 'D1 1 ]

    -- XOR truth table (assumed here; the originals live in the linked repo).
    inputs :: [S ('D1 2)]
    inputs = [S1D (vec2 0 0), S1D (vec2 0 1), S1D (vec2 1 0), S1D (vec2 1 1)]

    outputs :: [S ('D1 1)]
    outputs = [S1D 0, S1D 1, S1D 1, S1D 0]

    main :: IO ()
    main = do
        let samples = take 500000 $ cycle $ zip inputs outputs
            -- learning rate, momentum, regulariser
            params = LearningParameters 0.005 1e-8 1e-8
        net <- randomNetwork :: IO Net
        putStrLn "Before training:"
        print $ snd $ runNetwork net $ S1D $ vec2 0 0
        print $ snd $ runNetwork net $ S1D $ vec2 0 1
        print $ snd $ runNetwork net $ S1D $ vec2 1 0
        print $ snd $ runNetwork net $ S1D $ vec2 1 1
        -- Fold the training step over every sample in turn.
        let trained =
                foldl'
                    (\n (inpt, outpt) -> train params n inpt outpt)
                    net
                    samples
        putStrLn "After training:"
        print $ snd $ runNetwork trained $ S1D $ vec2 0 0
        print $ snd $ runNetwork trained $ S1D $ vec2 0 1
        print $ snd $ runNetwork trained $ S1D $ vec2 1 0
        print $ snd $ runNetwork trained $ S1D $ vec2 1 1

Gives

>> :main                                                      
Before training:                                              
S1D (0.3277539706087074 :: R 1)                               
S1D (0.40347581438397084 :: R 1)                              
S1D (0.21913306200165242 :: R 1)                              
S1D (0.26255544780363543 :: R 1)                              
After training:                                               
S1D (2.508754710817976e-2 :: R 1)
S1D (0.9678709914342344 :: R 1)
S1D (0.9677786822228179 :: R 1)
S1D (2.218035165797253e-2 :: R 1)