aikupoker / deeper-stacker

DeeperStacker: DeepHoldem Evil Brother

main_train.lua seems to produce no result or error #15

Closed. airdine closed this issue 4 years ago

airdine commented 4 years ago

Hello,

I saw a similar issue that was closed, but it doesn't help me figure out what's wrong.

After generating the data and converting it, I tried to train the model, but it seems like nothing happened:

$ th Training/main_train.lua 4
Loading Net Builder
166858 all good files
Segmentation fault (core dumped)

Apart from the segmentation fault, I don't get any error or result, and I don't know which log file I could check either. The script seems to stop in train.lua at line 61:

local loss = M.criterion:forward(outputs, targets, mask)

Does anyone know what could be causing this?

Thanks in advance.

aikupoker commented 4 years ago

You don't specify whether you are using the GPU or the CPU.

By default, deeper-stacker uses the GPU. You will need enough GPU memory for training to work.
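
To double-check what the GPU looks like from inside Torch, here is a minimal sketch. It assumes you have cutorch installed (the GPU code path requires it); the script name and the test allocation size are just examples.

-- gpu_check.lua: print free GPU memory and try a small CUDA allocation
require 'torch'
require 'cutorch'

-- free/total are reported in bytes for the given device
local free, total = cutorch.getMemoryUsage(cutorch.getDevice())
print(string.format('GPU %d: %.0f MiB free of %.0f MiB',
                    cutorch.getDevice(), free / 2^20, total / 2^20))

-- a small test allocation; if even this fails, the problem is not the training code
local ok, err = pcall(function() local t = torch.CudaTensor(1024, 1024):fill(1) end)
print('test allocation ok:', ok, err or '')

Run it with th gpu_check.lua before starting the training.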

Could you share the output of the following commands?

$ nvcc --version
$ nvidia-smi
airdine commented 4 years ago

Hey, thanks for your reply !

I was using the GPU and didn't notice it because, as you said, deeper-stacker uses it by default, sorry.

Here is the configuration output:

$ lsb_release -a
LSB Version:    core-9.20170808ubuntu1-noarch:security-9.20170808ubuntu1-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:    18.04
Codename:   bionic
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88
$ nvidia-smi

Sat Apr  4 19:57:07 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:65:00.0  On |                  N/A |
|  0%   51C    P0    28W / 120W |    179MiB /  3016MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1410      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      1446      G   /usr/bin/gnome-shell                           9MiB |
|    0      2605      G   /usr/lib/xorg/Xorg                            71MiB |
|    0      2698      G   /usr/bin/gnome-shell                          76MiB |
+-----------------------------------------------------------------------------+
aikupoker commented 4 years ago

I think your GPU doesn't have enough memory.

$ nvidia-smi -l 1

Launch this in one terminal and, in another terminal, launch the neural network training. Watch whether it fills all of the GPU memory; when that happens, the Lua script fails.
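
If the nvidia-smi polling is too coarse, you can also print the memory state from inside the training loop, right before the call that crashes. This is only a rough sketch, not code from the repo: the variable names (outputs, targets, mask, M.criterion) are taken from the line you quoted, and the exact structure of train.lua may differ.

-- sketch: add just before the criterion call in train.lua (GPU mode, so cutorch is already loaded)
local free, total = cutorch.getMemoryUsage(cutorch.getDevice())
print(string.format('before criterion: %.0f MiB free of %.0f MiB',
                    free / 2^20, total / 2^20))
print('outputs:', outputs:size(), 'targets:', targets:size())

local loss = M.criterion:forward(outputs, targets, mask)

A segmentation fault will still kill the process before any Lua error handler can run, but the last printout tells you how much memory was free right before the crash.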

airdine commented 4 years ago

Thank you,

I'll try that as soon as my data is generated, and I'll post the result.

I'll try with the CPU too and comment with the result if I can.

aikupoker commented 4 years ago

I'd just try without generating more data, and I wouldn't use the CPU. 👍

airdine commented 4 years ago

Hello,

Here is my feedback: I reduced the number of generated data files and ran

$ th Training/main_train.lua 4
Loading Net Builder
103328 all good files
Segmentation fault (core dumped)

It's still the same. While running the training part:

$ nvidia-smi -l 1

output.txt

Memory usage climbs to 981MiB / 3016MiB, then the training part stops and memory usage drops back to 300MiB (gdm3 usage).

Do you still think it's an out-of-memory issue?

Thank you for your interest.

aikupoker commented 4 years ago

Did you check this issue? Maybe it is related to your problem: https://github.com/happypepper/DeepHoldem/issues/8

airdine commented 4 years ago

Hello, thanks for the link.

I didn't find anything that helps with my issue, apart from doing a fresh OS install, which I haven't tried yet.

After looking at this comment: https://github.com/happypepper/DeepHoldem/issues/8#issuecomment-466355094

Which OS do you use to run this?

Thank you for your advice.

aikupoker commented 4 years ago

Try Ubuntu 16.04, or an nvidia-docker image based on 16.04.

airdine commented 4 years ago

Hey,

A fresh 16.04 install fixed it; I really don't know what was wrong with my 18.04.

Thank you very much, I can close this issue.