Training process killed

yanshengjia commented 5 years ago

I tried to train the transformer model on my own parallel corpus (about 250 MB).

But after the graph is constructed, the process is killed before the session starts.

Graph loaded
WARNING:tensorflow:From train.py:171: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-11-27 12:32:22.021904: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-27 12:32:22.279206: I tensorflow/compiler/xla/service/service.cc:149] XLA service 0x5607d324dc90 executing computations on platform CUDA. Devices:
2018-11-27 12:32:22.279319: I tensorflow/compiler/xla/service/service.cc:157]   StreamExecutor device (0): Tesla P100-PCIE-12GB, Compute Capability 6.0
2018-11-27 12:32:22.286826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:04:00.0
totalMemory: 11.91GiB freeMemory: 10.98GiB
2018-11-27 12:32:22.286958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2018-11-27 12:32:22.288905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-27 12:32:22.288978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2018-11-27 12:32:22.289007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2018-11-27 12:32:22.289527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10682 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:04:00.0, compute capability: 6.0)
Killed

Any ideas?
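
A bare `Killed` with no Python traceback is often the Linux out-of-memory killer ending the process because host RAM (not GPU memory) ran out. That is an assumption here; the log above does not confirm it. A minimal sketch for checking it, using only the standard library: log peak resident memory at a few points before the session is created. The helper names `peak_rss_mb` and `log_peak` and the stage labels are hypothetical and not part of train.py.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024


def log_peak(stage):
    """Print and return peak host memory so far, tagged with a stage label."""
    mb = peak_rss_mb()
    print(f"[{stage}] peak host memory: {mb:.0f} MiB")
    return mb


if __name__ == "__main__":
    log_peak("start")
    data = [0] * 5_000_000  # stand-in for loading the corpus
    log_peak("after data load")
```

Calling `log_peak` after loading the corpus and again after building the graph would show which step grows fastest. If the numbers approach the machine's RAM, host memory is the likely cause.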

ccnankai commented 5 years ago

Hi @yanshengjia,
have you solved this problem?

angyee commented 5 years ago

Same problem here. Has anyone solved it?

@yanshengjia @ccnankai @kimdwkimdw @maximedb @Kyubyong

maximedb commented 5 years ago

Did you try reducing the model size?
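
To judge how much reducing the model size could help, here is a rough sketch that estimates the parameter count of a generic encoder-decoder transformer and the memory its training state needs. It assumes separate source and target embedding tables, a separate output projection, the same vocabulary size on both sides, and an Adam-style optimizer (weights, gradients, and two moment buffers). Biases and layer normalization are ignored. The function and argument names are hypothetical and do not come from this repository, so treat the result as an order-of-magnitude guide only.

```python
def estimate_transformer_params(vocab_size, hidden_units, num_blocks, ffn_units=None):
    """Approximate weight count of an encoder-decoder transformer."""
    if ffn_units is None:
        ffn_units = 4 * hidden_units  # common default, an assumption here
    d = hidden_units
    attention = 4 * d * d            # Q, K, V and output projections
    ffn = 2 * d * ffn_units          # two dense layers
    encoder = num_blocks * (attention + ffn)
    decoder = num_blocks * (2 * attention + ffn)  # self- and cross-attention
    embeddings = 2 * vocab_size * d  # source and target tables
    output_proj = vocab_size * d     # final projection to the vocabulary
    return embeddings + encoder + decoder + output_proj


def estimate_training_mb(num_params, bytes_per_param=4, copies=4):
    """Memory in MiB for weights, gradients and two Adam moment buffers."""
    return num_params * bytes_per_param * copies / (1024 ** 2)


if __name__ == "__main__":
    for vocab in (30_000, 300_000):
        n = estimate_transformer_params(vocab, hidden_units=512, num_blocks=6)
        print(f"vocab={vocab}: {n:,} params, ~{estimate_training_mb(n):.0f} MiB")
```

With a large corpus the vocabulary term tends to dominate, so raising the minimum word count or lowering the hidden size shrinks the estimate fastest. Activations and the in-memory dataset are not counted here.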

Kyubyong / transformer

Training process killed #57