amazon-archives / amazon-dsstne

Deep Scalable Sparse Tensor Network Engine (DSSTNE) is an Amazon-developed library for building Deep Learning (DL) machine learning (ML) models.
Apache License 2.0

Test dsstne module via python fails #236

Open spacelover1 opened 4 years ago

spacelover1 commented 4 years ago

Hi, I'm trying to test the dsstne module from Python, following this document. I also lowered alpha, but I still get the cudaMalloc out-of-memory error. Any suggestions? Here's the complete error log:

NNLayer::Allocate: Allocating 3538944 bytes (864, 1024) of delta data for layer P3
NNLayer::Deallocate: Deallocating all data for layer Hidden10
NNLayer::Allocate: Allocating 524288 bytes (128, 1024) of unit data for layer Hidden10
NNLayer::Allocate: Allocating 524288 bytes (128, 1024) of delta data for layer Hidden10
NNLayer::Allocate: Allocating 524288 bytes (128, 1024) of dropout data for layer Hidden10
NNLayer::Deallocate: Deallocating all data for layer Output
NNLayer::Allocate: Allocating 40960 bytes (10, 1024) of unit data for layer Output
NNLayer::Allocate: Allocating 40960 bytes (10, 1024) of delta data for layer Output
NNDataSet<T>::Shard: Model Sharding sparse dataset output across all GPUs.
Getting algorithm between Input and C1
Getting algorithm between C1 and C1a
Getting algorithm between P1 and C2
Getting algorithm between C2 and C2a
Getting algorithm between P2 and C3
Getting algorithm between C3 and C3a
NNNetwork::RefreshState: Setting cuDNN workspace size to 4442259456 bytes.
GpuBuffer::Allocate failed (cudaMalloc) out of memory
python: GpuTypes.h:522: void GpuBuffer<T>::Allocate() [with T = unsigned char]: Assertion `0' failed.
[e501c14cf80e:19359] *** Process received signal ***
[e501c14cf80e:19359] Signal: Aborted (6)
[e501c14cf80e:19359] Signal code:  (-6)
[e501c14cf80e:19359] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fbe10a30390]
[e501c14cf80e:19359] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7fbe1068a428]
[e501c14cf80e:19359] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7fbe1068c02a]
[e501c14cf80e:19359] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2dbd7)[0x7fbe10682bd7]
[e501c14cf80e:19359] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2dc82)[0x7fbe10682c82]
[e501c14cf80e:19359] [ 5] /usr/local/lib/python2.7/dist-packages/dsstne.so(_ZN9NNNetwork12RefreshStateEv+0x4a0)[0x7fbdffeebe80]
[e501c14cf80e:19359] [ 6] /usr/local/lib/python2.7/dist-packages/dsstne.so(_ZN9NNNetwork5TrainEjfffff+0x69)[0x7fbdffef5289]
[e501c14cf80e:19359] [ 7] /usr/local/lib/python2.7/dist-packages/dsstne.so(_ZN18NNNetworkFunctions5TrainEP7_objectS1_+0xfb)[0x7fbdffe3cccb]
[e501c14cf80e:19359] [ 8] python(PyEval_EvalFrameEx+0x5ca)[0x4bc9ba]
[e501c14cf80e:19359] [ 9] python(PyEval_EvalCodeEx+0x306)[0x4ba036]
[e501c14cf80e:19359] [10] python[0x4eb32f]
[e501c14cf80e:19359] [11] python(PyRun_FileExFlags+0x82)[0x4e5592]
[e501c14cf80e:19359] [12] python(PyRun_SimpleFileExFlags+0x186)[0x4e3e46]
[e501c14cf80e:19359] [13] python(Py_Main+0x54e)[0x493ade]
[e501c14cf80e:19359] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fbe10675830]
[e501c14cf80e:19359] [15] python(_start+0x29)[0x4934a9]
[e501c14cf80e:19359] *** End of error message ***
Aborted (core dumped)
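
As a rough sanity check on the sizes in the log above, the following sketch recomputes each reported layer allocation and converts the cuDNN workspace request to GiB. The 4 bytes per element is an assumption (it is what makes the logged dimensions match the logged byte counts, which suggests 32-bit floats); `buffer_bytes` and the other names are hypothetical helpers, not part of DSSTNE.

```python
# Sizes are copied from the error log. BYTES_PER_ELEMENT = 4 is an
# assumption (32-bit floats); it is not stated anywhere in the log.
BYTES_PER_ELEMENT = 4


def buffer_bytes(dim0, dim1, bytes_per_element=BYTES_PER_ELEMENT):
    """Bytes needed for a dense dim0 x dim1 buffer."""
    return dim0 * dim1 * bytes_per_element


# (dimensions, bytes reported by NNLayer::Allocate)
logged = {
    "P3 delta": ((864, 1024), 3538944),
    "Hidden10 unit": ((128, 1024), 524288),
    "Hidden10 delta": ((128, 1024), 524288),
    "Hidden10 dropout": ((128, 1024), 524288),
    "Output unit": ((10, 1024), 40960),
    "Output delta": ((10, 1024), 40960),
}

for name, (dims, reported) in logged.items():
    computed = buffer_bytes(*dims)
    status = "ok" if computed == reported else "MISMATCH"
    print("%-18s %8d bytes (%s)" % (name, computed, status))

# Workspace size reported by NNNetwork::RefreshState just before the failure.
CUDNN_WORKSPACE_BYTES = 4442259456
workspace_gib = CUDNN_WORKSPACE_BYTES / float(2 ** 30)
print("cuDNN workspace: %.2f GiB" % workspace_gib)  # prints 4.14 GiB
```

The layer buffers listed here add up to only a few MiB, while the workspace request is about 4.14 GiB in a single allocation. Since the cudaMalloc failure is logged immediately after that request, the workspace looks like the allocation that does not fit, although the log alone does not show how much GPU memory was free at that point.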