amazon-archives / amazon-dsstne

Deep Scalable Sparse Tensor Network Engine (DSSTNE) is an Amazon developed library for building Deep Learning (DL) machine learning (ML) models

Segmentation fault #62

Closed oyotong closed 8 years ago

oyotong commented 8 years ago

Hi,

I completed training and then ran prediction, but got the exception below. Do you have any suggestions?

BTW, only one neural network hits this error; the others are OK.

=========== exception messages =========
Exported gl_input_predict.samplesIndex with 65075 entries.
Raw max index is: 65064
Rounded up max index to: 65152
Created NetCDF file gl_input_predict.nc for dataset gl_input
Number of network input nodes: 65064
Number of entries to generate predictions for: 65075
LoadNetCDF: Loading UInt data set
NNDataSet::NNDataSet: Name of data set: gl_input
NNDataSet::NNDataSet: Attributes: Sparse Boolean
NNDataSet::NNDataSet: 1-dimensional data comprised of (65152, 1, 1) datapoints.
NNDataSet::NNDataSet: 3778407 total datapoints.
NNDataSet::NNDataSet: 65075 examples.
[snx-dsstne:02608] *** Process received signal ***
[snx-dsstne:02608] Signal: Segmentation fault (11)
[snx-dsstne:02608] Signal code: Address not mapped (1)
[snx-dsstne:02608] Failing at address: 0xb3a1840
[snx-dsstne:02608] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f02cc834330]
[snx-dsstne:02608] [ 1] predict[0x430d26]
[snx-dsstne:02608] [ 2] predict[0x453fa0]
[snx-dsstne:02608] [ 3] predict[0x42a87b]
[snx-dsstne:02608] [ 4] predict[0x408307]
[snx-dsstne:02608] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f02cc480f45]
[snx-dsstne:02608] [ 6] predict[0x40aab1]
[snx-dsstne:02608] *** End of error message ***
Segmentation fault (core dumped)

tristanpenman commented 8 years ago

A quick glance at the code for the predict application suggests a couple of possible causes - we'll need to narrow this down.

We can dig into this further by rebuilding DSSTNE with the DEBUG flag enabled - the flag can be found in Makefile.inc under /src/amazon/dsstne, near the beginning of that file. If you can reproduce the issue on a debug build, the seg fault output should contain line numbers that will help narrow down the potential causes.

Be sure to run make clean before running make again.

Any other information you can provide (e.g. OS/distro, GPU used) would also be helpful.

oyotong commented 8 years ago

I enabled the debug flag as shown below, but I could not get more detailed debug information.

Env info:
OS: Ubuntu 14.04
CUDA: release 7.5, V7.5.17, NVIDIA-SMI 352.39
GPU: GeForce GTX 970

===== Makefile.inc [start] =====
....
CPPFLAGS = -traditional -P -std=c++0x -DMEMTRACKING -gdwarf-3
....
DEBUG = 1
ifeq ($(DEBUG), 1)
$(info ** DEBUG mode **)
CFLAGS = -DOMPI_SKIP_MPICXX -std=c++0x -g -O0 -DMEMTRACKING -gdwarf-3
else
....
===== Makefile.inc [end] =====

===== Make Info [start] =====
** DEBUG mode **
make[1]: Entering directory `/home/dsstne/amazon-dsstne/src/amazon/dsstne/utils'
===== Make Info [end] =====

===== Exception Message [start] =====
GpuContext::Startup: Process 0 out of 1 initialized.
Allocating 8 bytes of GPU memory
Mem++: 8 8
GpuContext::Startup: Single node flag on GPU for process 0 is 1
GpuContext::Startup: P2P support flags on GPU for process 0 are 1 1
GpuContext::Startup: GPU for process 0 initialized.
GpuContext::SetRandomSeed: Random seed set to 12134.
Loaded input feature index with 65064 entries.
Indexing 1 files
Indexing file: dss_sku_sku
Progress Parsing10000Time 1.0682
Progress Parsing20000Time 1.0648
Progress Parsing30000Time 0.959654
Progress Parsing40000Time 0.987968
Progress Parsing50000Time 0.783489
Progress Parsing60000Time 0.800305
Exported gl_input_predict.samplesIndex with 65075 entries.
Raw max index is: 65064
Rounded up max index to: 65152
Created NetCDF file gl_input_predict.nc for dataset gl_input
Number of network input nodes: 65064
Number of entries to generate predictions for: 65075
LoadNetCDF: Loading UInt data set
NNDataSet::NNDataSet: Name of data set: gl_input
NNDataSet::NNDataSet: Attributes: Sparse Boolean
NNDataSet::NNDataSet: 1-dimensional data comprised of (65152, 1, 1) datapoints.
NNDataSet::NNDataSet: 3778407 total datapoints.
NNDataSet::NNDataSet: 65075 examples.
[snx-dsstne:04470] *** Process received signal ***
[snx-dsstne:04470] Signal: Segmentation fault (11)
[snx-dsstne:04470] Signal code: Address not mapped (1)
[snx-dsstne:04470] Failing at address: 0xc5f77f0
[snx-dsstne:04470] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7fd1a2766330]
[snx-dsstne:04470] [ 1] predict[0x447eb7]
[snx-dsstne:04470] [ 2] predict[0x43714c]
[snx-dsstne:04470] [ 3] predict[0x431088]
[snx-dsstne:04470] [ 4] predict[0x42e1f8]
[snx-dsstne:04470] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fd1a23b2f45]
[snx-dsstne:04470] [ 6] predict[0x407d31]
[snx-dsstne:04470] *** End of error message ***
Segmentation fault (core dumped)
===== Exception Message [end] =====

scottlegrand commented 8 years ago

Never mind what I wrote, could you run this from gdb?

It looks to me like the dataset has somehow been corrupted.

oyotong commented 8 years ago

I ran it from gdb and got the info below:

Starting program: /home/dsstne/amazon-dsstne/src/amazon/dsstne/bin/predict -b 256 -d gl -i features_input -o features_output -k 10 -n gl.nc -f dss_sku_sku -s recs -r dss_sku_sku
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffee132700 (LWP 4552)]
GpuContext::Startup: Process 0 out of 1 initialized.
[New Thread 0x7fffe598a700 (LWP 4553)]
[New Thread 0x7fffdcfff700 (LWP 4554)]
Allocating 8 bytes of GPU memory
Mem++: 8 8
GpuContext::Startup: Single node flag on GPU for process 0 is 1
GpuContext::Startup: P2P support flags on GPU for process 0 are 1 1
GpuContext::Startup: GPU for process 0 initialized.
GpuContext::SetRandomSeed: Random seed set to 12134.
Loaded input feature index with 65064 entries.
Indexing 1 files
Indexing file: dss_sku_sku
Progress Parsing10000Time 1.07443
Progress Parsing20000Time 1.07139
Progress Parsing30000Time 0.968824
Progress Parsing40000Time 0.994079
Progress Parsing50000Time 0.787785
Progress Parsing60000Time 0.80526
Exported gl_input_predict.samplesIndex with 65075 entries.
Raw max index is: 65064
Rounded up max index to: 65152
Created NetCDF file gl_input_predict.nc for dataset gl_input
Number of network input nodes: 65064
Number of entries to generate predictions for: 65075
LoadNetCDF: Loading UInt data set
NNDataSet::NNDataSet: Name of data set: gl_input
NNDataSet::NNDataSet: Attributes: Sparse Boolean
NNDataSet::NNDataSet: 1-dimensional data comprised of (65152, 1, 1) datapoints.
NNDataSet::NNDataSet: 3778407 total datapoints.
NNDataSet::NNDataSet: 65075 examples.

Program received signal SIGSEGV, Segmentation fault.
0x0000000000447eb7 in NNDataSet::CalculateSparseDatapointCounts (this=0x8b69a40) at NNTypes.cpp:868
868             _vSparseDatapointCount[x]++;

scottlegrand commented 8 years ago

Awesome, so looking at that section:

// Calculate individual counts for each datapoint
uint64_t N = _width * _height * _length;
_vSparseDatapointCount.resize(N);
std::fill(_vSparseDatapointCount.begin(), _vSparseDatapointCount.end(), 0);
for (auto x : _vSparseIndex)
{
    _vSparseDatapointCount[x]++;
}

You have a sparse index that is out of range. Can you check that all of the indices in

vector _vSparseIndex

are < 65152? Because I'm betting that they're not... Or in this case, just test x.
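
A minimal standalone sketch of that check, assuming _vSparseIndex is a std::vector of unsigned 32-bit indices (the exact element type in NNTypes.h may differ):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Count sparse indices at or above the padded input width
// (65152 here, the rounded-up max index reported in the log).
void CheckSparseIndices(const std::vector<uint32_t>& vSparseIndex, uint32_t width)
{
    auto bad = std::count_if(vSparseIndex.begin(), vSparseIndex.end(),
                             [width](uint32_t x) { return x >= width; });
    if (bad > 0)
        std::cerr << bad << " sparse indices are >= " << width << std::endl;
    else
        std::cout << "All sparse indices are in range." << std::endl;
}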

oyotong commented 8 years ago

Is this a bug? How can I fix or work around it?

rgeorgej commented 8 years ago

Can you send us the steps you followed, along with some sample data?

scottlegrand commented 8 years ago

Yes, the dataset appears to be corrupted with out-of-range indices. How exactly was the dataset generated?

Also, we should add guard code to detect this situation, but the dataset itself will still need to be fixed.
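
A sketch of what that guard might look like, built on the loop quoted above; the reporting style here is illustrative, not DSSTNE's actual error-handling convention:

// Calculate individual counts for each datapoint, but skip and report
// any index that would write past the end of the count vector.
uint64_t N = _width * _height * _length;
_vSparseDatapointCount.resize(N);
std::fill(_vSparseDatapointCount.begin(), _vSparseDatapointCount.end(), 0);
for (auto x : _vSparseIndex)
{
    if (x >= N)
    {
        std::cerr << "NNDataSet::CalculateSparseDatapointCounts: sparse index "
                  << x << " is out of range (N = " << N << ")" << std::endl;
        continue; // or abort here, depending on the desired policy
    }
    _vSparseDatapointCount[x]++;
}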

tristanpenman commented 8 years ago

I'm going to jump in here and suggest that once we understand how the data was generated, this could serve as the basis for some good unit tests.

oyotong commented 8 years ago

You can get the dataset here (could you help test it?):

https://s3.amazonaws.com/andy.tang.test/dataset.zip

generateNetCDF -d gl_input -i dss_sku_sku -o gl_input.nc -f features_input -s samples_input -c
generateNetCDF -d gl_output -i dss_sku_sku -o gl_output.nc -f features_output -s samples_input -c
train -c config.json -i gl_input.nc -o gl_output.nc -n gl.nc -b 256 -e 10
predict -b 256 -d gl -i features_input -o features_output -k 10 -n gl.nc -f dss_sku_sku -s recs -r dss_sku_sku

scottlegrand commented 8 years ago

Interesting, I get a different-sized dataset.

./generateNetCDF -d gl_input -i dss_sku_sku -o gl_input.nc -f features_input -s samples_input -c
Flag -c is set. Will create a new feature file and overwrite: features_input
Generating dataset of type: indicator
Will create a new samples index file: samples_input
Will create a new features index file: features_input
Indexing 1 files
Indexing file: dss_sku_sku
Progress Parsing10000Time 0.827208
Progress Parsing20000Time 0.749772
Progress Parsing30000Time 0.670679
Progress Parsing40000Time 0.685743
Progress Parsing50000Time 0.54209
Progress Parsing60000Time 0.556289
Exported features_input with 65217 entries.
Exported samples_input with 65075 entries.
Raw max index is: 65217
Rounded up max index to: 65280
Created NetCDF file gl_input.nc for dataset gl_input
Total time for generating NetCDF: 4.54689 secs.

Can you pull ToT (tip of tree), rebuild, and try again?

tristanpenman commented 8 years ago

I have been able to reproduce this issue on the DSSTNE AMI running on a g2.2xlarge EC2 instance, with the dataset provided. What I found is that while the predict utility is correctly loading all 65075 lines of the feature_input file, some of those lines contain duplicate IDs.

Line 64175, for example, is malformed. With hidden/control characters enabled in vi, you can see the formatting error (a second tab character):

4549498^I4549491^I4549528,10.0:4549526,10.0:4549498,10.0:4549501,10.0$

Both the generateNetCDF and predict applications should be able to detect this kind of error, and I will raise a separate issue to track that work. In the meantime, this should help you fix the dataset itself.
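
As a stopgap until that detection lands, a small standalone scan can flag malformed lines before they reach generateNetCDF. A minimal sketch, assuming (as the example line above suggests) that a well-formed line contains exactly one tab between the sample ID and its feature list:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>

// Report every line that does not contain exactly one tab separator.
int main(int argc, char** argv)
{
    if (argc < 2)
    {
        std::cerr << "usage: " << argv[0] << " <datafile>" << std::endl;
        return 1;
    }
    std::ifstream in(argv[1]);
    std::string line;
    for (size_t lineNo = 1; std::getline(in, line); ++lineNo)
    {
        auto tabs = std::count(line.begin(), line.end(), '\t');
        if (tabs != 1)
            std::cout << "line " << lineNo << ": " << tabs << " tab(s)" << std::endl;
    }
    return 0;
}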

oyotong commented 8 years ago

Thank you for your help!!

I fixed the malformed data. It works fine now.