GPUSPH on Nvidia Jetson TX2 (Tegra X2)

Currently I do not have an Nvidia GPU installed in my PC, nor do I have the capacity for one. I do, however, have an Nvidia Jetson TX2, which is an embedded platform hosting a CUDA-enabled GPU (compute capability 6.2). On the Jetson, I have CUDA 9.0 installed via JetPack 3.2.1 (see Nvidia's website on embedded systems for more info). Based on the dependencies for running GPUSPH, I believe I meet the necessary requirements. Here is the device info reported by the deviceQuery sample:

~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.2
  Total amount of global memory:                 7846 MBytes (8227401728 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP:     256 CUDA Cores
  GPU Max Clock rate:                            1301 MHz (1.30 GHz)
  Memory Clock rate:                             1600 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

I realize this may be the first you've heard of someone trying to run GPUSPH on this type of system, and I didn't expect it to run out of the box. Reading through the Makefile supports this assumption, as I don't see anything relating to embedded platforms. That being said, I took the initiative to adjust the Makefile in the hope that I could compile and run the software.

After carefully reading through the Makefile and comparing the results of its "shell" commands against my system, I was able to confirm that all the necessary "includes" and "libs" were found. The only adjustment I had to make was here:

# override: TARGET_ARCH - set the target architecture
# override:               defaults to -m64 for 64-bit machines
# override:                           -m32 for 32-bit machines
ifeq ($(arch), x86_64)
    TARGET_ARCH ?= -m64
    # on Linux, toolkit libraries are under /lib64 for 64-bit
    ifeq ($(platform), Linux)
        LIB_PATH_SFX = 64
    endif
else
    ifeq ($(arch), aarch64)
        # Had to comment this out
        #TARGET_ARCH ?= -m64
        ifeq ($(platform), Linux)
            LIB_PATH_SFX = 64
        endif
    else
        # i386 or i686
        TARGET_ARCH ?= -m32
    endif
endif

I made this adjustment because the machine-dependent option "-m64" does not exist for AArch64, so appending it on the line below causes compile errors:

CXXFLAGS += $(TARGET_ARCH)

As a result, no machine-dependent option is passed in CXXFLAGS. I did, however, add the "-m64" option at the beginning of the nvcc-specific flags:

CUFLAGS += -m64

Following those adjustments, the code compiles successfully after running

make

Note that I'm following the "default" options for GPUSPH (i.e. dam break problem with defaults) just to see if I can get the software to run. When I run the executable (./GPUSPH), here is the output I receive

 * No devices specified, falling back to default (dev 0)...
GPUSPH version v4.1+custom
Release version without fastmath for compute capability 6.2
Chrono : disabled
HDF5   : disabled
MPI    : disabled
Compiled for problem "DamBreak3D"
[Network] rank 0 (1/1), host 
 tot devs = 1 (1 * 1)
WARNING: setting number of layers for dynamic boundaries but not using DYN_BOUNDARY!
WARNING: number of layers for dynamic boundaries is low (3), suggested number is 4
Info stream: GPUSPH-24065
Initializing...
Water level not set, autocomputed: 0.4
Max fall height not set, autocomputed: 0.41
Max particle speed not set, autocomputed from max fall: 2.83623
setting dt = 0.00039 from CFL conditions (soundspeed: 0.00039, gravity: 0.0154445, viscosity: nan)
Using problem-set max neibs num 192 (safe computed value was 128)
Ferrari coefficient: 0.000000e+00 (default value, disabled)
Problem calling set grid params
Influence radius / neighbor search radius / expected cell side  : 0.052 / 0.052 / 0.052
 - World origin: 0 , 0 , 0
 - World size:   1.6 x 0.67 x 0.6
 - Cell size:    0.0533333 x 0.0558333 x 0.0545455
 - Grid size:    30 x 12 x 11 (3,960 cells)
 - Cell linearizazion: y,z,x
 - Dp:   0.02
 - R0:   0.02
Generating problem particles...
VTKWriter will write every 0.005 (simulated) seconds
HotStart checkpoints every 0.005 (simulated) seconds
    will keep the last 8 checkpoints
v4.1+custom
Allocating shared host buffers...
Numbodies : 1
Numforcesbodies : 1
numOpenBoundaries : 0
  allocated 1009.6 KiB on host for 13,601 particles (13,601 active)
Copying the particles to shared arrays...
---
Rigid body 1: 798 parts, mass nan, object mass 0
Open boundaries: 0
Fluid: 12800 parts, mass 0.008125
Boundary: 0 parts, mass 0
Testpoint: 3 parts
Tot: 13601 particles
---
RB First/Last Index:
    -12803  797
Preparing the problem...
Body: 0
     Cg grid pos: 17 6 5
     Cg pos: -0.00848052 -0.0279167 3.46945e-18
 - device at index 0 has 13,601 particles assigned and offset 0
Starting workers...
Thread 0x7faf074000 global device id: 0 (1)
thread 0x7faea9c1e0 device idx 0: CUDA device 0/1, PCI device 0000:00:00.0: NVIDIA Tegra X2
Device idx 0: free memory 648 MiB, total memory 7846 MiB
Estimated memory consumption: 508B/particle
number of forces rigid bodies particles = 798
Device idx 0 (CUDA: 0) allocated 0 B on host, 6.54 MiB on device
  assigned particles: 13,601; allocated: 13,601
GPUSPH: initialized
Performing first write...
Letting threads upload the subdomains...
Thread 0 uploading 13601 Position items (212.52 KiB) on device 0 from position 0
Thread 0 uploading 13601 Velocity items (212.52 KiB) on device 0 from position 0
Thread 0 uploading 13601 Info items (106.26 KiB) on device 0 from position 0
Thread 0 uploading 13601 Hash items (53.13 KiB) on device 0 from position 0
Entering the main simulation cycle
Simulation time t=0.000000e+00s, iteration=0, dt=3.900000e-04s, 13,601 parts (0, cum. 0 MIPPS), maxneibs 0
Device 0 thread 548391207392 iteration 0 last command: 7. Exception: src/cuda/forces.cu(516) : in unbind_textures() @ thread 0x548391207392 : cudaSafeCall() runtime API error 4 : unspecified launch failure
GPUSPH aborted by worker thread
Elapsed time of simulation cycle: 2.6s
Peak particle speed was ~0 m/s at 0 s -> can set maximum vel 0 for this problem
Simulation end, cleaning up...
src/GPUWorker.cc(1018) : in deallocateDeviceBuffers() @ thread 0x548391207392 : cudaSafeCall() runtime API error 4 : unspecified launch failure
Deallocating...

To get a better idea of what caused the failure, I ran

cuda-memcheck ./GPUSPH

Note that running this doesn't require any special compile options. Here is the output beginning with "Entering the main simulation cycle" (everything before that line matches the output presented above):

:
:
:
Entering the main simulation cycle
Simulation time t=0.000000e+00s, iteration=0, dt=3.900000e-04s, 13,601 parts (0, cum. 0 MIPPS), maxneibs 0
========= Invalid __global__ read of size 16
=========     at 0x00000930 in /home/nvidia/GPUSPH/gpusph/src/cuda/forces_kernel.def:2359:void cuforces::forcesDevice<KernelType=3, SPHFormulation=1, BoundaryType=0, ViscosityType=1, unsigned long=261>(forces_params<KernelType=3, SPHFormulation=1, BoundaryType=0, ViscosityType=1, unsigned long=261>)
=========     by thread (71,0,0) in block (15,0,0)
=========     Address 0xfc0e152c0 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 (cuLaunchKernel + 0x1e8) [0x1fe770]
=========     Host Frame:/usr/local/cuda-9.0/lib64/libcudart.so.9.0 [0xc984]
=========
:
:
:
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2e69e8]
=========     Host Frame:/usr/local/cuda-9.0/lib64/libcudart.so.9.0 (cudaDeviceSynchronize + 0x118) [0x2dec4]
=========
Device 0 thread 548319310304 iteration 0 last command: 7. Exception: src/cuda/forces.cu(516) : in unbind_textures() @ thread 0x548319310304 : cudaSafeCall() runtime API error 4 : unspecified launch failure
GPUSPH aborted by worker thread
Elapsed time of simulation cycle: 0.5s
Peak particle speed was ~0 m/s at 0 s -> can set maximum vel 0 for this problem
Simulation end, cleaning up...
========= Program hit cudaErrorInvalidResourceHandle (error 33) due to "invalid resource handle" on CUDA API call to cudaStreamDestroy. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2e69e8]
=========     Host Frame:/usr/local/cuda-9.0/lib64/libcudart.so.9.0 (cudaStreamDestroy + 0x134) [0x31aa4]
=========
:
:
:
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaFree. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2e69e8]
=========     Host Frame:/usr/local/cuda-9.0/lib64/libcudart.so.9.0 (cudaFree + 0x12c) [0x34d10]
=========
:
:
:
src/GPUWorker.cc(1018) : in deallocateDeviceBuffers() @ thread 0x548319310304 : cudaSafeCall() runtime API error 4 : unspecified launch failure
Deallocating...
========= ERROR SUMMARY: 41 errors

Note that I've shown only the unique errors and removed the repeats for the sake of presentation. I started with the first error, located "at 0x00000930 in /home/nvidia/GPUSPH/gpusph/src/cuda/forces_kernel.def:2359". Here is the code (in "forces_kernel.def") that the error refers to:

            #if PREFER_L1
            const float4 relPos = pos_corr - params.posArray[neib_index];  //   <-------- HERE
            #else
            const float4 relPos = pos_corr - tex1Dfetch(posTex, neib_index);
            #endif
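The report describes a 16-byte read (the size of a float4) that went out of bounds, which suggests that either neib_index or the posArray pointer was invalid at that point. One way to tell the two apart would be to copy the neighbor indices back to the host and range-check them. The following is only a minimal sketch of that idea: the function name and the flat list of indices are assumptions made for illustration, not GPUSPH's actual neighbor list layout.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side check, not part of GPUSPH.
// Returns the position in `neib_indices` of the first entry that would index
// past `num_particles` elements, or neib_indices.size() if all are in range.
std::size_t first_out_of_range(const std::vector<unsigned int>& neib_indices,
                               std::size_t num_particles)
{
    for (std::size_t i = 0; i < neib_indices.size(); ++i) {
        if (neib_indices[i] >= num_particles)
            return i;  // this index would read beyond the position array
    }
    return neib_indices.size();  // every index is within bounds
}
```

If every index turned out to be in range for the 13,601 particles in this run, the pointer itself would be the more likely suspect.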

Since the failing line in "forces_kernel.def" is wrapped in a preprocessor conditional (#if PREFER_L1), I decided to try the alternative branch, which required changing the definition in the "textures.cuh" source file to

#if defined(__COMPUTE__)
#if __COMPUTE__ >= 20 && __COMPUTE__/10 != 3
#define PREFER_L1 0
#else
#define PREFER_L1 0
#endif
#endif
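For reference, here is a minimal host-side sketch of how a conditional of this shape evaluates for a few compute capabilities. It rests on two assumptions that are not shown above: that __COMPUTE__ encodes the capability as major * 10 + minor (so 62 for the TX2), and that the unmodified header defined PREFER_L1 as 1 in the first branch. The function name is made up for illustration.

```cpp
// Hypothetical stand-in for the preprocessor test in textures.cuh.
// Assumption: compute = major * 10 + minor (e.g. 62 for compute capability 6.2).
// Assumption: the unmodified header set PREFER_L1 to 1 when this test passed.
constexpr bool prefer_l1_unmodified(int compute)
{
    return compute >= 20 && compute / 10 != 3;
}

// Compute capability 6.2 (Tegra X2) passes the test: the posArray path is used.
static_assert(prefer_l1_unmodified(62), "6.2 selects the posArray path");
// Compute capability 3.x is excluded: the tex1Dfetch path is used.
static_assert(!prefer_l1_unmodified(35), "3.5 selects the tex1Dfetch path");
// Anything below 2.0 also falls through to the tex1Dfetch path.
static_assert(!prefer_l1_unmodified(13), "1.3 selects the tex1Dfetch path");
```

Under those assumptions, a stock build for compute capability 6.2 would take the params.posArray branch, which is the line cuda-memcheck flagged.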

In other words, I hard-coded it so that "PREFER_L1" always evaluates to false. I read the comments in the code about the L1 cache vs. shared memory, and noticed that the source file "cudautil.cu" also has a cache preference setting. I changed this as well to

        // Hard code this to use "shared" for 6.x compute capability
        if (deviceProp.major == 3)
        {
            cacheConfig = cudaFuncCachePreferShared;
        }
        else if (deviceProp.major == 6)
        {
            cacheConfig = cudaFuncCachePreferShared;
        }

In effect, I'm testing the code with a shared-memory preference instead of an L1 preference. I ran "make clean", then recompiled via "make", and everything compiled as before. Running the code now (via ./GPUSPH) no longer produces the errors I was seeing before. Unfortunately, the simulation now blows up with the following output:

 * No devices specified, falling back to default (dev 0)...
GPUSPH version v4.1+custom
Release version without fastmath for compute capability 6.2
Chrono : disabled
HDF5   : disabled
MPI    : disabled
Compiled for problem "DamBreak3D"
[Network] rank 0 (1/1), host 
 tot devs = 1 (1 * 1)
WARNING: setting number of layers for dynamic boundaries but not using DYN_BOUNDARY!
WARNING: number of layers for dynamic boundaries is low (3), suggested number is 4
Info stream: GPUSPH-25331
Initializing...
Water level not set, autocomputed: 0.4
Max fall height not set, autocomputed: 0.41
Max particle speed not set, autocomputed from max fall: 2.83623
setting dt = 0.00039 from CFL conditions (soundspeed: 0.00039, gravity: 0.0154445, viscosity: nan)
Using problem-set max neibs num 192 (safe computed value was 128)
Ferrari coefficient: 0.000000e+00 (default value, disabled)
Problem calling set grid params
Influence radius / neighbor search radius / expected cell side  : 0.052 / 0.052 / 0.052
 - World origin: 0 , 0 , 0
 - World size:   1.6 x 0.67 x 0.6
 - Cell size:    0.0533333 x 0.0558333 x 0.0545455
 - Grid size:    30 x 12 x 11 (3,960 cells)
 - Cell linearizazion: y,z,x
 - Dp:   0.02
 - R0:   0.02
Generating problem particles...
VTKWriter will write every 0.005 (simulated) seconds
HotStart checkpoints every 0.005 (simulated) seconds
    will keep the last 8 checkpoints
v4.1+custom
Allocating shared host buffers...
Numbodies : 1
Numforcesbodies : 1
numOpenBoundaries : 0
  allocated 1009.6 KiB on host for 13,601 particles (13,601 active)
Copying the particles to shared arrays...
---
Rigid body 1: 798 parts, mass nan, object mass 0
Open boundaries: 0
Fluid: 12800 parts, mass 0.008125
Boundary: 0 parts, mass 0
Testpoint: 3 parts
Tot: 13601 particles
---
RB First/Last Index:
    -12803  797
Preparing the problem...
Body: 0
     Cg grid pos: 17 6 5
     Cg pos: -0.00848052 -0.0279167 3.46945e-18
 - device at index 0 has 13,601 particles assigned and offset 0
Starting workers...
Thread 0x7faee0f000 global device id: 0 (1)
thread 0x7fae8371e0 device idx 0: CUDA device 0/1, PCI device 0000:00:00.0: NVIDIA Tegra X2
Device idx 0: free memory 614 MiB, total memory 7846 MiB
Estimated memory consumption: 508B/particle
number of forces rigid bodies particles = 798
Device idx 0 (CUDA: 0) allocated 0 B on host, 6.54 MiB on device
  assigned particles: 13,601; allocated: 13,601
GPUSPH: initialized
Performing first write...
Letting threads upload the subdomains...
Thread 0 uploading 13601 Position items (212.52 KiB) on device 0 from position 0
Thread 0 uploading 13601 Velocity items (212.52 KiB) on device 0 from position 0
Thread 0 uploading 13601 Info items (106.26 KiB) on device 0 from position 0
Thread 0 uploading 13601 Hash items (53.13 KiB) on device 0 from position 0
Entering the main simulation cycle
Simulation time t=0.000000e+00s, iteration=0, dt=3.900000e-04s, 13,601 parts (0, cum. 0 MIPPS), maxneibs 0
Simulation time t=5.154129e-03s, iteration=14, dt=3.231076e-04s, 13,601 parts (1.6, cum. 1.6 MIPPS), maxneibs 80
Simulation time t=7.129510e-03s, iteration=20, dt=3.713324e-04s, 13,601 parts (1.7, cum. 1.7 MIPPS), maxneibs 80
Simulation time t=1.010620e-02s, iteration=28, dt=3.651520e-04s, 13,601 parts (1.4, cum. 1.6 MIPPS), maxneibs 81
Simulation time t=1.083435e-02s, iteration=30, dt=3.626913e-04s, 13,601 parts (1.5, cum. 1.6 MIPPS), maxneibs 81
Simulation time t=1.514638e-02s, iteration=42, dt=3.499534e-04s, 13,601 parts (1.6, cum. 1.6 MIPPS), maxneibs 81
Simulation time t=1.800033e-02s, iteration=50, dt=3.665025e-04s, 13,601 parts (1.7, cum. 1.6 MIPPS), maxneibs 81
Simulation time t=2.019732e-02s, iteration=56, dt=3.769902e-04s, 13,601 parts (1.3, cum. 1.5 MIPPS), maxneibs 81
Simulation time t=2.165445e-02s, iteration=60, dt=3.777308e-04s, 13,601 parts (1.6, cum. 1.5 MIPPS), maxneibs 81
Simulation time t=2.533739e-02s, iteration=70, dt=3.600024e-04s, 13,601 parts (1.5, cum. 1.5 MIPPS), maxneibs 81
Simulation time t=3.015958e-02s, iteration=83, dt=3.818462e-04s, 13,601 parts (1.6, cum. 1.5 MIPPS), maxneibs 82
Simulation time t=3.277038e-02s, iteration=90, dt=3.818220e-04s, 13,601 parts (1.7, cum. 1.6 MIPPS), maxneibs 82
Simulation time t=3.500048e-02s, iteration=96, dt=3.805821e-04s, 13,601 parts (1.3, cum. 1.5 MIPPS), maxneibs 82
Simulation time t=3.647532e-02s, iteration=100, dt=3.681548e-04s, 13,601 parts (1.6, cum. 1.5 MIPPS), maxneibs 82
Simulation time t=4.020716e-02s, iteration=110, dt=3.576268e-04s, 13,601 parts (1.5, cum. 1.5 MIPPS), maxneibs 84
Simulation time t=4.534000e-02s, iteration=124, dt=3.738058e-04s, 13,601 parts (1.6, cum. 1.5 MIPPS), maxneibs 85
Simulation time t=4.755537e-02s, iteration=130, dt=3.684289e-04s, 13,601 parts (1.6, cum. 1.5 MIPPS), maxneibs 85
Simulation time t=5.013070e-02s, iteration=137, dt=3.830418e-04s, 13,601 parts (1.3, cum. 1.5 MIPPS), maxneibs 87
Simulation time t=5.127811e-02s, iteration=140, dt=3.555218e-04s, 13,601 parts (1.5, cum. 1.5 MIPPS), maxneibs 87
Simulation time t=5.528104e-02s, iteration=151, dt=3.819206e-04s, 13,601 parts (1.4, cum. 1.5 MIPPS), maxneibs 93
Simulation time t=5.850480e-02s, iteration=160, dt=3.546642e-04s, 13,601 parts (1.6, cum. 1.5 MIPPS), maxneibs 93
Simulation time t=6.032260e-02s, iteration=165, dt=3.295202e-04s, 13,601 parts (1.1, cum. 1.5 MIPPS), maxneibs 99
Simulation time t=6.215305e-02s, iteration=170, dt=3.509206e-04s, 13,601 parts (1.4, cum. 1.5 MIPPS), maxneibs 99
Simulation time t=6.500402e-02s, iteration=178, dt=3.668144e-04s, 13,601 parts (1.3, cum. 1.5 MIPPS), maxneibs 99
Simulation time t=6.574544e-02s, iteration=180, dt=3.755178e-04s, 13,601 parts (1.4, cum. 1.5 MIPPS), maxneibs 99
Simulation time t=7.015466e-02s, iteration=192, dt=3.795494e-04s, 13,601 parts (1.5, cum. 1.5 MIPPS), maxneibs 106
Simulation time t=7.312123e-02s, iteration=200, dt=3.776072e-04s, 13,601 parts (1.6, cum. 1.5 MIPPS), maxneibs 106
Simulation time t=7.534076e-02s, iteration=206, dt=3.711822e-04s, 13,601 parts (1.1, cum. 1.5 MIPPS), maxneibs 109
Simulation time t=7.681028e-02s, iteration=210, dt=3.629877e-04s, 13,601 parts (1.3, cum. 1.5 MIPPS), maxneibs 109
Simulation time t=8.007353e-02s, iteration=219, dt=3.739279e-04s, 13,601 parts (1.3, cum. 1.5 MIPPS), maxneibs 110
Simulation time t=8.044746e-02s, iteration=220, dt=3.575024e-04s, 13,601 parts (1.1, cum. 1.5 MIPPS), maxneibs 110
Simulation time t=8.528919e-02s, iteration=234, dt=3.611622e-04s, 13,601 parts (1.4, cum. 1.5 MIPPS), maxneibs 119
Simulation time t=8.731700e-02s, iteration=240, dt=3.615249e-04s, 13,601 parts (1.4, cum. 1.5 MIPPS), maxneibs 119
Simulation time t=9.020710e-02s, iteration=249, dt=3.563746e-04s, 13,601 parts (1.3, cum. 1.5 MIPPS), maxneibs 123
Simulation time t=9.056347e-02s, iteration=250, dt=3.531481e-04s, 13,601 parts (1.1, cum. 1.5 MIPPS), maxneibs 123
Simulation time t=9.516194e-02s, iteration=264, dt=2.900131e-04s, 13,601 parts (1.4, cum. 1.5 MIPPS), maxneibs 127
Simulation time t=9.707781e-02s, iteration=270, dt=3.319965e-04s, 13,601 parts (1.4, cum. 1.5 MIPPS), maxneibs 127
Simulation time t=1.000128e-01s, iteration=279, dt=3.541400e-04s, 13,601 parts (1.3, cum. 1.4 MIPPS), maxneibs 136
Simulation time t=1.003670e-01s, iteration=280, dt=3.588895e-04s, 13,601 parts (1.2, cum. 1.4 MIPPS), maxneibs 136
Simulation time t=1.051173e-01s, iteration=297, dt=2.594648e-04s, 13,601 parts (0.98, cum. 1.4 MIPPS), maxneibs 145
Simulation time t=1.058675e-01s, iteration=300, dt=2.323153e-04s, 13,601 parts (1.1, cum. 1.4 MIPPS), maxneibs 145
Simulation time t=1.100258e-01s, iteration=315, dt=2.829456e-04s, 13,601 parts (1.4, cum. 1.4 MIPPS), maxneibs 149
Simulation time t=1.115843e-01s, iteration=320, dt=3.023205e-04s, 13,601 parts (1.3, cum. 1.4 MIPPS), maxneibs 149
Simulation time t=1.150202e-01s, iteration=332, dt=2.643027e-04s, 13,601 parts (1.3, cum. 1.4 MIPPS), maxneibs 161
Simulation time t=1.170675e-01s, iteration=340, dt=3.592710e-04s, 13,601 parts (1.4, cum. 1.4 MIPPS), maxneibs 161
Simulation time t=1.200720e-01s, iteration=353, dt=2.100472e-04s, 13,601 parts (1.2, cum. 1.4 MIPPS), maxneibs 168
Simulation time t=1.219651e-01s, iteration=360, dt=2.795108e-04s, 13,601 parts (1.4, cum. 1.4 MIPPS), maxneibs 168
Simulation time t=1.252873e-01s, iteration=374, dt=1.990724e-04s, 13,601 parts (1.3, cum. 1.4 MIPPS), maxneibs 173
Simulation time t=1.265921e-01s, iteration=380, dt=2.693839e-04s, 13,601 parts (1.3, cum. 1.4 MIPPS), maxneibs 173
Simulation time t=1.301633e-01s, iteration=395, dt=2.499819e-04s, 13,601 parts (1.1, cum. 1.4 MIPPS), maxneibs 179
Simulation time t=1.313476e-01s, iteration=400, dt=2.050968e-04s, 13,601 parts (1.2, cum. 1.4 MIPPS), maxneibs 179
Simulation time t=1.350395e-01s, iteration=417, dt=1.727778e-04s, 13,601 parts (1.4, cum. 1.4 MIPPS), maxneibs 189
Simulation time t=1.356967e-01s, iteration=420, dt=2.657721e-04s, 13,601 parts (1.3, cum. 1.4 MIPPS), maxneibs 189
WARNING: current max. neighbors numbers 193 greather than MAXNEIBSNUM (192) at iteration 420
    possible culprit: -1 (neibs: 0)
WARNING: current max. neighbors numbers 198 greather than MAXNEIBSNUM (192) at iteration 430
    possible culprit: -1 (neibs: 0)
WARNING: current max. neighbors numbers 198 greather than MAXNEIBSNUM (192) at iteration 440
    possible culprit: -1 (neibs: 0)
Simulation time t=1.401277e-01s, iteration=441, dt=1.918957e-04s, 13,601 parts (1.3, cum. 1.4 MIPPS), maxneibs 198
Simulation time t=1.419053e-01s, iteration=450, dt=1.062711e-04s, 13,601 parts (1.2, cum. 1.4 MIPPS), maxneibs 198
WARNING: current max. neighbors numbers 203 greather than MAXNEIBSNUM (192) at iteration 450
    possible culprit: -1 (neibs: 0)
WARNING: current max. neighbors numbers 205 greather than MAXNEIBSNUM (192) at iteration 460
    possible culprit: -1 (neibs: 0)
Simulation time t=1.451039e-01s, iteration=467, dt=2.603823e-04s, 13,601 parts (1.3, cum. 1.4 MIPPS), maxneibs 205
Simulation time t=1.456839e-01s, iteration=470, dt=1.017485e-04s, 13,601 parts (1.2, cum. 1.4 MIPPS), maxneibs 205
WARNING: current max. neighbors numbers 209 greather than MAXNEIBSNUM (192) at iteration 470
    possible culprit: -1 (neibs: 0)
FATAL: timestep 2.09296e-08 under machine epsilon at iteration 476 - requesting quit...
WARNING: particle 1271 (id 645) has NAN position! (nan, nan, nan) @ (1, 0, 0) = (nan, nan, nan)
Simulation time t=1.463384e-01s, iteration=476, dt=2.092963e-08s, 13,601 parts (1, cum. 1.4 MIPPS), maxneibs 209
Elapsed time of simulation cycle: 4.8s
Peak particle speed was ~200.458 m/s at 0.146338 s -> can set maximum vel 2.2e+02 for this problem
Simulation end, cleaning up...
Deallocating...

This is as far as I got before deciding to reach out for help. Based on the results of the tests above, I believe the problem might have something to do with memory. Here are some rudimentary notes on memory that I found after some "Google research":

The GPUs on the Tegra TK1, TX1, and TX2 do not have their own memory. The GPU is hard-wired to the memory controller and shares system RAM. This also implies that the GPU is not limited by PCIe bus speeds and that the PCIe management functions of a GPU don't apply.

I do not know what limits there might be on how much system RAM can be used by the GPU.

The Maxwell and Pascal architectures combine the texture and L1 caches into a single unified cache. All global, local, surface, and texture operations go through this cache.

Here are some further links: "Jetson TX2 GPU memory"; "L1 cache vs shared memory"
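As a rough sanity check on the capacity question, the figures in the log suggest this run is nowhere near the reported free memory: roughly 6.6 MiB estimated (6.54 MiB actually allocated) against 648 MiB free. The sketch below just redoes that arithmetic with numbers copied from the log; the constant and function names are mine.

```cpp
#include <cstdint>

// Figures copied from the GPUSPH log of the first run.
constexpr std::uint64_t kBytesPerParticle = 508;    // "Estimated memory consumption: 508B/particle"
constexpr std::uint64_t kParticles        = 13601;  // "Tot: 13601 particles"
constexpr std::uint64_t kFreeMiB          = 648;    // "free memory 648 MiB"
constexpr std::uint64_t kMiB              = 1024 * 1024;

// Estimated device footprint in bytes.
constexpr std::uint64_t estimated_bytes()
{
    return kBytesPerParticle * kParticles;
}

// True when the estimate fits in the memory reported free at start-up.
constexpr bool fits_in_free_memory()
{
    return estimated_bytes() < kFreeMiB * kMiB;
}

static_assert(estimated_bytes() == 6909308, "508 B x 13,601 particles");
static_assert(fits_in_free_memory(), "estimate is well below 648 MiB");
```

This only rules out simply running out of memory; it says nothing about how the shared-memory architecture or the unified cache affects the failing access.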

Of course, I realize I haven't pinned down the source of the problem, but I've tried to provide as much info as I gathered along the way. I also realize that this is likely not the intended system for this application. I'm mainly interested in resolving this for development purposes (again, it's the only Nvidia GPU I have, and at ~$300 it's cheap for students to buy). After testing and development, I would later run the code on a dedicated machine with better resources. So if there is anything that can be done to help me resolve these issues, I'd be grateful.

Oh, and as for the host system, here are my compiler specs (output of g++ -v):

Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/5/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.10' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-arm64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-arm64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-arm64 --with-arch-directory=aarch64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.10)
