Program ends in error using input_file

Issue Summary

On the current master branch, running ./dedisperse-gpu ../input_files/BenMeerKAT.txt from within the build/ directory starts the program, which then terminates with:

De-dispersing...
CUDA error at host_main_function.cu:234 code=13(cudaErrorInvalidSymbol) "cudaGetLastError()"

Full console output here:

./dedisperse-gpu ../input_files/BenMeerKAT.txt 

 Using standard GPU code
range:      5
debug:      1
multi_file: 1
analysis:   1
output_dmt: 0
sigma_cutoff:   6.000000
power:      2.000000
User requested DM search range:
0.000000    370.000000  0.307000    1
370.000000  740.000000  0.652000    2
740.000000  1480.000000 1.266000    4
1480.000000 2950.000000 2.512000    8
2950.000000 5000.000000 4.000000    16
Got user input:     8.000000000000001e-05(s)

12  HEADER_START
11  source_name
37  P: 3000.000000000000 ms, DM: 1500.000
10  machine_id
12  telescope_id
9   data_type
4   fch1
4   foff
6   nchans
5   nbits
6   tstart
5   tsamp
4   nifs
10  HEADER_END
 Using standard GPU code
tsamp:          0.000064
tstart:         50000.000000
fch1:           1564.000000
foff:           -0.208984
nchans:         2048
nifs:           1
nbits:          8
nsamples:       0
nsamp:          937984
Got file header info:   0.248815(s)

 Using standard GPU code
Maxshift efficiency:        100.00%
Host Input size:        3664 MB
Host Output size:       0 MB
Device Input size:      0 MB
Device Output size:     0 MB
Allocated memory:   0.248893(s)

 Using standard GPU code
Got input filterbank data:  1.301598(s)
Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN X"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 12207 MBytes (12799770624 bytes)
  GPU Clock rate:                                1076 MHz (1.08 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0

 Using standard GPU code
Initialised GPU:        1.423872(s)

Maximum number of dm trials in any of the range steps:  1240
Range:  4, MAXSHIFT:    118496, Scrunch value:  16
Maximum dispersive delay:   7.58 (s)
Diagonal DM:    90.793449
In 4

Maxshift memory needed: 462 MB
Output memory needed:   925 MB
 Using standard GPU code

maximum DM:     5119.000000
maxshift:       118496
max_ndms:       1240
Actual DM range that will be searched:
0.000000    380.680023  0.307000    1240
380.680023  771.880005  0.652000    600
771.880005  1531.479980 1.266000    600
1531.479980 3038.680176 2.512000    600
3038.680176 5118.680176 4.000000    520
Calculated strategy:    1.423929(s)

 Using standard GPU code
Maxshift efficiency:        87.37%
Host Input size:        3664 MB
Host Output size:       5603 MB
Device Input size:      0 MB
Device Output size:     0 MB
Allocated memory:   1.428153(s)

774368

 Using standard GPU code
Maxshift efficiency:        87.37%
Host Input size:        3664 MB
Host Output size:       5603 MB
Device Input size:      3024 MB
Device Output size:     6049 MB
Allocated memory:   1.433887(s)

----------------------- MSD info ---------------------------
  Memory required by boxcar filters:17063.320 MB
  Memory available:2765.812 MB 
  Max samples: :105967552

  DMs_per_cycle: 160
  Size MSD: 1024    Size workarea: 781, int: 32
------------------------------------------------------------

De-dispersing...
CUDA error at host_main_function.cu:234 code=13(cudaErrorInvalidSymbol) "cudaGetLastError()"
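
As a sanity check on the log, here is a minimal Python sketch (not part of astro-accelerate) that recomputes a few of the reported quantities from the values printed above. The 2 bytes per sample used for the host input size is an assumption inferred from the reported 3664 MB, and the relation "upper bound = lower bound + step * number of trials" is likewise inferred from the printed table rather than taken from the code.

```python
# Values copied from the console output above.
tsamp = 0.000064        # sampling time in seconds
nchans = 2048
nsamp = 937984
maxshift = 118496       # in samples

# Maximum dispersive delay: maxshift samples times the sampling time.
max_delay_s = maxshift * tsamp

# Assumption: 2 bytes per sample on the host (inferred, not stated in the log).
BYTES_PER_SAMPLE = 2
host_input_mb = nsamp * nchans * BYTES_PER_SAMPLE / 2**20

# (low DM, DM step, number of trials) from
# "Actual DM range that will be searched".
ranges = [
    (0.0, 0.307, 1240),
    (380.680023, 0.652, 600),
    (771.880005, 1.266, 600),
    (1531.479980, 2.512, 600),
    (3038.680176, 4.0, 520),
]
highs = [low + step * ndms for low, step, ndms in ranges]

print(f"max dispersive delay: {max_delay_s:.2f} s")
print(f"host input size: {host_input_mb:.0f} MB")
for high in highs:
    print(f"range upper bound: {high:.2f}")
```

Under those assumptions the recomputed delay, host input size, and range upper bounds agree with the figures in the log (7.58 s, 3664 MB, and 380.68 through 5118.68).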

Steps to Reproduce

Clone the astro-accelerate repository and compile as usual.

From within the build/ directory, run ./dedisperse-gpu ../input_files/BenMeerKAT.txt.

Expected Outcome

A graceful exit, either with output or with an explanation of why the program cannot continue.

Actual Outcome

The program terminates with the cudaErrorInvalidSymbol error shown above.

Configuration

No changes to the default configuration. Running on the astraios machine with CUDA 8.0.

Notes

N/A.

AstroAccelerateOrg / astro-accelerate, issue #86