trivial_copy_n: an illegal memory access was encountered

marklit commented 8 years ago

Hi,

I'm loading an 8.7 GB, 20-million-line CSV file into Alenka. The import starts out well, with a number of the .hash files growing to 50 MB+, but around 10 minutes into the load command I get an "illegal memory access was encountered" error.

I've compiled the master branch of Alenka (commit 59022b5) on Ubuntu 16.04 64-bit with CUDA 8 and I'm running it with an Nvidia GTX 1080 and the 367.48 driver.

Here are the steps I took that led up to the issue:

$ cat load.sql

A  :=  LOAD 'trips_xaa.csv' USING (',') AS (
    trip_id{1}:int,
    vendor_id{2}:varchar(3),

    pickup_datetime{3}:varchar(19),

    dropoff_datetime{4}:varchar(19),
    store_and_fwd_flag{5}:varchar(1),
    rate_code_id{6}:int,
    pickup_longitude{7}:DECIMAL(14,2),
    pickup_latitude{8}:DECIMAL(14,2),
    dropoff_longitude{9}:DECIMAL(14,2),
    dropoff_latitude{10}:DECIMAL(14,2),
    passenger_count{11}:int,
    trip_distance{12}:DECIMAL(14,2),
    fare_amount{13}:DECIMAL(14,2),
    extra{14}:DECIMAL(14,2),
    mta_tax{15}:DECIMAL(14,2),
    tip_amount{16}:DECIMAL(14,2),
    tolls_amount{17}:DECIMAL(14,2),
    ehail_fee{18}:DECIMAL(14,2),
    improvement_surcharge{19}:DECIMAL(14,2),
    total_amount{20}:DECIMAL(14,2),
    payment_type{21}:varchar(3),
    trip_type{22}:int,
    pickup{23}:varchar(50),
    dropoff{24}:varchar(50),

    dummy1{25}:varchar(50),
    dummy2{26}:varchar(50),

    cab_type{27}:varchar(6),

    precipitation{28}:int,
    snow_depth{29}:int,
    snowfall{30}:int,
    max_temperature{31}:int,
    min_temperature{32}:int,
    average_wind_speed{33}:int,

    pickup_nyct2010_gid{34}:int,
    pickup_ctlabel{35}:varchar(10),
    pickup_borocode{36}:int,
    pickup_boroname{37}:varchar(13),
    pickup_ct2010{38}:varchar(6) ,
    pickup_boroct2010{39}:varchar(7) ,
    pickup_cdeligibil{40}:varchar(1) ,
    pickup_ntacode{41}:varchar(4) ,
    pickup_ntaname{42}:varchar(56),
    pickup_puma{43}:varchar(4) ,

    dropoff_nyct2010_gid{44}:int,
    dropoff_ctlabel{45}:varchar(10),
    dropoff_borocode{46}:int,
    dropoff_boroname{47}:varchar(13),
    dropoff_ct2010{48}:varchar(6) ,
    dropoff_boroct2010{49}:varchar(7) ,
    dropoff_cdeligibil{50}:varchar(1) ,
    dropoff_ntacode{51}:varchar(4) ,
    dropoff_ntaname{52}:varchar(56),
    dropoff_puma{53}:varchar(4) 
);
STORE A INTO 'trips' BINARY;

$ ~/Alenka_master/alenka load.sql

GeForce GTX 1080 : 1835.000 Mhz   (Ordinal 0)
20 SMs enabled. Compute Capability sm_61
FreeMem:   6941MB   TotalMem:   8110MB   64-bit pointers.
Mem Clock: 5005.000 Mhz x 256 bits   (320.3 GB/s)
ECC Disabled

Executing file:
Couldn't open data dictionary
LOAD: A trips_xaa.csv 53  , 
Append 0
STORE: A trips 
set a piece to 1000000000 6276186112
processed recs 6441843 2350317568
processed recs 6441843 2337603584
processed recs 6441843 2323513344
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  failed synchronize in thrust::system::cuda::detail::trivial_copy_n: an illegal memory access was encountered
Aborted (core dumped)

A few minutes before the exception nvidia-smi was showing the following:

Sun Oct 16 17:34:57 2016       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0      On |                  N/A |
| 24%   50C    P2    45W / 200W |   6163MiB /  8110MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       960    G   /usr/lib/xorg/Xorg                             705MiB |
|    0     15016    G   compiz                                         194MiB |
|    0     21266    G   ...isallowFetchForDocWrittenScriptsInMainFra   107MiB |
|    0     21714    C   /home/mark/Alenka_master/alenka               5147MiB |
+-----------------------------------------------------------------------------+

These were the last files to be modified before the exception:

$ ls -alht | head

total 18G
-rw-rw-r--  1 mark mark  40K okt   16 18:22 trips.dropoff_puma
-rw-rw-r--  1 mark mark  13M okt   16 18:22 trips.dropoff_puma.2.idx
-rw-rw-r--  1 mark mark   20 okt   16 18:22 trips.dropoff_puma.header
drwxrwxr-x  2 mark mark  20K okt   16 18:22 .
-rw-rw-r--  1 mark mark  50M okt   16 18:22 trips.dropoff_puma.2.hash
-rw-rw-r--  1 mark mark 6,2M okt   16 18:22 trips.dropoff_ntaname.2.idx
-rw-rw-r--  1 mark mark   20 okt   16 18:22 trips.dropoff_ntaname.header
-rw-rw-r--  1 mark mark  50M okt   16 18:22 trips.dropoff_ntaname.2.hash
-rw-rw-r--  1 mark mark 6,2M okt   16 18:22 trips.dropoff_ntacode.2.idx

Here are the last few lines of strace:

...
clock_gettime(CLOCK_MONOTONIC_RAW, {30119, 174334632}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {30119, 174345876}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {30119, 174357176}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {30119, 174368495}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {30119, 174379727}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {30119, 174394497}) = 0
ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7fffb66b4840) = 0
ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7fffb66b4810) = 0
write(16, "\253", 1)                    = 1
futex(0x7fbaa6088680, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(2, "terminate called after throwing "..., 48terminate called after throwing an instance of ') = 48
write(2, "thrust::system::system_error", 28thrust::system::system_error) = 28
write(2, "'\n", 2'
)                      = 2
write(2, "  what():  ", 11  what():  )             = 11
write(2, "failed synchronize in thrust::sy"..., 108failed synchronize in thrust::system::cuda::detail::trivial_copy_n: an illegal memory access was encountered) = 108
write(2, "\n", 1
)                       = 1
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
tgkill(22953, 22953, SIGABRT)           = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=22953, si_uid=1000} ---
+++ killed by SIGABRT (core dumped) +++
Aborted (core dumped)

Any idea what might have caused this issue or what I can do to work around it? I'm happy to provide more telemetry if needed.

Cheers, Mark

antonmks commented 8 years ago

I have never seen this error message before. Can I get that CSV file somewhere to test the load on my machine? Also, do you really need to load all the fields? It seems that you need just a few for the queries. You can try loading just those and see if you still get the error.

marklit commented 8 years ago

I'll email you a link to the file.

I'll play around with loading in a reduced set of data in the meantime.

antonmks commented 8 years ago

I loaded the data successfully. The only issue I had was an 'out of memory' error, so I had to reduce the segment size to 500 MB:

./alenka -l 500 load_trips.sql

Can you try it?

marklit commented 8 years ago

Certainly, I'll try that and report back.

marklit commented 8 years ago

With 500 as a parameter I got that exception again but with 200 it loaded just fine.

~/Alenka_master/alenka -l 500 load.sql

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  failed synchronize in thrust::system::cuda::detail::trivial_copy_n: an illegal memory access was encountered

~/Alenka_master/alenka -l 200 load.sql
~/Alenka_master/alenka query.sql

...
mRecCount=1 mcount = 1 term 1 limit=0 print_all=1
|20000046 |

Thanks for your help on that one.
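
For anyone hitting the same error, the trial-and-error above can be scripted. This is only a sketch: the `-l` flag and the sizes 500 and 200 come from this thread, while the function name `load_with_fallback` and the idea of stepping down automatically are my own assumptions. It relies on alenka exiting with a non-zero status when it aborts.

```shell
#!/bin/sh
# Hypothetical helper (not part of Alenka): run a load script with
# progressively smaller segment sizes (-l, in MB) until one succeeds.
# Usage: load_with_fallback <alenka-binary> <script> <size> [<size> ...]
load_with_fallback() {
    bin=$1
    script=$2
    shift 2
    for size in "$@"; do
        echo "trying segment size: $size" >&2
        # An aborted run (e.g. SIGABRT) yields a non-zero exit status.
        if "$bin" -l "$size" "$script"; then
            echo "$size"    # print the first size that worked
            return 0
        fi
    done
    return 1                # every size failed
}
```

It could be called as `load_with_fallback ~/Alenka_master/alenka load.sql 500 200 100`. A failed attempt may leave partial trips.* files behind; whether those need clearing before a retry is not covered in this thread.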

antonmks commented 8 years ago

While running a query on your data I found and fixed a bug in Alenka. I updated the master branch, so please update if you have any issues.

marklit commented 8 years ago

Good stuff, I'll re-compile Alenka before I start the 1.1B record import. I've earmarked Saturday to get started on this.

antonmks commented 8 years ago

Don't forget to use APPEND when loading consecutive files!


antonmks commented 8 years ago

I made a few changes to Alenka, including the addition of a CAST operator necessary for your queries. Also, please note that in a load script the types should be specified in lowercase, like "decimal", not "DECIMAL"; otherwise it is not going to work, because Alenka is case sensitive. I tested your queries and a new load script; if you need them, I attached them all to this message.

Best regards,

Anton
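
As a sketch of the lowercase-types point, the filter below rewrites only the type name that follows `{N}:` in each field definition, as in the load script posted at the top of this issue. The function name `lowercase_types` is hypothetical, and it assumes every field follows the `name{N}:type` layout shown there; it leaves keywords such as LOAD and STORE untouched.

```shell
#!/bin/sh
# Hypothetical helper (not part of Alenka): lowercase the type names in a
# load script, e.g. "fare_amount{13}:DECIMAL(14,2)" -> "...:decimal(14,2)".
# Reads the named files, or standard input when no file is given.
lowercase_types() {
    awk '{
        out = ""
        line = $0
        # Find each "}:TYPE" occurrence and lowercase just that match.
        while (match(line, /[}]:[A-Za-z]+/)) {
            out = out substr(line, 1, RSTART - 1) tolower(substr(line, RSTART, RLENGTH))
            line = substr(line, RSTART + RLENGTH)
        }
        print out line
    }' "$@"
}
```

For example, `lowercase_types load.sql > load_lower.sql` writes a converted copy and leaves the original file as it was.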


marklit commented 8 years ago

Thanks Anton.

I'm not seeing the attachments here, could you send them over again please?

antonmks commented 8 years ago

That was an old message; I remember that after that I sent all the queries to you as text in an email. It might take me a few days to add what I need to make queries 3 and 4 run. I'll try to do it this weekend.

Anton


marklit commented 8 years ago

Cool. I'll earmark Sunday evening again to have another go with all this.

antonmks commented 8 years ago

I fixed an issue with APPEND and groupby operators, so Q1 should work. Unfortunately you have to reload the data. I'll start working on the rest of the queries.

marklit commented 8 years ago

Great, I'll recompile and import the data again on Sunday and report back.

antonmks / Alenka

trivial_copy_n: an illegal memory access was encountered #103