apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Coredump when joining big large_strings #33151

Open asfimport opened 1 year ago

asfimport commented 1 year ago

Joining large strings in pyarrow results in this error:


terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted (core dumped) 

Example code (note that this needs quite a lot of RAM; it was run on a machine with 128 GB):


import pyarrow as pa

# Two identical tables with 2**24 rows; each 'Text' value is a 1 KiB string.
ids = [x for x in range(2**24)]
text = ['a' * 2**10] * 2**24
schema = pa.schema([
    ('Id', pa.int32()),
    ('Text', pa.large_string()),
])

tab1 = pa.Table.from_arrays([ids, text], schema=schema)
tab2 = pa.Table.from_arrays([ids, text], schema=schema)

# Aborts with std::length_error (core dumped).
joined = tab1.join(tab2, keys='Id', right_keys='Id', left_suffix='tab1')
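
For orientation, a rough back-of-the-envelope estimate of the string data involved (my own arithmetic, not a measurement):

rows = 2**24                 # 16,777,216 rows per table
bytes_per_string = 2**10     # each 'Text' value is 1 KiB

# Once Arrow materializes the column, the string payload alone is:
per_text_column = rows * bytes_per_string   # 2**34 bytes
print(per_text_column / 2**30)              # 16.0 GiB per 'Text' column

# Two input tables plus the join output hold several copies of this data,
# which is why the reproducer was run on a 128 GB machine.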

The same code results in a segfault if I use this schema instead:


schema = pa.schema([
    ('Id', pa.int32()),
    ('Text', pa.string()),
    ])

Environment: run inside a Fedora container: registry.fedoraproject.org/fedora-toolbox:36

Host information (uname -a):

Linux ws1 5.18.16-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Aug 3 15:44:49 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

/etc/os-release:

NAME="Fedora Linux"
VERSION="36 (Container Image)"
ID=fedora
VERSION_ID=36
VERSION_CODENAME=""
PLATFORM_ID="platform:f36"
PRETTY_NAME="Fedora Linux 36 (Container Image)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:36"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f36/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=36
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=36
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Container Image"
VARIANT_ID=container

Reporter: flowpoint

Note: This issue was originally created as ARROW-17943. Please see the migration documentation for further details.

asfimport commented 1 year ago

Yibo Cai / @cyb70289: The code below triggers the same error log. Try it online: https://onlinegdb.com/UpqUsk4Zv It looks like this might be caused by an integer overflow, which leads to a huge buffer size greater than std::vector::max_size().


#include <vector>

int main() {
    std::vector<int> v;
    // Requesting SIZE_MAX elements exceeds max_size(), so this throws
    // std::length_error ("vector::_M_default_append" in libstdc++).
    v.resize(-1ULL);
    return 0;
}
asfimport commented 1 year ago

Yibo Cai / @cyb70289: The error comes from the line below, with total_length = -2128609280: https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/row_encoder.cc#L344
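
To make the suspected overflow concrete, here is a small sketch (my own illustration, not code from Arrow) of how the reproducer's string lengths exceed a signed 32-bit accumulator:

rows = 2**24                 # rows per table in the reproducer
bytes_per_row = 2**10        # each 'Text' value is 1024 bytes

total = rows * bytes_per_row # 17,179,869,184 bytes of string data
int32_max = 2**31 - 1        # 2,147,483,647

print(total > int32_max)     # True: the running offset cannot fit in int32

def wrap_int32(x):
    """Interpret x modulo 2**32 as a signed 32-bit value (two's complement)."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x

# Once the accumulated length exceeds 2**31 - 1 it wraps; with the per-row
# encoding overhead included, the wrapped sum can come out negative, which
# matches the sign of the reported total_length = -2128609280.
print(wrap_int32(total))     # 0 for the raw payload alone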

asfimport commented 1 year ago

Yibo Cai / @cyb70289: @michalursa, do you have any comments? Should we change offsets_ from int32 to int64?

asfimport commented 1 year ago

flowpoint: Cool, with this tip I think I got it to work for now for my use case. This is of course not a true solution: https://gist.github.com/flowpoint/08e76e9a90544009b298e5bea9219236
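
The gist is not reproduced here, but a workaround along these lines might look roughly like the sketch below. This is a hypothetical helper, not the gist's actual code: it assumes the crash can be avoided by joining the left table in slices small enough that no single join's string data approaches the int32 limit, and if the overflow happens on the right/build side, that table would need to be reduced as well.

import pyarrow as pa

def chunked_join(left: pa.Table, right: pa.Table, key: str,
                 rows_per_slice: int = 2**20) -> pa.Table:
    """Join `left` against `right` one slice of `left` at a time (hypothetical)."""
    pieces = []
    for start in range(0, left.num_rows, rows_per_slice):
        # Slice is clipped automatically at the end of the table.
        piece = left.slice(start, rows_per_slice)
        pieces.append(piece.join(right, keys=key, right_keys=key))
    # Concatenate the partial join results into one table.
    return pa.concat_tables(pieces)

With the reproducer above this would be called as, e.g., joined = chunked_join(tab1, tab2, key='Id').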

asfimport commented 1 year ago

Yibo Cai / @cyb70289: Glad to hear that, [~flowpoint]. Since you already have a patch (though not a true solution), I think it might be worth submitting a PR for review.

idailylife commented 7 months ago

Any update on this issue? The patch above doesn't seem to have been merged :(