apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while reading large number of files using ParquetDataset #26000

Closed asfimport closed 3 years ago

asfimport commented 4 years ago

https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe

I have a dataframe split and stored in more than 5000 files. I use ParquetDataset(fnames).read() to load all the files. I updated pyarrow from 0.13.0 to the latest version, 1.0.1, and it has started throwing "OSError: Out of memory: malloc of size 131072 failed". The same code on the same machine still works with the older version. My machine has 256GB of memory, way more than enough to load the data, which requires < 10GB. You can use the code below to reproduce the issue on your side.


import pandas as pd
import numpy as np
import pyarrow.parquet as pq

def generate():
    # create a big dataframe

    df = pd.DataFrame({'A': np.arange(50000000)})
    df['F1'] = np.random.randn(50000000) * 100
    df['F2'] = np.random.randn(50000000) * 100
    df['F3'] = np.random.randn(50000000) * 100
    df['F4'] = np.random.randn(50000000) * 100
    df['F5'] = np.random.randn(50000000) * 100
    df['F6'] = np.random.randn(50000000) * 100
    df['F7'] = np.random.randn(50000000) * 100
    df['F8'] = np.random.randn(50000000) * 100
    df['F9'] = 'ABCDEFGH'
    df['F10'] = 'ABCDEFGH'
    df['F11'] = 'ABCDEFGH'
    df['F12'] = 'ABCDEFGH01234'
    df['F13'] = 'ABCDEFGH01234'
    df['F14'] = 'ABCDEFGH01234'
    df['F15'] = 'ABCDEFGH01234567'
    df['F16'] = 'ABCDEFGH01234567'
    df['F17'] = 'ABCDEFGH01234567'

    # split and save data to 5000 files
    for i in range(5000):
        df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)

def read_works():
    # below code works to read
    df = []
    for i in range(5000):
        df.append(pd.read_parquet(f'{i}.parquet'))

    df = pd.concat(df)

def read_errors():
    # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
    # tried use_legacy_dataset=False, same issue

    fnames = []
    for i in range(5000):
        fnames.append(f'{i}.parquet')

    len(fnames)

    df = pq.ParquetDataset(fnames).read(use_threads=False)

Reporter: Ashish Gupta

Related issues:

Note: This issue was originally created as ARROW-9974. Please see the migration documentation for further details.

asfimport commented 4 years ago

Ashish Gupta: Apologies, the sample code hasn't formatted nicely; please refer to the StackOverflow post.

asfimport commented 4 years ago

Wes McKinney / @wesm: cc @bkietz @jorisvandenbossche

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche: [~kgashish] thanks for opening the issue here!

As mentioned on StackOverflow, I can't reproduce the issue locally (I tried with 10x smaller data because I don't have enough RAM otherwise, but then don't get the error), so some follow-up questions:

1) Can you show the full traceback for both cases, with use_legacy_dataset set to True and to False?
2) Do you still get the error if you e.g. comment out adding columns F15 to F17?

asfimport commented 4 years ago

Ashish Gupta: 1) Please find attached full traceback of both cases legacy_false.txt

 

2) I started removing columns one by one, and below is the smallest dataframe where adding the last column, F11, causes both scenarios (use_legacy_dataset set to True and to False) to throw the error. Interestingly, use_legacy_dataset=True starts crashing after the first 5 columns. If you have more than 8GB of RAM you should be able to test it.


# create a big dataframe
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': np.arange(50000000)})
df['F1'] = np.random.randn(50000000)
df['F2'] = np.random.randn(50000000)
df['F3'] = np.random.randn(50000000)
df['F4'] = 'ABCDEFGH'
df['F5'] = 'ABCDEFGH'
df['F6'] = 'ABCDEFGH'
df['F7'] = 'ABCDEFGH'
df['F8'] = 'ABCDEFGH'
df['F9'] = 'ABCDEFGH'
df['F10'] = 'ABCDEFGH'
# df['F11'] = 'ABCDEFGH'


asfimport commented 4 years ago

Krisztian Szucs / @kszucs: I was also unable to reproduce the error; I tried both of your examples on master and on 1.0.1. Could you show the backtrace from the coredump?

asfimport commented 4 years ago

Ashish Gupta: It seems to have something to do with the operating system. The code is crashing on a machine running CentOS 8 Linux. I tried the same code with the same versions of conda/pandas/pyarrow on a Windows PC and it worked. Would it be possible for you to test it on CentOS 8?

cat /etc/os-release
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

 

I am not sure how to get a backtrace from the coredump; if you can advise, I will try to provide it to you.

 

asfimport commented 4 years ago

Krisztian Szucs / @kszucs: Sure!

First you need to enable coredumps on the system, usually by running "ulimit -c unlimited". Then trigger the segfault again, which should generate a file named "core", usually in the same directory. Once the core file is available you can examine it with "gdb path/to/executable path/to/core", in this case "gdb python ./core" if you are located where the core file is.

Inside gdb running "bt" command should show you the backtrace of the segfault.

asfimport commented 4 years ago

Ashish Gupta: I have test.py as below


import pyarrow.parquet as pq
fnames = []
for i in range(5000):
    fnames.append(f'{i}.parquet')
len(fnames)
df = pq.ParquetDataset(fnames, use_legacy_dataset=True).read(use_threads=False)

 

With use_legacy_dataset=True, there is no core dump, just the error below:

 


Traceback (most recent call last):
 File "test.py", line 9, in <module>
 df = pq.ParquetDataset(fnames, use_legacy_dataset=True).read(use_threads=False)
 File "/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py", line 1271, in read
 table = piece.read(columns=columns, use_threads=use_threads,
 File "/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py", line 718, in read
 table = reader.read(**options)
 File "/data/install/anaconda3/lib/python3.8/site-packages/pyarrow/parquet.py", line 326, in read
 return self.reader.read_all(column_indices=column_indices,
 File "pyarrow/_parquet.pyx", line 1125, in pyarrow._parquet.ParquetReader.read_all
 File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Out of memory: realloc of size 65600 failed
cannot allocate memory for thread-local data: ABORT


With use_legacy_dataset=False, there is a core dump, but I think there is some issue reading the coredump file:

 


Reading symbols from /data/install/anaconda3/bin/python...done.
BFD: warning: /data/dump/temp/core.python.862529738.cd627b19559c40969f42d3bb01c5e03d.739784.1601400696000000 is truncated: expected core file size >= 4752101376, found: 2147483648
warning: core file may not match specified executable file.
[New LWP 739784]
[New LWP 739791]
[New LWP 739795]
[New LWP 739788]
[New LWP 739785]
[New LWP 739792]
[New LWP 739790]
[New LWP 739794]
[New LWP 739793]
[New LWP 739789]
Cannot access memory at address 0x7f86e6283128
Cannot access memory at address 0x7f86e6283120
Failed to read a valid object file image from memory.
Core was generated by `/data/install/anaconda3/bin/python test.py'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f86e5aae70f in ?? ()
[Current thread is 1 (LWP 739784)]
(gdb) bt
#0 0x00007f86e5aae70f in ?? ()
Backtrace stopped: Cannot access memory at address 0x7ffcf1fb8c80

Would you be able to run the example on a machine running CentOS 8?

asfimport commented 4 years ago

Krisztian Szucs / @kszucs: I have not had the time to test it on centos 8 yet, but will try to reproduce it during the week.

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: The error message ("OSError: Out of memory: malloc of size 131072 failed") tells us that the failure is returned by the glibc memory allocator, not by the jemalloc allocator which is used by Arrow for array data. Also, the failed allocation is tiny (128 kB). This hints at a possible heap fragmentation problem.

I recommend you try playing with the glibc malloc tunables, especially the MALLOC_MMAP_THRESHOLD_ environment variable (note the trailing underscore), for example MALLOC_MMAP_THRESHOLD_=65536. See https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html for reference.
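Since glibc reads these tunables from the environment at startup, the variable has to be set before the Python process launches (e.g. MALLOC_MMAP_THRESHOLD_=65536 python test.py). A minimal Python sketch of the same idea, assuming the reproducer lives in test.py as in this thread:

# Hypothetical launcher: set the glibc tunable in the child's environment and
# re-run the reproducer. Changing os.environ after startup would not affect the
# already-initialized allocator of the current process.
import os
import subprocess
import sys

env = dict(os.environ, MALLOC_MMAP_THRESHOLD_="65536")
subprocess.run([sys.executable, "test.py"], env=env, check=True)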

asfimport commented 4 years ago

Ben Kietzman / @bkietz: I'm moving this out of the 2.0 release since we don't have a clean reproducer and it's evidently centos8-only. If we find a consistent reproducer then we can pull it back into 2.0

asfimport commented 4 years ago

Ashish Gupta: Anyone tried to reproduce on centos-8?

asfimport commented 4 years ago

Antoine Pitrou / @pitrou: [~kgashish] Can you try what I suggested above?

asfimport commented 4 years ago

Ashish Gupta: Tried...

export MALLOC_MMAP_THRESHOLD_=65536

same error "OSError: Out of memory: malloc of size 131072 failed"

 

asfimport commented 3 years ago

Weston Pace / @westonpace: I attempted to reproduce this on CentOS 8 and was not successful. I used an Amazon EC2 m4.large instance with the following AMI image: https://aws.amazon.com/marketplace/pp/B08KYLN2CG

Since the instance only has 8GB of RAM I used the smaller dataset example you posted.

Ironically the read_works method failed at pd.concat due to an out of RAM error (a legitimate one as pandas needed an additional 2.1 GB which was not available).

The read_errors method succeeded with use_legacy_dataset set to true or false.

It appears the core file you generated ran into some kind of 2GB limit.  Since you have 256GB on the machine your core file could be quite large.  Try following the advice here (https://stackoverflow.com/questions/43341954/is-2g-the-limit-size-of-coredump-file-on-linux) to see if you are able to make any further progress.
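One way to rule out a per-process limit is to raise RLIMIT_CORE from inside the reproducer before triggering the crash; a minimal sketch using Python's standard resource module (Linux-only, and a hard limit set by the administrator or the filesystem can still cap the core size):

# Raise the soft core-dump size limit for this process up to the hard limit,
# so the generated core file is not truncated at an artificial boundary.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
print("core size limit raised from", soft, "to", hard)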

 

Details of the test machine:

 

[centos@ip-172-30-0-34 ~]$ cat /etc/redhat-release
CentOS Linux release 8.2.2004 (Core)

[centos@ip-172-30-0-34 ~]$ uname -a
Linux ip-172-30-0-34.ec2.internal 4.18.0-193.19.1.el8_2.x86_64 #1 SMP Mon Sep 14 14:37:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

[centos@ip-172-30-0-34 ~]$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"

[centos@ip-172-30-0-34 ~]$ python3 -mpip freeze
asn1crypto==0.24.0 Babel==2.5.1 cffi==1.11.5 chardet==3.0.4 cloud-init==19.4 configobj==5.0.6 cryptography==2.3 dbus-python==1.2.4 decorator==4.2.1 gpg==1.10.0 idna==2.5 Jinja2==2.10.1 jsonpatch==1.21 jsonpointer==1.10 jsonschema==2.6.0 MarkupSafe==0.23 netifaces==0.10.6 numpy==1.19.4 oauthlib==2.1.0 pandas==1.1.5 pciutils==2.3.6 perf==0.1 ply==3.9 prettytable==0.7.2 pyarrow==1.0.1 pycairo==1.16.3 pycparser==2.14 pygobject==3.28.3 PyJWT==1.6.1 pyOpenSSL==18.0.0 pyserial==3.1.1 PySocks==1.6.8 python-dateutil==2.8.1 python-dmidecode==3.12.2 python-linux-procfs==0.6 pytz==2017.2 pyudev==0.21.0 PyYAML==3.12 requests==2.20.0 rhnlib==2.8.6 rpm==4.14.2 schedutils==0.6 selinux==2.9 sepolicy==1.1 setools==4.2.2 setroubleshoot==1.1 six==1.11.0 slip==0.6.4 slip.dbus==0.6.4 syspurpose==1.26.20 systemd-python==234 urllib3==1.24.2

asfimport commented 3 years ago

Ashish Gupta: Thanks for looking into this. I suspect the dataframe size is too small to show the error. Did you try gradually increasing the size by say 5% and try read_error? 

asfimport commented 3 years ago

Weston Pace / @westonpace: I was able to reproduce memory issues at 65000000 rows. However, this is about the point where I would expect to see issues anyway (since I was nearing the 8GB limit). I did not see the errors you were seeing because the system was overcommitting and triggering the OOM killer instead of hard stopping the allocation. Once I changed the overcommit mode to 2 (no overcommit) I started getting malloc/realloc errors. I'm wondering if this may be related to the issues you are seeing. Can you share some more information with me?

Is this running on a VM or a dedicated server?

What is the output of sysctl vm.overcommit_memory?

What is the output of cat /proc/meminfo?

What is the output of free?

asfimport commented 3 years ago

Ashish Gupta: This is a dedicated physical server.

sysctl vm.overcommit_memory
vm.overcommit_memory = 2

 

cat /proc/meminfo
MemTotal: 263518320 kB
MemFree: 34640640 kB
MemAvailable: 247394700 kB
Buffers: 52 kB
Cached: 217424924 kB
SwapCached: 5308 kB
Active: 175441652 kB
Inactive: 46026880 kB
Active(anon): 3637200 kB
Inactive(anon): 540420 kB
Active(file): 171804452 kB
Inactive(file): 45486460 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 4194300 kB
SwapFree: 3900668 kB
Dirty: 8 kB
Writeback: 0 kB
AnonPages: 4025572 kB
Mapped: 350944 kB
Shmem: 185972 kB
KReclaimable: 3498500 kB
Slab: 5042144 kB
SReclaimable: 3498500 kB
SUnreclaim: 1543644 kB
KernelStack: 19744 kB
PageTables: 27404 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 135953460 kB
Committed_AS: 6058728 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
Percpu: 82240 kB
HardwareCorrupted: 0 kB
AnonHugePages: 548864 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 8881248 kB
DirectMap2M: 137568256 kB
DirectMap1G: 121634816 kB

 

free
              total        used        free      shared  buff/cache   available
Mem:      263518320    10488492    31968144      185980   221061684   244854292
Swap:       4194300      293632     3900668

 

 

asfimport commented 3 years ago

Weston Pace / @westonpace: I believe what is happening is that the ParquetDataset approach is using more memory. That is because the pyarrow.Table in-memory representation is larger than the pandas dataframe in-memory representation (in this case), specifically for strings. Arrow stores each string as an array of bytes plus a 4-byte offset, so each instance of 'ABCDEFGH' occupies 12 bytes of RAM. On the other hand, pandas stores an 8-byte pointer to an 8-byte string. If all the strings were different this would mean 16 bytes per string, but since they are all the same the string instance is shared, so it is more or less 8 bytes.
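As a rough way to check this per-value arithmetic yourself, you can convert a single broadcast string column and look at the Arrow buffer sizes. A minimal sketch (much smaller row count, and ChunkedArray.nbytes is used as an approximation of the Arrow footprint):

import pandas as pd
import pyarrow as pa

n = 1_000_000
df = pd.DataFrame(index=range(n))
df['F4'] = 'ABCDEFGH'  # one shared 8-character Python string, as in the reproducer

table = pa.Table.from_pandas(df)
col = table.column('F4')
# Expect roughly 12 bytes per value: 8 bytes of UTF-8 data plus a 4-byte offset
# (plus a small validity bitmap).
print(col.nbytes / n)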

So for the smaller table, arrow is using ~5.8GB while pandas is using ~4.4GB.  This explains why it fails in 1.0.1 but does not explain...

"256Gb memory way more than enough to load the data which requires < 10Gb"

For this I think the problem is simply that your system is not allowing each process to use 256GB. With "vm.overcommit_memory = 2" the OS avoids overcommitting entirely. In addition, a large portion of RAM (~120GB) is reserved for the kernel (this is tunable and you might want to consider tuning it, since this is a rather large amount to reserve for the kernel). The remaining 135953460 kB (seen as CommitLimit in the meminfo) is shared across all processes. Since overcommitting is disabled, this tracks the reserved (not used) RAM from all processes.
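If you want to see that headroom directly, the same numbers can be read back out of /proc/meminfo; a minimal sketch (Linux-only):

# Parse CommitLimit and Committed_AS (both in kB) to see how much reservable
# memory remains when overcommit is disabled.
fields = {}
with open('/proc/meminfo') as f:
    for line in f:
        key, value = line.split(':', 1)
        fields[key] = int(value.split()[0])
print('commit headroom kB:', fields['CommitLimit'] - fields['Committed_AS'])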

To confirm all of this I suggest two tests.

1) Confirm how much RAM is actually in use by python / pyarrow

Change the last lines of your script to...

 


import pyarrow as pa  # needed for the memory-pool inspection below

try:
    read_errors()
except:
    # Record Arrow's peak allocation, then pause so the process can be
    # inspected from another shell (e.g. via /proc/<pid>/status).
    max_bytes = pa.default_memory_pool().max_memory()
    input("Press Enter to continue...")
    print(f'Arrow bytes in use: {max_bytes}')
    raise

This will pause the program at the crash and allow you to inspect the memory.  You can do this by looking up the process...



(base) [centos@ip-172-30-0-34 ~]$ ps -eaf | grep -i python
root         880       1  0 Dec18 ?        00:00:41 /usr/libexec/platform-python -Es /usr/sbin/tuned -l -P
centos    228668  225417 90 17:45 pts/0    00:00:16 python experiment.py
centos    228680  228186  0 17:45 pts/1    00:00:00 grep --color=auto -i python

...and then lookup the RAM usage of the process.



(base) [centos@ip-172-30-0-34 ~]$ cat /proc/228668/status | grep -i vmdata
VmData:  3445600 kB

In addition, the experiment will print how many bytes arrow was using (to help distinguish from RAM used by the python heap and RAM reserved but not in use)...



Press Enter to continue...^[[A
Arrow bytes in use: 2601828992
Traceback (most recent call last):

So, even though my system has 8GB of RAM, because 4GB is reserved for the kernel, ~0.5GB is in use by other processes, and ~1GB is in use by Python, only ~2.6GB remain for Arrow.

 

2) Confirm how much RAM your system is currently allowing to be allocated.

Compile and run the following simple program...


#include <stdio.h>
#include <stdlib.h>

int MBYTE = 1024 * 1024;

int main(void) {
  int mbytes_allocated = 0;
  /* Room for one pointer per allocated megabyte. */
  int **pointers = malloc(MBYTE * sizeof(int *));
  while (mbytes_allocated < MBYTE) {
    int *pointer = malloc(MBYTE);
    if (pointer == 0) {
      /* Allocation failed: release everything allocated so far. */
      for (int i = 0; i < mbytes_allocated; i++) {
        free(pointers[i]);
      }
      break;
    }
    pointers[mbytes_allocated] = pointer;
    mbytes_allocated++;
  }
  printf("Allocated %d megabytes before failing\n", mbytes_allocated);
}

 (base) [centos@ip-172-30-0-34 ~]$ gcc -o allocator allocator.c
 (base) [centos@ip-172-30-0-34 ~]$ ./allocator
 Allocated 3395 megabytes before failing

This matches pretty closely with what we were seeing in python.

As for a fix, in the new datasets API (available starting in 1.1 but more fully in 2.0) you can use scan, which will allow you to convert to pandas incrementally and should have similar RAM usage to the first approach you had. You may also want to tune your OS, as you are reserving quite a bit for the kernel and that might be too much. vm.overcommit_ratio defaults to 0.5 and that is often too aggressive for systems with large amounts of RAM.
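As a rough illustration of that incremental approach, here is a sketch against the newer pyarrow.dataset API (method names have shifted between releases, so treat this as the general shape rather than the exact 1.1/2.0 spelling):

import pandas as pd
import pyarrow.dataset as ds

fnames = [f'{i}.parquet' for i in range(5000)]
dataset = ds.dataset(fnames, format='parquet')

# Convert record batches one at a time so Arrow only materializes one batch at
# once; the pandas frames then accumulate, much like the read_works approach.
frames = [batch.to_pandas() for batch in dataset.to_batches()]
df = pd.concat(frames)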

 

asfimport commented 3 years ago

Ashish Gupta: If the system memory limit is the issue, would it have worked on the same machine with the older version of pyarrow? My code was working perfectly fine with 0.13.0. Regarding the tests you asked for...

1) Confirm how much RAM is actually in use by python / pyarrow

read_errors crashes with a core dump, so I am not able to use the try/except block.


python test.py
terminate called after throwing an instance of 'std::bad_alloc'
 what(): std::bad_alloc
Aborted (core dumped)

 

However, I observed in top that the memory usage was around 5GB when it crashed.

 

2) ./allocator
Allocated 110641 megabytes before failing

 

With read_works I checked that the max memory required for my example is 15GB. So given that 110GB is available and nothing else is running, it doesn't make sense.


asfimport commented 3 years ago

Weston Pace / @westonpace:

If the system memory limit is the issue, would it have worked on the same machine with the older version of pyarrow? My code was working perfectly fine with 0.13.0.

I understand, my theory was that this new version was using more RAM which was causing the issue.  Right now, I would like to narrow down the problem between something on the system limiting your allocation and some bug in pyarrow causing a large spike in allocation and pushing it over the limit.

So I think it is important to know exactly how much RAM the process was using when it failed (for example, if it is exactly or very close to 4GB then that gives us a potential limit to look for; if there is some loop getting stuck and allocating memory really quickly then we'd see 110GB, and it might not show in top because it happens so quickly).

It sounds like your process crashes in a couple of different ways.  If you get an OSError then you should be able to catch it with the python code I shared.  If you are now consistently getting std::bad_alloc then you can still catch it using gdb.  Unfortunately, gdb won't catch the OSError so it might be a bit of trial and error.  It also sounds like I am not quite reproducing the same behavior you are seeing.

I will continue to look into possibilities after the holiday.  In the meantime, if you are able to figure out exactly how much RAM the process is using when it crashes it could be helpful.

asfimport commented 3 years ago

Weston Pace / @westonpace: Ok, I think I've really tracked it down now. It appears the root cause is a combination of jemalloc, aggressive muzzy decay, and disabling overcommit. Jemalloc is tracking the issue at https://github.com/jemalloc/jemalloc/issues/1328.

Jemalloc ends up creating very many small mmaps and eventually the vm.max_map_count limit is reached. You can see the value of this limit here:

 


[centos@ip-172-30-0-42 ~]$ sysctl vm.max_map_count
vm.max_map_count = 65530

To confirm this is indeed your issue you will need to pause it at crash time (either using gdb or python's except/input as discussed above) and count the number of maps...

 


[centos@ip-172-30-0-42 ~]$ cat /proc/1829/maps | wc -l
65532

(note, this approach will give an approximation of the # of maps and not the exact count, but it shouldn't be anywhere close to the limit under normal operation).
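If attaching gdb is awkward, the same count can be taken from inside the paused process; a minimal sketch (Linux-only):

import os

# Count this process's memory mappings and compare against vm.max_map_count
# (65530 by default on this system).
with open(f'/proc/{os.getpid()}/maps') as f:
    num_maps = sum(1 for _ in f)
print('current map count:', num_maps)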


The preferred workaround per the jemalloc issue is to enable overcommit. You can configure the system to prioritize killing the process using Arrow if the OOM killer is too unpredictable.

 

If overcommit must be disabled, for whatever reason, then you could always compile arrow without jemalloc.

 

Finally, in the issue listed above there is some configuration that is suggested, basically reducing the rate at which jemalloc returns pages to the OS.  This jemalloc configuration would not be universally applicable however so it doesn't make sense for Arrow to change these defaults.  These settings are also not configurable at the moment so this option isn't really possible given the current code.

asfimport commented 3 years ago

Weston Pace / @westonpace: Also, this behavior was introduced between 0.13.0 and 1.0.1 because arrow changed how it configures jemalloc so that pages are returned to the OS more aggressively.  For more details see ARROW-6994 (which was fixed in Arrow 0.16.0).

asfimport commented 3 years ago

Ashish Gupta: Looks like that's the issue. Thanks!


cat /proc/377933/maps | wc -l
65532

I am able to pause read_errors with use_legacy_dataset=True (throws an OSError), while use_legacy_dataset=False crashes.

So if overcommit is enabled vm.max_map_count will not increase that quickly?

Why is this not an issue on Windows 10? I read somewhere that Windows 10 doesn't overcommit memory.

 

asfimport commented 3 years ago

Antoine Pitrou / @pitrou:

Why is this not an issue on Windows 10? I read somewhere that Windows 10 doesn't overcommit memory.

IIRC, jemalloc isn't enabled on Windows, so we either use mimalloc or the system allocator there.

So if overcommit is enabled vm.max_map_count will not increase that quickly?

If I'm reading correctly, yes.

Note that Arrow can use other memory allocators, in two different ways:

1) you can disable jemalloc at compile-time by passing -DARROW_JEMALLOC=OFF to CMake

2) you can pass custom memory pools at runtime: see https://arrow.apache.org/docs/cpp/api/memory.html?highlight=mimalloc#memory-pools
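For option 2, no C++ is needed from Python; a minimal sketch of switching the default pool at runtime (which pools are actually available depends on how the wheel was built):

import pyarrow as pa

# Route Arrow allocations through the plain system allocator instead of the
# bundled jemalloc; pa.mimalloc_memory_pool() works the same way when the
# build includes mimalloc.
pa.set_memory_pool(pa.system_memory_pool())
print(pa.default_memory_pool().backend_name)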

Related: ARROW-11009.

asfimport commented 3 years ago

Ashish Gupta: I will wait for ARROW-11009 to be resolved so I can switch memory allocators from an environment variable.

My knowledge of C++ is very limited; is there anything I can change in the pyarrow installation or at runtime to use another memory allocator?

This is great analysis, a lot of learning. Thanks.

asfimport commented 3 years ago

Weston Pace / @westonpace: You could try adding...

pa.jemalloc_set_decay_ms(10000)

to the top of your script. I don't have time to test it today, and I don't know whether this would be a complete fix or just postpone the issue, but it should move the behavior back closer to what it was in 0.13.0.

asfimport commented 3 years ago

Ashish Gupta: Tried, didn't work.

asfimport commented 3 years ago

Weston Pace / @westonpace: Now that ARROW-11049 is finished I tried this out with the latest from master. I found that both the system memory allocator and the jemalloc memory allocator (the default) encountered this problem with the mmap limit.

However, the mimalloc allocator does not encounter this issue.  This means that you will need to install a version of pyarrow that has mimalloc enabled and you will need to add this to the top of your program, preferably before you do anything with pyarrow.


import pyarrow as pa

pa.set_memory_pool(pa.mimalloc_memory_pool())

asfimport commented 3 years ago

Antoine Pitrou / @pitrou: Ok, there's probably no need to keep this issue open. In ARROW-11228 we'll add more tuning possibilities for the cases where our default jemalloc configuration isn't adequate.