apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Consider splitting _lib module into several parts #40166

Open pitrou opened 7 months ago

pitrou commented 7 months ago

Describe the enhancement requested

When reading the logs of a wheel build on Windows I noticed these lines:

  lib.cpp
C:\Python312\Lib\site-packages\numpy\_core\include\numpy\ndarraytypes.h(1250,38): warning C4200: nonstandard extension used: zero-sized array in struct/union [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]
C:\Python312\Lib\site-packages\numpy\_core\include\numpy\ndarraytypes.h(1250,38): message : This member will be ignored by a defaulted constructor or copy/move assignment operator [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]
C:\arrow\python\build\temp.win-amd64-cpython-312\lib.cpp(335145,17): warning C4244: '=': conversion from 'Py_ssize_t' to 'long', possible loss of data [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]
C:\arrow\python\build\temp.win-amd64-cpython-312\lib.cpp(335645,17): warning C4244: '=': conversion from 'Py_ssize_t' to 'long', possible loss of data [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]
C:\arrow\python\build\temp.win-amd64-cpython-312\lib.cpp(335855,17): warning C4244: '=': conversion from 'Py_ssize_t' to 'long', possible loss of data [C:\arrow\python\build\temp.win-amd64-cpython-312\lib.vcxproj]

Setting aside what the warnings actually say, what stands out is that the lib.cpp generated by Cython has at least 335,000 lines. This is huge and can obviously lead to enormous compile times, especially if there is not enough RAM for the C++ compiler to hold the entire intermediate representation(s) in memory.

We should definitely try to split the _lib module into smaller parts in order to alleviate this problem.

(it is also a Cython problem that so much C++ code is generated, but I'm not sure we can fix that).
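For illustration only, here is a minimal sketch of what such a split could look like with plain setuptools + Cython (pyarrow's real build goes through CMake, and every module name below other than pyarrow.lib is hypothetical): instead of textually including all the .pxi files into one lib.pyx, each part becomes its own extension module, so Cython emits several smaller .cpp files that the C++ compiler can handle independently and in parallel.

```python
# Hypothetical sketch of a split build; today the .pxi files are textually
# included into a single lib.pyx, which Cython turns into one ~335k-line
# lib.cpp. Building separate extension modules yields smaller translation
# units that can be compiled independently.
from setuptools import setup, Extension
from Cython.Build import cythonize

extensions = [
    Extension("pyarrow.lib", ["pyarrow/lib.pyx"], language="c++"),
    # Illustrative names only; the actual split is still to be decided.
    Extension("pyarrow._tensor", ["pyarrow/_tensor.pyx"], language="c++"),
    Extension("pyarrow._table", ["pyarrow/_table.pyx"], language="c++"),
]

setup(
    name="pyarrow",
    ext_modules=cythonize(
        extensions,
        nthreads=4,  # Cython can also translate the modules in parallel
        compiler_directives={"language_level": "3"},
    ),
)
```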

Component(s)

Python

pitrou commented 7 months ago

@AlenkaF @jorisvandenbossche @danepitkin I think this would be worth looking into, in the "quality of life" department.

pitrou commented 7 months ago

(I also posted on the Cython users ML: https://groups.google.com/g/cython-users/c/hr3cFevY46k)

pitrou commented 7 months ago

Note that our Windows wheel builds routinely take 3 hours, and it may very well be because of this: https://github.com/ursacomputing/crossbow/actions/runs/7970517386/job/21758260987

pitrou commented 7 months ago

This might also be related to the AppVeyor timeouts.

pitrou commented 7 months ago

Related: https://github.com/apache/arrow/pull/40225

jorisvandenbossche commented 7 months ago

When it comes to "quality of life" while developing pyarrow locally, I would personally prioritize improving our build system to have proper rebuilds (https://github.com/apache/arrow/issues/36411#issuecomment-1753704373), but of course I am also biased because I am not using Windows and don't see this issue locally. And improving build times for CI is definitely important as well.

The previous time we worked on splitting pyarrow.lib, I brought up the back-compat issue for people (c)importing from there, see https://github.com/apache/arrow/pull/10162#issuecomment-831829432 and the comments below. Of course we can decide to break that once in a release, but I would still prefer that we have a clearer story about how we recommend using pyarrow in those cases.
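To make the back-compat concern concrete, the snippet below shows roughly the kind of downstream Cython code (adapted from the pattern in pyarrow's Cython usage documentation) that cimports from pyarrow.lib and would break if those declarations moved without some re-export shim:

```cython
# Downstream user extension (sketch): relies on the declarations that
# pyarrow/lib.pxd currently exposes for cimport.
from pyarrow.lib cimport *

def get_array_length(obj):
    # Unwrap a pyarrow.Array into the underlying C++ shared_ptr<CArray>.
    cdef shared_ptr[CArray] arr = pyarrow_unwrap_array(obj)
    if arr.get() == NULL:
        raise TypeError("not an array")
    return arr.get().length()
```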

There might also be some smaller things we could already split off that are less controversial / less publicly used (for example benchmark.pxi, although that is only a tiny one-function file and won't help much; a bigger one might be tensor.pxi).

jorisvandenbossche commented 7 months ago

We should maybe also experiment with ways to do this in a less breaking way. For example, can we still include things in lib.pxd so that cimport keeps working, while moving the actual implementations out of pyarrow.lib? In pure Python something like that certainly works, but I don't know offhand how Cython would deal with it.
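As a pure-Python analogue of that idea (the pyarrow._tensor module name is hypothetical, and whether Cython's cimport machinery supports the same trick through lib.pxd is exactly the open question above), lib could become a thin re-export layer:

```python
# pyarrow/lib.py (sketch): the implementation moves to a hypothetical
# pyarrow._tensor module, and lib merely re-exports the public name so that
# existing ``from pyarrow.lib import Tensor`` statements keep working.
from pyarrow._tensor import Tensor  # noqa: F401
```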

pitrou commented 7 months ago

> When it comes to "quality of life" while developing pyarrow locally, I would personally prioritize improving our build system to have proper rebuilds (#36411 (comment)),

I think we should do both :-)

pitrou commented 6 months ago

Related to quality of life on incremental builds: https://github.com/cython/cython/issues/6070