Open pitrou opened 7 months ago
@AlenkaF @jorisvandenbossche @danepitkin I think this would be worth looking into, in the "quality of life" department.
(I also posted on the Cython users ML: https://groups.google.com/g/cython-users/c/hr3cFevY46k)
Note that our Windows wheel builds routinely take 3 hours, and it may very well be because of this: https://github.com/ursacomputing/crossbow/actions/runs/7970517386/job/21758260987
This might also be related to the AppVeyor timeouts.
When it comes to "quality of life" while developing pyarrow locally, I would personally prioritize improving our build system to have proper rebuilds (https://github.com/apache/arrow/issues/36411#issuecomment-1753704373), but of course I am also biased, because I don't use Windows and don't see this issue locally. And improving build times for CI is definitely important as well.
The previous time we worked on splitting pyarrow.lib, I brought up the back-compat issue for people (c)importing from there, see https://github.com/apache/arrow/pull/10162#issuecomment-831829432 and the comments below.
Of course we can decide to break that once in a release, but I would still prefer that we have a clearer story about how we recommend using pyarrow in those cases.
There might also be some smaller things we could already split off that are less controversial / less publicly used (for example benchmark.pxi, although this is only a tiny one-function file and won't help much. A bigger one might be tensor.pxi).
We should maybe also experiment with ways to do this in a less breaking way. For example, can we still include things in lib.pxd so cimport keeps working, while moving the actual implementations out of pyarrow.lib? In pure Python something like that certainly works, but I don't know by heart how Cython would deal with that.
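To illustrate only the pure Python half of that idea, here is a minimal sketch of an umbrella module that keeps re-exporting a name after its implementation has moved. The names mypkg, mypkg.lib, mypkg._tensor and Tensor are hypothetical stand-ins, not pyarrow's actual layout, and the modules are built in memory just so the snippet runs on its own. It says nothing about how cimport and .pxd files would behave.

```python
import sys
import types

# Hypothetical split-out module that now owns the implementation.
tensor_mod = types.ModuleType("mypkg._tensor")
exec(
    "class Tensor:\n"
    "    def __init__(self, data):\n"
    "        self.data = data\n",
    tensor_mod.__dict__,
)

# Hypothetical umbrella module: it no longer defines Tensor itself,
# it only re-exports the name so existing imports keep working.
lib_mod = types.ModuleType("mypkg.lib")
lib_mod.Tensor = tensor_mod.Tensor

# Register an in-memory package so the import statements below resolve.
pkg = types.ModuleType("mypkg")
pkg.__path__ = []  # an empty __path__ marks the module as a package
pkg.lib = lib_mod
pkg._tensor = tensor_mod
sys.modules.update(
    {"mypkg": pkg, "mypkg.lib": lib_mod, "mypkg._tensor": tensor_mod}
)

# Old import path (what downstream code already uses).
from mypkg.lib import Tensor as OldPathTensor
# New import path (where the implementation actually lives).
from mypkg._tensor import Tensor as NewPathTensor

same_object = OldPathTensor is NewPathTensor
```

Both paths resolve to the same class object, so isinstance checks and pickling by module name keep pointing at a single definition. Whether Cython offers an equivalent for cimport is exactly the open question above.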
> When it comes to "quality of life" while developing pyarrow locally, I would personally prioritize improving our build system to have proper rebuilds (#36411 (comment))
I think we should do both :-)
Related to quality of life on incremental builds: https://github.com/cython/cython/issues/6070
Describe the enhancement requested
When reading the logs of a wheel build on Windows I noticed these lines:
Ignoring what the warnings say, what stands out is that the lib.cpp generated by Cython has at least 335000 lines. This is huge and can obviously lead to enormous compile times, especially if the RAM is not large enough for the C++ compiler to hold the entire intermediate representation(s) in memory.

We should definitely try to split the _lib into smaller parts, in order to alleviate this problem.

(It is also a Cython problem that so much C++ code is generated, but I'm not sure we can fix that.)
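As a rough way to see which generated sources dominate, here is a small sketch that counts lines in the .cpp files under a build directory and lists the largest ones. The helper names and the build directory passed in are assumptions for illustration. This is not part of pyarrow's build tooling.

```python
from pathlib import Path


def count_lines(path):
    """Count lines in a file, reading bytes so odd encodings don't matter."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def largest_generated_sources(build_dir, pattern="*.cpp", top=5):
    """Return up to `top` (line_count, path) pairs, largest first."""
    sizes = [(count_lines(p), str(p)) for p in Path(build_dir).rglob(pattern)]
    return sorted(sizes, reverse=True)[:top]


def report(build_dir):
    """Print the biggest generated C++ files under a (hypothetical) build dir."""
    for n_lines, path in largest_generated_sources(build_dir):
        print(f"{n_lines:>10}  {path}")
```

Calling report() on the directory where Cython writes its output should show how lib.cpp compares with the other extension modules, which could help decide what to split off first.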
Component(s)
Python