The best way to install `cython-blis` is to compile it from source, so that the build can take advantage of the machine architecture. On our HPC cluster, I end up re-installing `cython-blis` on each node executor at the start of each job to make sure I'm using optimized code, but this takes a bit of time.
Given that BLIS has a lot of source files, the build process can be parallelized easily. I changed the logic of `ExtensionBuilder.compile_objects` so that it invokes the compiler to build objects in parallel with a `ThreadPool`. The job count comes from the `parallel` flag of the command line (which is a default `build_ext` option) or from the `MAX_JOBS` environment variable (similar to what `torch` and `flash-attn` are doing).
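For reviewers who want the idea without reading the diff, here is a minimal sketch of the approach, not the actual code of this PR. The helper names (`resolve_jobs`, `compile_one`, `compile_objects` as a free function) and the `(argv, target)` task shape are assumptions made for illustration. Threads are enough here because each worker just waits on a compiler subprocess.

```python
import os
import subprocess
from multiprocessing.pool import ThreadPool


def resolve_jobs(parallel=None, environ=os.environ):
    """Pick the job count: explicit flag first, then MAX_JOBS, else 1."""
    if parallel:
        return max(1, int(parallel))
    value = environ.get("MAX_JOBS", "")
    return max(1, int(value)) if value.isdigit() else 1


def compile_one(task):
    """Run one compiler invocation; `task` is (argv list, output path)."""
    argv, target = task
    subprocess.check_call(argv)  # raises if the compiler fails
    return target


def compile_objects(tasks, jobs=1, run=compile_one):
    """Build every object: serially when jobs <= 1, else on a thread pool."""
    if jobs <= 1:
        return [run(task) for task in tasks]
    with ThreadPool(jobs) as pool:
        # map() preserves input order, so the link step sees a stable list.
        return pool.map(run, tasks)
```

With this shape, something like `compile_objects(tasks, jobs=resolve_jobs(parallel))` keeps the serial path untouched unless a job count is explicitly requested.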
By default, I left the job count at 1, so parallel compilation happens only when explicitly enabled. Using 4 threads, the compilation is about twice as fast: