Multiprocessing slows defect input writing on M1 Mac

utf commented 8 months ago

The default multiprocessing options actually slow down input set writing on my M1 Max Mac.

Setup

from pymatgen.core import Structure
from doped.generation import DefectsGenerator
from doped.vasp import DefectsSet

prim = Structure.from_file("prim.vasp")
defect_gen = DefectsGenerator(prim)
defect_set = DefectsSet(defect_gen, soc=False)

Writing with multiprocessing

defect_set.write_files("defects", unperturbed_poscar=True)

Time taken: 5 minutes 1 second

Writing without multiprocessing

# same as before...
defect_set.write_files("defects", unperturbed_poscar=True, processes=1)

Time taken: 14 seconds

Fix

Using multiple threads instead of multiple processes seems to fix the issue.

# ThreadPool has the same interface as multiprocessing.Pool, but is backed by threads
from multiprocessing.pool import ThreadPool as Pool

Time taken: 13 seconds

However, this is not a substantial improvement over running with no multiprocessing (13 seconds versus 14 seconds).
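For anyone wanting to reproduce the thread-versus-serial comparison without doped installed, here is a minimal sketch. It assumes the work is dominated by small file writes, and `write_one` and `write_all` are hypothetical stand-ins, not doped functions. Only the `processes` argument name is borrowed from `write_files`.

```python
import tempfile
import time
from multiprocessing.pool import ThreadPool
from pathlib import Path


def write_one(path: Path) -> int:
    """Hypothetical stand-in for writing one defect folder's input file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    return path.write_text("placeholder input\n")


def write_all(paths, processes=1):
    """Write every path serially if processes == 1, else with a thread pool."""
    start = time.perf_counter()
    if processes == 1:
        results = [write_one(p) for p in paths]
    else:
        with ThreadPool(processes) as pool:
            results = pool.map(write_one, paths)
    return len(results), time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    serial_paths = [root / "serial" / f"defect_{i}" / "POSCAR" for i in range(50)]
    thread_paths = [root / "threads" / f"defect_{i}" / "POSCAR" for i in range(50)]
    n_serial, t_serial = write_all(serial_paths, processes=1)
    n_threads, t_threads = write_all(thread_paths, processes=4)
    print(f"serial:  {n_serial} files in {t_serial:.3f} s")
    print(f"threads: {n_threads} files in {t_threads:.3f} s")
```

The timings from a toy like this will depend on the machine and filesystem, so it only shows how to set up the comparison and is not evidence for either approach.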

kavanase commented 8 months ago

Thanks for flagging this @utf (and producing a solution 🙌 )! I'll run some checks to see how dependent the (fixed) multiprocessing speedup is on the size of DefectsSet, and update accordingly (e.g. for defects generation, testing showed a tradeoff depending on the number of inequivalent sites etc).

kavanase commented 7 months ago

Hi @utf! I've had time to properly look at this now. To be honest, I'm not totally sure what's going on or how it was that slow on your setup. In the test cases I'd been looking at before, multiprocessing did give a decent speedup in certain cases (e.g. from around 40-50 seconds to 10-20 seconds for CdTe and LiMn1.5Ni0.5O4). I was running on a 2021 M1 Pro MacBook with 32 GB RAM. It's possibly a memory issue that I'm not suffering from, but I would've guessed your laptop is more powerful than mine.

After doing some profiling, I've now updated the POTCAR generation (5a07c80) to use caching, which greatly speeds up the file generation/writing steps. With the reduced computational load from this, I find multiprocessing gives less of a speedup than before, but it's still faster than without multiprocessing, especially as the number of folders to write grows (at least once it's above ~30 defect folders). Also, on my system, it's still a good bit faster with Pool rather than ThreadPool (and ThreadPool is actually a bit slower than with no multiprocessing).
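To illustrate the general idea of caching a repeated, expensive lookup, here is a minimal sketch using `functools.lru_cache`. It is not the implementation in 5a07c80. `load_potential` and `build_inputs` are hypothetical names, and the assumption is simply that many defect folders need the same per-element data.

```python
from functools import lru_cache

call_count = 0  # counts how often the expensive step actually runs


@lru_cache(maxsize=None)
def load_potential(symbol: str) -> str:
    """Hypothetical stand-in for an expensive per-element lookup."""
    global call_count
    call_count += 1
    return f"potential data for {symbol}"


def build_inputs(defect_elements):
    """Assemble per-folder data, reusing cached lookups across folders."""
    return [[load_potential(el) for el in elements] for elements in defect_elements]


# Three hypothetical defect folders that share the same two elements.
folders = [("Cd", "Te"), ("Cd", "Te"), ("Cd", "Te")]
inputs = build_inputs(folders)
print(call_count)  # 2: one real lookup per distinct element
```

With the per-folder cost reduced like this, there is less work left for a process pool to parallelise, which would be consistent with the smaller multiprocessing speedup described above.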

E.g. with CdTe (all auto-generated intrinsic defects), it's around 7-8 seconds with the auto multiprocessing setup, and around 10 seconds with ThreadPool or with no multiprocessing.

With LiMn1.5Ni0.5O4, it's around 50-60 seconds with ThreadPool or no multiprocessing, and ~24 seconds with the default Pool multiprocessing.

If you had time at some point, would you mind trying this with the current develop branch of doped (just running this code, with and without processes=1 to compare the timings) to see if whatever the issue was is fixed, or if it's a system-dependent thing that needs to be accounted for? Thanks! 🙌

from pymatgen.core import Structure
from doped.generation import DefectsGenerator
from doped.vasp import DefectsSet

cdte = Structure.from_file("CdTe_POSCAR")
defect_gen = DefectsGenerator(cdte)
ds = DefectsSet(defect_gen)
ds.write_files("test_pop")

POSCARs.zip
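The with/without `processes=1` comparison requested above could be timed with a small helper along these lines. This is only a sketch: `time_call` and `compare` are hypothetical helpers, and `fake_write_files` is a stand-in so the snippet runs without doped installed. In practice the function passed in would be something like `ds.write_files`.

```python
import time


def time_call(func, *args, **kwargs):
    """Return (result, elapsed seconds) for a single call to func."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def compare(func, *args, **kwargs):
    """Time func with its default settings and again with processes=1."""
    _, t_default = time_call(func, *args, **kwargs)
    _, t_single = time_call(func, *args, processes=1, **kwargs)
    return {"default": t_default, "processes=1": t_single}


# Hypothetical stand-in for a method that accepts a `processes` keyword.
def fake_write_files(folder, processes=None):
    time.sleep(0.01)
    return folder


timings = compare(fake_write_files, "test_pop")
for label, seconds in timings.items():
    print(f"{label}: {seconds:.2f} s")
```

For a real run it is probably cleaner to call `time_call` twice with separate output folders, so the second run is not writing over files left by the first.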

kavanase commented 7 months ago

Makes sense that ThreadPool would be quicker if file I/O (rather than CPU) was the limiting factor, so not sure if that gives any hints about the differences in behaviour here.

utf commented 7 months ago

Hi @kavanase, I just tested the code in the develop branch and it brought the time down to 8 seconds. Not sure what the issue was before but it seems to be fixed (could have also been some weird issue on my end).

kavanase commented 7 months ago

Ok sick, thanks for checking Alex!

utf commented 7 months ago

Thanks so much for digging into this.

kavanase commented 7 months ago

No problemo, thanks for raising!

SMTG-Bham / doped