Open i-am-sijia opened 2 weeks ago
Is it possible we are running out of disk space on this machine? I'm not sure Windows is recognizing that it can hold all this data in RAM; it may be spilling some of it to disk.
e.g. here is someone who thought they could hold everything in RAM, but couldn't... https://stackoverflow.com/questions/43573500/no-space-left-while-using-multiprocessing-array-in-shared-memory
I wonder if the version of Windows matters here... I will investigate some more.
I don't think it's a disk space issue. The system drive is 800G SSD, with >600G free, the storage drive is 18T SSD RAID with 16.4T free.
OS is Windows Server 2022 Standard
I wouldn't rule out that it is a disk space thing, or a file I/O thing of some kind. Many of the Stack Overflow issues that reference [WinError 1450] Insufficient system resources appear to be disk space or file handle problems.
The MP system used in ActivitySim is very disk-hungry, as copies of big tables get written out to disk for each MP process to use at the start of each multiprocess_steps group, then coalesced back into big tables again at the end of the MP group.
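One cheap way to rule disk space in or out is to log the free space on the output drive just before and after each MP group. Below is a minimal sketch using only the Python standard library. The path "." and the 100 GiB threshold are placeholders I made up for illustration; this is not something ActivitySim does itself.

```python
import shutil


def free_space_gib(path="."):
    """Return the free space, in GiB, on the drive that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / 1024**3


def has_headroom(path=".", required_gib=100.0):
    """Return True if the drive holding `path` has at least `required_gib` free.

    The threshold is an assumption; pick one based on the size of the
    tables written out for each MP process.
    """
    return free_space_gib(path) >= required_gib


if __name__ == "__main__":
    # "." stands in for the model's output directory (hypothetical).
    print(f"free space: {free_space_gib('.'):.1f} GiB")
    print(f"enough headroom: {has_headroom('.', 100.0)}")
```

If the free space stays flat and large through the crash, that would be decent evidence against the disk space theory.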
The error appears to be triggered when accessing the skims, at the beginning of one of the MP groups. It's not clear to me why the error would trigger when opening the mmap for skims access, unless the system resource we are exceeding is the number of open file handles. I thought that limit was large on modern Windows, but maybe not?
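To separate "mmap itself fails on this machine" from "something in the MP startup fails", it might help to probe the mapping outside ActivitySim. This is a rough standard-library sketch, not how ActivitySim or sharrow opens the skims. The function name `try_mmap` is hypothetical, and the temporary file stands in for the real skim file.

```python
import mmap
import os
import tempfile


def try_mmap(path):
    """Try to memory-map `path` read-only.

    Returns (ok, error_code, message). On Windows the code is the WinError
    number (e.g. 1450); elsewhere it falls back to errno.
    Note: an empty file raises ValueError, which is not handled here.
    """
    try:
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                _ = m[:1]  # touch the first byte to confirm the mapping is readable
        return True, None, ""
    except OSError as exc:
        # `winerror` only exists on Windows, so read it defensively.
        code = getattr(exc, "winerror", None) or exc.errno
        return False, code, str(exc)


if __name__ == "__main__":
    # A small temporary file stands in for the skim file (hypothetical).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\x00" * 4096)
    print(try_mmap(tmp.name))
    os.remove(tmp.name)
```

Running this against the actual skim file from several processes at once, while nothing else is running, would show whether the mapping alone can reproduce error 1450.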
It is probably worth attempting to run multiprocess on the SFCTA machine with sharrow off, to see if that works or crashes in a similar way.
I ran a 10% household sample, Sharrow, multiprocessing run with 28 processors on Chavez. It crashed with the same WinError 1450 error as the 100% sample multiprocessing run.
I noticed there's a CHAMP CUBE window open with 94 Cube Cluster scripts waiting in the background. I am thinking these cluster scripts might be hogging the machine’s processors and therefore the ActivitySim multiprocessing run is failing.
I am going to test the following:
- Multiprocessing with sharrow turned OFF, 10% sample: ran successfully.
- Closed all the CHAMP windows and scripts, then multiprocessing with sharrow turned ON, 10% sample: still failed with the same error.
Running SANDAG on SFCTA's 1 TB RAM, 80 core, Intel Xeon 2.29 GHz server
multiprocess: True
num_processors: 40
sharrow: require
Also reran with
num_processors: 28
and got the same error.
Logs: mp_households_21-activitysim.log, activitysim.log