ActivitySim / sandag-abm3-example

BSD 3-Clause "New" or "Revised" License

Multiprocess crashed on SFCTA machine with `Insufficient system resources` error #23

Open i-am-sijia opened 2 weeks ago

i-am-sijia commented 2 weeks ago

Running the SANDAG model on SFCTA's server (1 TB RAM, 80 cores, Intel Xeon 2.29 GHz), the multiprocess run crashed with the `Insufficient system resources` error shown below.

I also reran with `num_processors: 28` and got the same error.

Attached logs: mp_households_21-activitysim.log, activitysim.log

12/06/2024 15:23:34 - INFO - sharrow.shared_memory - read_shared_list:shared_memory_taz_None
12/06/2024 15:23:34 - INFO - sharrow.shared_memory - open_shared_memory_array:shared_memory_taz_None
12/06/2024 15:23:34 - NOTIFY - activitysim.core.workflow.runner -  time to execute run.av_ownership UNTIL ERROR : 1.062 seconds
12/06/2024 15:23:34 - WARNING - activitysim.core.mp_tasks - OSError exception running av_ownership model: [WinError 1450] Insufficient system resources exist to complete the requested service
12/06/2024 15:23:34 - ERROR - activitysim.core.mp_tasks - mp_tasks - mp_households_21 - OSError exception caught in mp_run_simulation: [WinError 1450] Insufficient system resources exist to complete the requested service
12/06/2024 15:23:34 - ERROR - activitysim.core.mp_tasks - 
---
Traceback (most recent call last):
  File "D:\activitysim\GitHub\activitysim\activitysim\core\mp_tasks.py", line 1097, in mp_run_simulation
    run_simulation(state, queue, step_info, resume_after, shared_data_buffer)
  File "D:\activitysim\GitHub\activitysim\activitysim\core\mp_tasks.py", line 1047, in run_simulation
    raise e
  File "D:\activitysim\GitHub\activitysim\activitysim\core\mp_tasks.py", line 1042, in run_simulation
    state.run.by_name(model)
  File "D:\activitysim\GitHub\activitysim\activitysim\core\workflow\runner.py", line 347, in by_name
    self._obj._context = run_named_step(
  File "D:\activitysim\GitHub\activitysim\activitysim\core\workflow\steps.py", line 83, in run_named_step
    step_func(context, **kwargs)
  File "D:\activitysim\GitHub\activitysim\activitysim\core\workflow\steps.py", line 367, in run_step
    outcome = error_logging(wrapped_func)(state, *args, **kwargs)
  File "D:\activitysim\GitHub\activitysim\activitysim\core\workflow\steps.py", line 46, in wrapper
    return func(*args, **kwargs)
  File "D:\activitysim\GitHub\sandag-abm3-example\.\extensions\av_ownership.py", line 85, in av_ownership
    expressions.assign_columns(
  File "D:\activitysim\GitHub\activitysim\activitysim\core\expressions.py", line 172, in assign_columns
    results = compute_columns(state, df, model_settings, locals_dict, trace_label)
  File "D:\activitysim\GitHub\activitysim\activitysim\core\expressions.py", line 130, in compute_columns
    _locals_dict["skim_dict"] = state.get_injectable("skim_dataset_dict")
  File "D:\activitysim\GitHub\activitysim\activitysim\core\workflow\state.py", line 794, in get
    result = self._LOADABLE_OBJECTS[key](self._context)
  File "D:\activitysim\GitHub\activitysim\activitysim\core\workflow\steps.py", line 300, in run_step
    arg_value = state._LOADABLE_OBJECTS[arg](context)
  File "D:\activitysim\GitHub\activitysim\activitysim\core\workflow\steps.py", line 367, in run_step
    outcome = error_logging(wrapped_func)(state, *args, **kwargs)
  File "D:\activitysim\GitHub\activitysim\activitysim\core\workflow\steps.py", line 46, in wrapper
    return func(*args, **kwargs)
  File "D:\activitysim\GitHub\activitysim\activitysim\core\skim_dataset.py", line 895, in skim_dataset
    return load_skim_dataset_to_shared_memory(state)
  File "D:\activitysim\GitHub\activitysim\activitysim\core\skim_dataset.py", line 750, in load_skim_dataset_to_shared_memory
    d = _use_existing_backing_if_valid(backing, omx_file_paths, skim_tag)
  File "D:\activitysim\GitHub\activitysim\activitysim\core\skim_dataset.py", line 465, in _use_existing_backing_if_valid
    out = sh.Dataset.shm.from_shared_memory(backing, mode="r")
  File "D:\activitysim\GitHub\sharrow\sharrow\shared_memory.py", line 507, in from_shared_memory
    mem = open_shared_memory_array(key, mode=mode)
  File "D:\activitysim\GitHub\sharrow\sharrow\shared_memory.py", line 143, in open_shared_memory_array
    result = SharedMemory(
  File "D:\activitysim\GitHub\asim_env\asim-consortium\lib\multiprocessing\shared_memory.py", line 180, in __init__
    self._mmap = mmap.mmap(-1, size, tagname=name)
OSError: [WinError 1450] Insufficient system resources exist to complete the requested service
---
jpn-- commented 2 weeks ago

Is it possible we are running out of disk space on this machine? I'm not totally sure Windows is correctly recognizing that it can hold all of this data in RAM; it may be spilling some of it to disk.

e.g. here is someone who thought they could hold everything in RAM, but couldn't... https://stackoverflow.com/questions/43573500/no-space-left-while-using-multiprocessing-array-in-shared-memory
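
One quick sanity check could be to dump the machine's physical-memory and commit-charge numbers just before the failing step; one hypothesis is that the commit limit (RAM plus pagefile) rather than physical RAM is what's being exhausted. A rough, Windows-only sketch using ctypes (not part of ActivitySim):

```python
# rough sketch: query Windows for physical RAM and commit (RAM + pagefile) limits
import ctypes

class MEMORYSTATUSEX(ctypes.Structure):
    _fields_ = [
        ("dwLength", ctypes.c_ulong),
        ("dwMemoryLoad", ctypes.c_ulong),
        ("ullTotalPhys", ctypes.c_ulonglong),
        ("ullAvailPhys", ctypes.c_ulonglong),
        ("ullTotalPageFile", ctypes.c_ulonglong),
        ("ullAvailPageFile", ctypes.c_ulonglong),
        ("ullTotalVirtual", ctypes.c_ulonglong),
        ("ullAvailVirtual", ctypes.c_ulonglong),
        ("ullAvailExtendedVirtual", ctypes.c_ulonglong),
    ]

stat = MEMORYSTATUSEX()
stat.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(stat))
gib = 2**30
print(f"physical RAM:  {stat.ullAvailPhys / gib:,.1f} GiB free of {stat.ullTotalPhys / gib:,.1f} GiB")
print(f"commit charge: {stat.ullAvailPageFile / gib:,.1f} GiB free of {stat.ullTotalPageFile / gib:,.1f} GiB")
```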

I wonder if the version of Windows matters here... I will investigate some more.

joecastiglione commented 2 weeks ago

I don't think it's a disk space issue. The system drive is an 800 GB SSD with >600 GB free, and the storage drive is an 18 TB SSD RAID with 16.4 TB free.

OS is Windows Server 2022 Standard

jpn-- commented 2 weeks ago

I wouldn't rule out that it is a disk space or file I/O issue of some kind. Many of the Stack Overflow questions that reference `[WinError 1450] Insufficient system resources` appear to be disk space or file handle problems.

The MP system used in ActivitySim is very disk-hungry: copies of the big tables get written out to disk for each MP process to use at the start of each `multiprocess_steps` group, and then coalesced back into single tables at the end of the group.

The error appears to be triggered when accessing the skims, at the beginning of one of the MP groups. It's not clear to me why it would fire when opening the mmap for skim access, unless the system resource we are exceeding is the number of open file handles, which I thought was a large number on modern Windows, but maybe not.
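
One way to isolate this might be to exercise the same shared-memory call path outside ActivitySim, with a block roughly the size of the skims. A rough, untested sketch (the 60 GiB size and the block name are placeholders, not what sharrow actually uses):

```python
# rough sketch: create a large named shared-memory block and re-attach to it by
# name, which goes through the same mmap.mmap(-1, size, tagname=name) call that
# fails in the traceback above; the size and name here are placeholders
from multiprocessing import shared_memory

size = 60 * 2**30  # hypothetical ~60 GiB, roughly a full-size skim footprint

creator = shared_memory.SharedMemory(name="skim_repro_test", create=True, size=size)
try:
    # attaching by name is what each MP worker does when it opens the skims;
    # if WinError 1450 shows up here too, the limit is at the OS level, not in
    # ActivitySim or sharrow
    reader = shared_memory.SharedMemory(name="skim_repro_test", create=False)
    print(f"attached to {reader.size / 2**30:.1f} GiB of shared memory OK")
    reader.close()
finally:
    creator.close()
    creator.unlink()  # no-op on Windows, but harmless
```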

It is probably worth attempting to run multiprocess on the SFCTA machine with sharrow off, to see if that works or crashes in a similar way.

i-am-sijia commented 1 week ago

I ran a 10% household sample with sharrow and multiprocessing (28 processors) on Chavez. It crashed with the same `WinError 1450` as the 100% sample multiprocessing run.

I noticed there's a CHAMP CUBE window open with 94 Cube Cluster scripts waiting in the background. I am thinking these cluster scripts might be hogging the machine's processors, causing the ActivitySim multiprocessing run to fail.
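
A quick way to check that could be something like this (rough sketch, assuming psutil is available in the environment):

```python
# rough sketch (assumes psutil is installed): list the processes holding the
# most RAM, to check whether the waiting Cube Cluster scripts are actually
# consuming significant resources or just sitting idle
import psutil

procs = []
for p in psutil.process_iter(["name", "memory_info"]):
    mem = p.info["memory_info"]
    if mem is None:  # access denied for some system processes
        continue
    procs.append((mem.rss, p.info["name"] or "?"))

for rss, name in sorted(procs, reverse=True)[:15]:
    print(f"{rss / 2**30:7.2f} GiB  {name}")
```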

I am going to test the following:

  1. Run multiprocessing with sharrow turned OFF to confirm whether multiprocessing crashes without sharrow.
  2. Close all the CHAMP windows and scripts, then run multiprocessing again. @joecastiglione can you please check and let me know if the CHAMP model can be safely closed?

i-am-sijia commented 1 week ago

Multiprocessing with sharrow turned OFF on the 10% sample ran successfully. I then closed all the CHAMP windows and scripts, but multiprocessing with sharrow turned ON on the 10% sample still failed with the same error.