ActivitySim / activitysim

An Open Platform for Activity-Based Travel Modeling
https://activitysim.github.io
BSD 3-Clause "New" or "Revised" License

Minimum computer specifications to conduct a chunk-size training run #543

Open ray-ngo opened 2 years ago

ray-ngo commented 2 years ago

MWCOG staff encountered an insufficient resources error when running a chunk-size training run for our Gen3, Phase I, Model on a server with 128 GB of RAM and 12 physical cores. Here are our questions:

  1. How do we know whether the chunk-size training run failed because of insufficient RAM or because of too few processors? The log file does not answer this question.
  2. What are the minimum requirements, such as RAM and number of processors, for a computer to complete a chunk-size training run?

jpn-- commented 2 years ago

Your chunk training attempt failed due to a lack of RAM. There is no minimum number of processors required for ActivitySim (if you are patient). But you do need enough RAM to load the skims plus at least a little extra to work in. How much that is depends on the number of zones, the number of modeled time periods, and the number of different tables represented. Thus, the minimum RAM needed is more a function of your model implementation and not something that is generalizable to ActivitySim at large.
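As a rough illustration of how skim memory scales with those factors, here is a minimal sketch. It assumes dense zone-by-zone matrices stored as 4-byte floats, and the zone and table counts are hypothetical rather than taken from any model in this thread.

```python
def estimate_skim_bytes(num_zones: int, num_tables: int, bytes_per_value: int = 4) -> int:
    """Rough RAM needed to hold dense zone-to-zone skims.

    num_tables counts every zone-by-zone matrix kept in memory, for example
    one per (measure, time period) combination.
    bytes_per_value=4 assumes float32 storage (an assumption, not a given).
    """
    return num_zones * num_zones * num_tables * bytes_per_value


# Hypothetical dimensions, chosen only for illustration.
zones = 5_000
tables = 150 * 5  # assumed: 150 measures x 5 time periods
estimate = estimate_skim_bytes(zones, tables)
print(f"{estimate / 1e9:.1f} GB for skims alone, before any working memory")
```

Because the zone count enters squared, doubling the number of zones roughly quadruples the skim footprint, while adding time periods or measures scales it linearly.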

There is an experimental memory-mapped skims interface that might allow for a reduction in the RAM required, and once the sharrow interface is operable more generally (several months from now) the RAM required will be reduced.

JilanChen commented 2 years ago

SEMCOG's ABM ran into the same issue on one of our machines (128 GB RAM and 24 cores). RSG/SEMCOG are still investigating the cause. SEMCOG ABM Phase I (one-zone system) used to run on this machine, but this OSError started to appear a few months ago. When we monitored the machine, peak memory use was about 90 GB, so we are not sure whether it is a RAM issue or something else.

stefancoe commented 2 years ago

What model is causing the crash? What are your chunk settings? Have you been able to run the model to completion with a smaller household sample size? I would try reducing the chunk_size and see if that helps. Also, the number of processors should be a few below the total number available.

JilanChen commented 2 years ago

Currently, we are testing in training mode with chunk_size: 100_000_000_000 and num_processes: 20. The crash in SEMCOG's ABM mostly happens while running "workplace_location".

ray-ngo commented 2 years ago

It is good to know that there is no minimum number of processors. We can upgrade the RAM on our older servers (which have fewer cores and little RAM) to run our ActivitySim model. Thanks @jpn-- !

RSG told us that our model needs around 110 GB of RAM to run, while we set the chunk_size variable to 102 GB (80% of the available RAM) in our test run. I guess we have to set chunk_size to at least 110 GB. Please correct me if I am wrong.
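The arithmetic behind those figures can be sketched as follows. The helper name is hypothetical, and the sketch assumes decimal gigabytes; it only reproduces the numbers quoted above and does not settle what chunk_size should be.

```python
GB = 1_000_000_000  # decimal gigabytes (assumed); use 2**30 if the RAM figure is in GiB


def chunk_size_bytes(total_ram_gb: float, fraction: float) -> int:
    """Byte budget equal to a given fraction of total RAM (hypothetical helper)."""
    return round(total_ram_gb * fraction * GB)


budget = chunk_size_bytes(128, 0.80)  # 80% of a 128 GB server
needed = 110 * GB                     # RSG's estimate quoted above
shortfall = needed - budget

print(f"chunk_size budget: {budget:_} bytes")
print(f"gap to the 110 GB estimate: {shortfall / GB:.1f} GB")
```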

@stefancoe: What role does the number of processors play in chunk-size training? Feel free to point me to the documentation discussing this. In our run, we set it to 80% of the total available.

AndrewTheTM commented 2 years ago

One thing that would be nice to get out of ActivitySim is better RAM logging and reporting. Using a completed run of @ray-ngo's model above, the logged RAM maximum values are:

rss: 76,714,344,448 (77 GB)
full_rss: 1,247,299,629,056 (1.2 TB)
uss: 76,975,316,992 (77 GB)

So the model won't run on a 128 GB server with chunk_size set to 102 GB; the server I used has 244 GB. The skims need 76.6 GB (according to the log file). The answer to 'how much RAM does this model need' seems to be more than 102 GB (more than rss and uss) and less than 244 GB (far less than the maximum full_rss). Ray's model failed in shared_data_buffers, and at that point in the completed run, full_rss was 152 GB (roughly 2 × skim_buffer, per the mem.csv log), but that same field hit 1.2 TB on a machine that doesn't have that much RAM, so... ?
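One possible reading of the 1.2 TB figure, offered as an assumption and not confirmed anywhere in this thread: if full_rss is a sum of resident set sizes across processes, a shared skim buffer mapped into every process would be counted once per process. The sketch below uses a hypothetical process count and a hypothetical per-process working size to show how such a sum can exceed physical RAM.

```python
GB = 1_000_000_000


def summed_rss(shared_bytes: int, private_bytes: int, num_processes: int) -> int:
    """Sum of per-process RSS when every process maps the same shared buffer."""
    return num_processes * (shared_bytes + private_bytes)


def physical_use(shared_bytes: int, private_bytes: int, num_processes: int) -> int:
    """RAM actually occupied: shared buffer once, private memory per process."""
    return shared_bytes + num_processes * private_bytes


shared = 76_600_000_000  # skim size reported in the log above
private = 1 * GB         # hypothetical private memory per process
procs = 16               # hypothetical; the actual process count is not stated

print(f"summed RSS:   {summed_rss(shared, private, procs) / GB:.1f} GB")
print(f"physical use: {physical_use(shared, private, procs) / GB:.1f} GB")
```

If this reading is right, a per-process measure that excludes shared pages, plus the shared buffer counted once, would be the better guide to how much RAM the machine needs.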

stefancoe commented 2 years ago

@ray-ngo Here is the documentation for multiprocessing: https://activitysim.github.io/activitysim/core.html#multiprocessing

When using mp without chunking, ActivitySim breaks the 'problem' (where the problem is often a very large table of choosers and alternatives) into a number of parts equal to the num_processes setting. So 10 processes = 10 tables. Next, the program works on each table in parallel, one per process. This greatly decreases ActivitySim's runtime, but does nothing to manage the amount of RAM that is used.

If the tables, skims, and everything else require more RAM than is available, the program will crash. Chunking fits the 'problem' into the available RAM by further splitting each table into smaller pieces, which each process works through sequentially. So, using the example of 10 processes, if a sub-model/step requires two chunks given the available RAM, then each of the 10 tables is broken into 2 chunks that its process handles one after the other. The training step determines how many chunks each step needs given the available RAM and the number of processes used.
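The two-level split described above can be sketched in plain Python. This is only an illustration of the idea, not ActivitySim's implementation; the function name and the chooser count are hypothetical.

```python
def split_evenly(rows, parts):
    """Split a list into `parts` contiguous slices whose sizes differ by at most one."""
    base, extra = divmod(len(rows), parts)
    slices, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        slices.append(rows[start:start + size])
        start += size
    return slices


choosers = list(range(1_000))  # hypothetical chooser table, one entry per row
num_processes = 10
chunks_per_process = 2

# Level 1: one sub-table per process; these run in parallel.
per_process = split_evenly(choosers, num_processes)

# Level 2: each process splits its sub-table into chunks and runs them one at a time.
plan = [split_evenly(table, chunks_per_process) for table in per_process]

print(len(plan), "processes,", len(plan[0]), "chunks each,", len(plan[0][0]), "rows per chunk")
```

Only one chunk per process is in memory at a time, so peak working memory scales with the chunk size times the number of processes, not with the full table.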

The reason to use fewer processors than are available is to leave some compute power for the OS to manage all this and, I am guessing, to handle anything that might be multithreaded. Some Python libraries can take advantage of multithreading, which may not be ideal when we are already using multiprocessing. There is a setting, 'MKL_NUM_THREADS: 1', which should limit libraries like numpy to a single thread.
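For anyone scripting this outside the settings file, a minimal sketch of both ideas follows. MKL_NUM_THREADS is a standard environment variable read by MKL-backed libraries; the helper name and the reserve of 2 cores are assumptions standing in for "a few".

```python
import os

# Set this before importing numpy or any other MKL-backed library,
# otherwise the thread pool may already have been created.
os.environ["MKL_NUM_THREADS"] = "1"


def suggested_num_processes(reserve: int = 2) -> int:
    """Leave `reserve` cores for the OS; never return fewer than 1.

    os.cpu_count() reports logical cores, which can be higher than the
    number of physical cores on machines with hyperthreading.
    """
    total = os.cpu_count() or 1
    return max(1, total - reserve)


print("num_processes suggestion:", suggested_num_processes())
```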

I would try reducing both chunk_size and num_processes a little bit more and see if that helps. Have you run the model with a smaller number of households?

**Edit: Just read @AndrewTheTM's comments. Sounds like 128 GB is just not enough. Also corrected above: multiprocessing decreases runtime, not increases!

jfdman commented 2 years ago

Has anyone tried the system registry changes suggested here: https://stackoverflow.com/questions/53752487/oserror-winerror-1450-insufficient-system-resources-exist-to-complete-the-req ?

aletzdy commented 2 years ago

Has anyone tried the system registry changes suggested here: https://stackoverflow.com/questions/53752487/oserror-winerror-1450-insufficient-system-resources-exist-to-complete-the-req ?

I tried the registry change, in addition to the other fixes suggested here, on SEMCOG's 128 GB server, and it did not help.