ActivitySim / activitysim

An Open Platform for Activity-Based Travel Modeling
https://activitysim.github.io
BSD 3-Clause "New" or "Revised" License

Very Slow Chunking #861

Open dhensle opened 2 months ago

dhensle commented 2 months ago

Describe the bug: Chunk training takes a VERY long time.

This was run on a SANDAG server with 1 TB of RAM (chunk_size was set to 450GB), using only 64k households (~5%) and 5 cores. Run time was 66.85 hours, or about 2.8 days!

To Reproduce: Run the SANDAG ABM3 model in chunk training mode. This was performed with the BayDAG_estimation branch, which is based on ActivitySim version 1.2.

Expected behavior: Chunk training shouldn't take much longer than actually running the model. We have not seen chunk training take this long before. Is there something about the SANDAG model that takes a long time (e.g. two-zones)? Or was a dependency updated in a way that badly hurt performance?

Additional context: Log files are attached here: training_log.zip

Running in production mode also took an extremely long time (again, more than 2.5 days!). Part of the problem may be that the num_processors setting was set to 40 while the machine only had 32, but this shouldn't make that big of a difference.
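To rule out oversubscription as a factor, the requested worker count can be clamped to what the machine reports before launching a run. This is only a minimal sketch: effective_processes is a hypothetical helper, not an ActivitySim function, and it assumes os.cpu_count() reflects the cores actually usable on the server.

```python
import os


def effective_processes(requested, available=None):
    """Clamp a requested worker count to the CPUs available.

    Hypothetical helper for illustration; not part of ActivitySim.
    """
    if available is None:
        # os.cpu_count() can return None, so fall back to 1.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


# Values from the run described above: 40 requested, 32 on the machine.
print(effective_processes(40, available=32))  # 32
```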

The production logs show that about 700 minutes(!) of run time was spent in the parking location choice model. This appears to be because ActivitySim creates a chunk for every single chooser in that model (hence log statements like "Running chunk 10450 of 10456 with 1 of 10456 choosers"). The chunk_cache.csv (found in the training_log above) certainly shows that more than one row should be allowed per chunk when the chunk_size is set to 450GB. Attached: production_log_subset.zip
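For anyone checking their own logs for the same pattern, here is a rough sketch that counts "Running chunk" statements and how many of them cover a single chooser. The regular expression is based only on the log statement quoted above, so treat the exact wording in other ActivitySim versions as an assumption. summarize_chunks is a hypothetical helper, and the sample lines (including the "INFO -" prefix) are made up for illustration.

```python
import re

# Matches statements like the one quoted above:
#   Running chunk 10450 of 10456 with 1 of 10456 choosers
# The wording is assumed from that single example.
CHUNK_RE = re.compile(
    r"Running chunk (\d+) of (\d+) with (\d+) of (\d+) choosers"
)


def summarize_chunks(lines):
    """Return (chunk statements seen, single-chooser chunks, max chunk count)."""
    n_lines = 0
    n_single = 0
    max_chunks = 0
    for line in lines:
        match = CHUNK_RE.search(line)
        if match is None:
            continue
        _, n_chunks, n_choosers, _ = map(int, match.groups())
        n_lines += 1
        if n_choosers == 1:
            n_single += 1
        max_chunks = max(max_chunks, n_chunks)
    return n_lines, n_single, max_chunks


# Made-up sample lines; a real log would be read with open(path).
sample = [
    "INFO - Running chunk 10450 of 10456 with 1 of 10456 choosers",
    "INFO - Running chunk 1 of 2 with 5000 of 10000 choosers",
    "INFO - unrelated message",
]
print(summarize_chunks(sample))  # (2, 1, 10456)
```

As a rough sense of scale, if all of the roughly 700 minutes were spent iterating over 10456 single-chooser chunks, that would be about 4 seconds per chunk.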

Is this behavior related to #860?

(Currently working on reproducing with the main branch, but run is not yet complete. I will update once complete...)

dhensle commented 2 months ago

As mentioned above, I tested with the current main branch of the code and the sandag-abm3-example. The results were very similar.

I ran with 100k households in chunk_training mode without sharrow and with 10 cores. The chunk training run took about 24 hours!

Log files are attached: log_abm3_chunk_train_100k.zip