Very long time to write skims to shared data buffer when using sharrow

dhensle commented 2 months ago

Describe the bug It took about 7.5 hours to load the skims into the shared data buffer. This has happened on multiple machines now.

The attached timing and activitysim log show the very long runtimes.

timing_log.csv activitysim.log

The skim load times for non-sharrow runs are the roughly 25 mins as expected for this model. Additionally, running the MTC example model in sharrow compile mode does not show this same behavior. A log file for that run shows the "activitysim.core.skim_dataset - writing skims to shared memory" step takes just a couple of seconds.

To Reproduce Run in sharrow compile mode on a windows server.

Expected behavior Skim load times should be roughly comparable to non-sharrow mode.

cc: @jpn--, @i-am-sijia

jpn-- commented 2 months ago

Whoa that is crazy.

To confirm: This is using the skim OMX files encoded with blosc2:zstd, as unpacked from the v0.2.0 release assets? And, the 25 minutes load time is from non-sharrow, but using those same OMX files?

I am curious if the load times are better or worse or same if you use the "original" OMX-standard zlib compressed skims

dhensle commented 2 months ago

These results are from the "original" OMX-standard skims. And yes, the same OMX files were used in the non-sharrow run.

Perhaps related, it was taking a ton of time to unpack the blosc2:zstd skims on the one machine I was trying it on, so I just used the existing OMX-standard skims I already had. Haven't gotten back to try to reproduce that long unpack time yet...

Wondering if @i-am-sijia has had similar issues which might point it to a windows vs mac/linux thing?

i-am-sijia commented 2 months ago

Perhaps related, it was taking a ton of time to unpack the blosc2:zstd skims on the one machine I was trying it on, so I just used the existing OMX-standard skims I already had. Haven't gotten back to try to reproduce that long unpack time yet...

Wondering if @i-am-sijia has had similar issues which might point it to a windows vs mac/linux thing?

Yes, I'm seeing the same slow unpacking. The exercise.py downloaded all 16 v0.2.0 release assets successfully. It's now unpacking them into .\sandag-abm3-example\data_full. It took 1 hour to just go from “reading part000” to “reading part001”, and another hour to "reading part002".

I suppose it is going to take >10 hours to finish unpacking.

i-am-sijia commented 2 months ago

The long time to unpack the data files persists on windows.
I then got the unpacked data files directly from Jeff. I ran sharrow with the blosc2:zstd encoded skims, the "activitysim.core.skim_dataset - writing skims to shared memory" step took only 5 mins. activitysim.log

jpn-- commented 2 months ago

CS has a similar result, loading skims into shared memory took about 12 minutes (using blosc2:zstd) on a Windows server. I believe performance on this step is very hardware-specific, and hinges a lot on the bandwidth between the "disk" (typically SSD) and the CPU/RAM. We're loading 140GB of data from disk into memory, if the read speed of the disk is limited then this step is going to take a bit.

27/04/2024 17:06:07 - INFO - activitysim.core.skim_dataset - writing skims to shared memory
27/04/2024 17:18:26 - INFO - activitysim.abm.models.initialize - initialize_landuse.annotate_tables - annotating land_use SPEC annotate_landuse

On my MacBook Pro the load time is only 48 seconds, but I've got a very fast machine and the SSD read speed is upwards of 5GB/sec.

Let's not worry about "unpack" times for now, that's not a real ActivitySim function just a bit of help code for downloading data for testing. We can just stick with sharepoint for large file service in the now, and solve that problem later.

dhensle commented 2 months ago

Running with the blosc2 skims on the machine will "write skims to shared memory" in about 7.5 minutes.

This seems to support our theory that there is an inefficiency causing very slow read times with the original zlib omx files that occurs when running with sharrow on.

jpn-- commented 2 months ago

We changed the code to load data for skims in series instead of in parallel, and ran a test loading the skims on a Windows server with 1TB of RAM. Here is a figure showing the time to load each skim page (all are the same size in memory). They start off about 2-3 seconds, then jump up to irregular timings 10-20 seconds per page but with some spikes. The jump happens in the middle of the KNROUT skims, once about 60GB has been allocated in RAM, and does not appear to be related to which pages are being loaded, but just how many.

i-am-sijia commented 2 months ago

Ran with the original skims on my machine, "writing skims to shared memory" took about 16 hours, even longer than the 7.5 hours reported by David.

activitysim.log timing_log.csv

jpn-- commented 2 months ago

@dhensle, you were going to perhaps study testing skim load times with cropped skims. The tool to convert OMX files from zlib to blosc2:zstd is wring, install with pip install wring[omx].

jpn-- commented 2 months ago

To attempt to replicate the memory profiling result I show above on a different machine, you'll need to update sharrow to version 2.8.3, which adds the ability to pick the dask scheduler to use for skim loading, and then use code from https://github.com/camsys/activitysim/tree/skim-read which makes use of that power to be able to select the synchronous scheduler.

i-am-sijia commented 2 months ago

sharrow (v2.8.3): 29aaf6cf561e027587b3b2eb2bb152e0134db8b0 activitysim (camsys/skim-read): 874647c872c65deb20a4b6e655c0b7d6e02a6ba3 skim data: zlib compressed

I ran the single-threaded skim (with dask_scheduler = synchronous) loading on a 512 GB, 2.44 GHz windows machine. They started off about 10-15 secs per skim page, then kept increasing to about 10 mins each skim page. The total skim loading is taking about 18 hours. In the memory profile output, the uss stopped reporting at about 21 GB (11 hours in), when full_rss was at 128 GB.

joecastiglione commented 2 months ago

This pattern looks quite different than what @jpn-- profiled earlier in this issue which was flat and low, then kinda flat and much higher with extreme intermittent spikes.

dhensle commented 1 month ago

@dhensle, you were going to perhaps study testing skim load times with cropped skims. The tool to convert OMX files from zlib to blosc2:zstd is wring, install with pip install wring[omx].

I tried with the cropped skims. There doesn't seem to be any noticeable difference between them:

With blosc2 compression takes 19 seconds

06/05/2024 22:19:14 - INFO - activitysim.core.skim_dataset - writing skims to shared memory
06/05/2024 22:19:33 - INFO - activitysim.abm.models.initialize - initialize_landuse.annotate_tables - annotating land_use SPEC annotate_landuse

With zlib compression takes 18 seconds

06/05/2024 22:35:55 - INFO - activitysim.core.skim_dataset - writing skims to shared memory
06/05/2024 22:36:13 - INFO - activitysim.abm.models.initialize - initialize_landuse.annotate_tables - annotating land_use SPEC annotate_landuse

The load time difference appears to be size dependent....

i-am-sijia commented 1 month ago

Profiled skim loading without sharrow.

activitysim (activitysim/main): 50c05c50b11ed7436d376ac8f4df33baa46af5f7 skim data: zlib compressed

The sharrow off mode loads skims sequentially. On a 512 GB, 2.44 GHz windows machine, it took 20 mins to load 1755 individual skim cores. The average load time per skim is about 0.7 seconds (max 1.2 seconds). I do not see an irregular pattern.

i-am-sijia commented 1 month ago

In the sharrow off mode, there is an allocate_skim_buffer shared False taz shape (1755, 4947, 4947) total size: 171_799_119_180 (171.8 GB) step, which creates a buffer of RAM for the estimated skim size before loading skims. It seems to be doing the "allocating memory before loading" job as we discussed on the call. @jpn--

https://github.com/ActivitySim/activitysim/blob/1cb48c7acc2c377f4822e2521a19503842aaa495/activitysim/core/skim_dict_factory.py#L516-L518

However, when I looked at the memory profile output, I did not see a large memory being created and hold at the allocate_skim_buffer step. I only saw memory gradually picking up as the skims being loaded.

I then ran the model again, this time I paused the run at the skim_buffer object, I could see about 180 GB RAM is being hold at that point. When I resumed the run, I saw the memory actually being dropped before loading skims. I wonder why didn't I see this memory buffer in the previous run when I did not pause. Perhaps it happened too fast, less than the 0.5 sec interval for memory reporting.

jpn-- commented 1 month ago

I have rewritten the piece of skim loading code to remove dask, but it didn't totally fix the problem. Skim core loading times shown, still using zlib files. The pattern remains where it slows down shortly after moving to transit (but not exactly when moving to transit).

An interesting detail in this figure: in the left section there are two shoulders where load times run a bit over a second per core, and a central area where the load times run a bit below a second. The left shoulder is during the load of the EA time period (at which time memory is being allocated for all 5 time periods for each named skim core. The middle lower area is when the other time periods are being loaded into previously allocated regions of memory. Then the right shoulder, before things get bad, is back in the EA time period for transit.

Adding: running the same experiment on the same machine with Blosc2 compressed skims gave weird results, with traffic skims (which came first) loading much slower but transit skims loading much faster.

jpn-- commented 1 month ago

CS has re-run the experiments, over the weekend when usage of the same server by other projects is limited. It appears our performance irregularities may be caused by resource competition from other unrelated processes running on the same machine. In the re-run, we observe much better performance overall, as well as an anomaly in the middle of the blosc loading (orange) that is likely due to resource competition

Total load time for zlib skims was about 2 hours, and for blosc skims about 30 minutes.

Also, @i-am-sijia replicated the same experiments on WSP's server, which is apparently faster. She observed similar patterns, with no anomaly.

Total load time for zlib skims was about 25 minutes, and for blosc about 9 minutes.

dhensle commented 1 month ago

Tried with the SEMCOG model which has a skim file of 2.3 GB on disk and 65 GB in memory. In a non-sharrow run, skim load times are about 3.5 mins. With sharrow turned on, that jumps to about 14 mins. This is using the standard zlib skims.

Converting the skims to blosc2 format using wring took 7.7 minutes. Trying again with the blosc2 skims took 3 minutes.

(Tangentially related -- the SEMCOG model doesn't have the landuse data sorted by MAZ by default. This can be fixed easily in the code by sorting the dataframe before the sharrow check, as seen in this commit: https://github.com/ActivitySim/activitysim/pull/867/commits/96d4bb6bea5a5e15db6f632c365fa504e94715c2)

i-am-sijia commented 1 month ago

Update on the abm3 tests

Run ActivitySim with xxx_NUM_THREADS settings turned off (by commenting out source codes).
Run ActivitySim with MKL_NUM_THREADS = 1,OMP_NUM_THREADS = 1,OPENBLAS_NUM_THREADS = 1,NUMBA_NUM_THREADS = 1,VECLIB_MAXIMUM_THREADS = 1,NUMEXPR_NUM_THREADS = 1.
In the notebook that tests skim loading with Sharrow outside of ActivitySim, set xxx_NUM_THREADS to 1 (because previously it was not set) and test again. It took about 25 mins to run, same as before.
Run ActivitySim with memory profiling (every .5 sec) turned off.
Run ActivitySim with the default ActivitySim memory reporting commented out.

None of the tests showed a significant improvement of the skim loading time in ActivitySim.

ActivitySim / sandag-abm3-example

Very long time to write skims to shared data buffer when using sharrow #8