Fastest way to access skims

fscottfoti commented 9 years ago

@jiffyclub At some point pretty soon we'll want to diagnose the fastest way to access skims. Given that we store the skims in OMX format (we might want to consider packing multiple matrices into a single h5 for convenience?), the big question is how to store/access them in memory.

Given our recent history with the .loc command I'm guessing storing zone_ids directly is basically a non-starter. Fortunately, we're storing a dense matrix so we can make sure every zone_id is in position 1 greater than it's index (i.e. zone 1 is in index 0). That way we can either 1) have a dataframe with a multi-index and call .take or 2) have a 2-D numpy array and access then directly, but only for one column at a time. Do we think that 1) is slower than 2) because 1) is definitely more attractive from a code perspective. I guess this "stacked" vs "unstacked" format.

At any rate, we should probably write a small abstraction to hide this from the user. Basically we pass in one of the formats above with dimension N and then pass in two series of "origin" and "destination" zone ids and get back the values.

jiffyclub commented 9 years ago

We can definitely have more than one matrix per file. I imagine having two stores: one pandas.HDFStore for tables and Series, and another OMX HDF5 file for NumPy arrays.

jiffyclub commented 9 years ago

One thing to keep in mind about .take is that it works with location based indexes, you're doing essentially the same procedure you'd be doing with NumPy arrays. If you had zone IDs you'd still need to translate them into locations before using .take.

fscottfoti commented 9 years ago

Right, I assume we have to make the assumption that zone ids have a deterministic relationship with indexes in order to make this work fast.

jiffyclub commented 9 years ago

With that assumption we're definitely going to get the best performance from NumPy arrays. Pandas is slower at pretty much any size, but indexing Pandas things really doesn't scale well with size.

screenshot 2014-12-15 12 18 57

fscottfoti commented 9 years ago

OK - we should create a small matrix object to wrap up the numpy methods we'll need. There won't be too many, at least at first.

jiffyclub commented 9 years ago

What are the approximate sizes of the things we're talking about? I think you've said skims will be roughly ~1000 x ~1000. How many numbers will we be pulling out of that at once?

fscottfoti commented 9 years ago

Right now it's about 1500x1500, but liable to go up significantly in the future. I would imagine it would have to fit in memory or it would be quite slow. We will pull on the order of several million numbers at once, if we can.

e-lo commented 9 years ago

Apologies if this comment is off base...

Multiple tables/file While this simplifies things for a specific situation, it also is a tad inflexible. At SFCTA, we have opted to not do this for several reasons that we have encountered thusfar:

If we want to distribute something to another machine or open a file...it is a lot better to be specific for network I/O, RAM, and CPU
Sending and using skim files for post-processing utilities or giving them out to consultants or other users. Smaller and more specific the better.
Easier to see where differences are between runs and troubleshooting when they are separate (i.e. by using Git binary diff or other)

There are probably other reasons to keep them separate (and plenty of reasons to keep them together), but that is my 2 cents.

e-lo commented 9 years ago

Just another comment that any abstractions or things that would be useful to go into OMX itself should go there instead of ActivitySim :-)

DavidOry commented 9 years ago

Two potentially helpful notes:

(1) Roughly speaking, in current practice we tend to keep square matrices with spatial units in the 5k to 10k range and move to non-square matrices with spatial units in the 10k to 40k range. In the Travel Model Two system, we have three spatial systems rather than one, with approximate sizes: 6,000 (auto movements), 5,000 (transit stops), and 40,000 (human movements).

(2) Many of these matrices are sparse. As you go, you'll learn that we often do not need to store zero values.

e-lo commented 9 years ago

@DavidOry - it might be helpful to clarify the units of your sizes (#rows/columns, cells, or MB)

DavidOry commented 9 years ago

So in Travel Model Two, we have:

(1) Travel Analysis Zones, ~6,000. We skim automobile travel time, distance, cost, etc. So we end up with a whole bunch of 6,000 x 6,000 matrices.

(2) Transit Access Points, ~5,000. We skim transit stop-to-stop times, fares, transfers, etc. So we end up with a whole bunch of 5,000 x 5,000 matrices.

(3) Micro-Analysis Zones, ~30,000. We skim nearby automobile travel time, walk time, bicycle time, walk/drive to Transit Access Point time. So we have 30,000 rows by X column data files, where X is a function of distance thresholds, but is around ~5,000. So we have a bunch of 30,000 rows x ~5,000 column data files.

All of these are read into memory.

@e-lo : Does that help?

e-lo commented 9 years ago

Exactly @DavidOry

And from our call last week it sounded like currently in CT RAMP all of those are read into memory at once at the outset of each full iteration of the travel model. I'm sure there is probably a way to simplify this task other than having to open them for each instance in the event that we parallelize the problem. maybe i hope?

DavidOry commented 9 years ago

Currently CT-RAMP reads them in whenever it encounters a model that uses them. Because mode choice uses just about all of them and one of the first tasks is to compute mode choice logsums, they are, for all intents and purposes, all read in and stored in memory at the beginning and updated at each global iteration. Importantly, a copy of the matrices is then made on each computer doing the parallel computations. So if the matrices have a memory footprint of, say, 50 GB, you need 50 GB of RAM on each computer you want to contribute to the effort. If your other model components have a footprint of say, 75 GB, you end up needing several computers with 125 GB of RAM to do the computations. That's where we are at now and part of @danielsclint frustration with not being able to distribute the tasks to smaller CPUs.

fscottfoti commented 9 years ago

Wanted to renew this thread as well. A few separate issues...

I'm not too concerned with the dense (square) vs sparse matrices issue. Pandas has a stack and unstack operation which essentially transforms from one to the other if used with the appropriate indexes. stack would go from dense to sparse and unstack would do the opposite. For now we will do the square matrices but we will keep in mind that sparse matrices will be used when the matrices get too large.
For the issue of storing one or more matrices per file, I think that's not too big a deal either. hdf5 is a pretty lightweight file format - we can choose to put one matrix in there or multiple matrices in there for different use cases. Generally, I can definitely see passing one matrix at a time from machine to machine, but if we end up using most of the matrices for the bottleneck of the model it might be convenient to store them together (it's essentially like a zip file, not really a format).
As for the question of the assumption of whether zone_ids can be indexes (zero based and contiguous) it sounds like that is not guaranteed at all. From a computational perspective though, it is required, so we will have a step at the beginning to convert labels to indexes, and for the early stages of the project we will assume that zone ids will be zero-based and contiguous (and that we'll have to fix that before we're through).
My biggest concern is just how many skims will be necessary for which models, and if there are any ways to reduce memory or parallelize computation. Generally we should get this working functionally first and we will keep an eye out for problems that can be made more efficient as we go. It's likely these issues will come up again and again.

guyrousseau commented 9 years ago

Currently @ARC with CT-RAMP, in theory, the JPPF Driver application balances the load across the nodes so as to avoid the situation of machines sitting and waiting for others to finish. However, the balancing algorithm depends upon the configuration of a set of parameters specified in the jppf properties files. We have not had the opportunity to re-examine and fine-tune the settings since our new server was added to our cluster. Similarly, the thread number settings could also be further tweaked. So in our case, typically a single, best performing server, does a lot of the heavy lifting and probably handles 50% of the tasks. Thus, the workload being handled by each of the remaining machines (servers) probably does not justify the associated overhead relating to skim loading and task communications across the network.

guyrousseau commented 9 years ago

sorry I did not mean to close this thread of conversation

ActivitySim / activitysim

Fastest way to access skims #8