fscottfoti closed this issue 8 years ago.
We can definitely have more than one matrix per file. I imagine having two stores: one `pandas.HDFStore` for tables and Series, and another OMX HDF5 file for NumPy arrays.
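For what it's worth, a minimal sketch of that two-store layout (file names, table names, and matrix names below are placeholders), assuming the openmatrix Python package:

```python
import numpy as np
import pandas as pd
import openmatrix as omx  # OMX's Python API, built on PyTables/HDF5

# Tables and Series go in a pandas HDFStore...
with pd.HDFStore('pipeline.h5') as store:          # hypothetical file name
    store['households'] = pd.DataFrame({'income': [50000, 72000]})

# ...while skim matrices live in a separate OMX file, one named matrix each.
with omx.open_file('skims.omx', 'w') as skims:     # hypothetical file name
    skims['SOV_TIME'] = np.random.rand(1500, 1500)
```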
One thing to keep in mind about `.take` is that it works with location-based indexes, so you're doing essentially the same procedure you'd be doing with NumPy arrays. If you had zone IDs you'd still need to translate them into locations before using `.take`.
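For concreteness, a toy sketch of that translation step (the zone numbering and the stacked layout are just assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical skim stored as a stacked Series indexed by (origin, destination) zone IDs.
zone_ids = np.arange(1, 6)
skim = pd.Series(
    np.random.rand(len(zone_ids) ** 2),
    index=pd.MultiIndex.from_product([zone_ids, zone_ids], names=['o', 'd']),
)

# .take wants positions, not labels, so zone IDs have to be turned into positions first.
origins = np.array([1, 3, 5])
destinations = np.array([2, 2, 4])
positions = (origins - 1) * len(zone_ids) + (destinations - 1)
values = skim.take(positions)
```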
Right, I think we have to assume that zone IDs have a deterministic relationship with their positions in order to make this work fast.
With that assumption we're definitely going to get the best performance from NumPy arrays. Pandas is slower at pretty much any size, and indexing Pandas objects in particular doesn't scale well with size.
OK - we should create a small matrix object to wrap up the numpy methods we'll need. There won't be too many, at least at first.
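Something along these lines, perhaps (the class name, the offset convention, and the `get` method are all illustrative, not a settled interface):

```python
import numpy as np


class Skim(object):
    """Thin wrapper around a dense 2-D NumPy skim matrix."""

    def __init__(self, data, offset=1):
        # offset maps zone IDs to array positions (zone 1 -> row 0 when offset=1)
        self.data = np.asarray(data)
        self.offset = offset

    def get(self, orig, dest):
        # vectorized lookup for arrays of origin/destination zone IDs
        orig = np.asarray(orig) - self.offset
        dest = np.asarray(dest) - self.offset
        return self.data[orig, dest]
```

Usage would then look something like `Skim(travel_time).get([1, 3], [2, 4])`.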
What are the approximate sizes of the things we're talking about? I think you've said skims will be roughly ~1000 x ~1000. How many numbers will we be pulling out of that at once?
Right now it's about 1500x1500, but liable to go up significantly in the future. I would imagine it would have to fit in memory or it would be quite slow. We will pull on the order of several million numbers at once, if we can.
Apologies if this comment is off base...
Multiple tables/file: While this simplifies things for a specific situation, it is also a tad inflexible. At SFCTA, we have opted not to do this for several reasons that we have encountered thus far:
There are probably other reasons to keep them separate (and plenty of reasons to keep them together), but that is my 2 cents.
Just another comment that any abstractions or other things that would be useful in OMX itself should go there instead of into ActivitySim :-)
Two potentially helpful notes:
(1) Roughly speaking, in current practice we tend to keep square matrices with spatial units in the 5k to 10k range and move to non-square matrices with spatial units in the 10k to 40k range. In the Travel Model Two system, we have three spatial systems rather than one, with approximate sizes: 6,000 (auto movements), 5,000 (transit stops), and 40,000 (human movements).
(2) Many of these matrices are sparse. As you go, you'll learn that we often do not need to store zero values.
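(Purely as an illustration of that point, not a format recommendation: a scipy.sparse matrix stores only the nonzero cells, whatever the logical shape.)

```python
import numpy as np
from scipy import sparse

# Illustrative nonzero entries: (row, col, value) for nearby MAZ pairs only.
rows = np.array([0, 1, 2])
cols = np.array([10, 11, 12])
walk_time = np.array([4.2, 7.5, 3.1])

# Logical shape is 30,000 x 5,000, but only the three nonzero cells are stored.
skim = sparse.coo_matrix((walk_time, (rows, cols)), shape=(30000, 5000)).tocsr()
print(skim.nnz)     # 3
print(skim[1, 11])  # 7.5
```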
@DavidOry - it might be helpful to clarify the units of your sizes (#rows/columns, cells, or MB)
So in Travel Model Two, we have:
(1) Travel Analysis Zones, ~6,000. We skim automobile travel time, distance, cost, etc. So we end up with a whole bunch of 6,000 x 6,000 matrices.
(2) Transit Access Points, ~5,000. We skim transit stop-to-stop times, fares, transfers, etc. So we end up with a whole bunch of 5,000 x 5,000 matrices.
(3) Micro-Analysis Zones, ~30,000. We skim nearby automobile travel time, walk time, bicycle time, walk/drive to Transit Access Point time. So we have 30,000 rows by X column data files, where X is a function of distance thresholds, but is around ~5,000. So we have a bunch of 30,000 rows x ~5,000 column data files.
All of these are read into memory.
@e-lo : Does that help?
Exactly @DavidOry
And from our call last week it sounded like, currently in CT-RAMP, all of those are read into memory at once at the outset of each full iteration of the travel model. I'm sure there is probably a way to simplify this other than having to open them for each instance if we parallelize the problem. Maybe? I hope?
Currently CT-RAMP reads them in whenever it encounters a model that uses them. Because mode choice uses just about all of them and one of the first tasks is to compute mode choice logsums, they are, for all intents and purposes, all read in and stored in memory at the beginning and updated at each global iteration. Importantly, a copy of the matrices is then made on each computer doing the parallel computations. So if the matrices have a memory footprint of, say, 50 GB, you need 50 GB of RAM on each computer you want to contribute to the effort. If your other model components have a footprint of, say, 75 GB, you end up needing several computers with 125 GB of RAM to do the computations. That's where we are now, and it's part of @danielsclint's frustration with not being able to distribute the tasks to smaller CPUs.
Wanted to renew this thread as well. A few separate issues...
Pandas has a `stack` and an `unstack` operation which essentially transforms between the two layouts if used with the appropriate indexes: `stack` would go from dense to sparse and `unstack` would do the opposite. For now we will do the square matrices, but we will keep in mind that sparse matrices will be used when the matrices get too large.
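(A toy sketch of that round trip, with made-up numbers:)

```python
import numpy as np
import pandas as pd

zones = [1, 2, 3]

# "Unstacked" / dense form: one row and one column per zone.
dense = pd.DataFrame(np.random.rand(3, 3), index=zones, columns=zones)

# stack() flattens it into a long Series indexed by (origin, destination) pairs;
# in a sparse setup, empty cells would simply be absent from this long form.
long_form = dense.stack()

# unstack() goes back the other way.
dense_again = long_form.unstack()
```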
Currently @ARC with CT-RAMP, in theory, the JPPF Driver application balances the load across the nodes so as to avoid the situation of machines sitting and waiting for others to finish. However, the balancing algorithm depends on the configuration of a set of parameters specified in the JPPF properties files. We have not had the opportunity to re-examine and fine-tune those settings since our new server was added to our cluster. Similarly, the thread-number settings could also be further tweaked. So in our case a single, best-performing server typically does a lot of the heavy lifting and probably handles 50% of the tasks. Thus, the workload handled by each of the remaining machines (servers) probably does not justify the associated overhead of skim loading and task communication across the network.
Sorry, I did not mean to close this thread of conversation.
@jiffyclub At some point pretty soon we'll want to diagnose the fastest way to access skims. Given that we store the skims in OMX format (we might want to consider packing multiple matrices into a single h5 for convenience?), the big question is how to store/access them in memory.
Given our recent history with the `.loc` command, I'm guessing storing zone_ids directly is basically a non-starter. Fortunately, we're storing a dense matrix, so we can make sure every zone_id is exactly 1 greater than its index (i.e. zone 1 is at index 0). That way we can either 1) have a DataFrame with a multi-index and call `.take`, or 2) have a 2-D NumPy array and access it directly, but only for one column at a time. Do we think that 1) is slower than 2)? Because 1) is definitely more attractive from a code perspective. I guess this is the "stacked" vs "unstacked" format question.

At any rate, we should probably write a small abstraction to hide this from the user. Basically we pass in one of the formats above with dimension N, then pass in two series of "origin" and "destination" zone ids, and get back the values.
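A minimal sketch of that abstraction, assuming option 2) and a fixed zone-to-position offset (the function name and signature are just placeholders):

```python
import numpy as np
import pandas as pd


def lookup_skim(skim, origins, destinations, offset=1):
    """Return skim values for paired origin/destination zone IDs.

    ``skim`` is a dense 2-D NumPy array where zone ``z`` sits at position
    ``z - offset`` (zone 1 at index 0).  ``origins`` and ``destinations``
    are equal-length pandas Series of zone IDs.  Option 1) would instead
    keep a stacked, multi-indexed Series and call ``.take`` on positions
    computed the same way.
    """
    o = origins.values - offset
    d = destinations.values - offset
    return pd.Series(skim[o, d], index=origins.index)


# usage sketch with made-up trips
skim = np.random.rand(1500, 1500)
trips = pd.DataFrame({'origin': [1, 42, 1500], 'destination': [7, 7, 3]})
trips['time'] = lookup_skim(skim, trips['origin'], trips['destination'])
```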
or 2) have a 2-D numpy array and access then directly, but only for one column at a time. Do we think that 1) is slower than 2) because 1) is definitely more attractive from a code perspective. I guess this "stacked" vs "unstacked" format.At any rate, we should probably write a small abstraction to hide this from the user. Basically we pass in one of the formats above with dimension N and then pass in two series of "origin" and "destination" zone ids and get back the values.