UDST / choicemodels

Python library for discrete choice modeling
https://udst.github.io/choicemodels
BSD 3-Clause "New" or "Revised" License

Batch simulation to improve performance and decrease the risk of memory errors #47

Closed mxndrwgrdnr closed 5 years ago

mxndrwgrdnr commented 5 years ago

After a ton of debugging and deconstructing the new simulation code, I'm increasingly convinced that many of our data tables are just going to be too big to work with. In my case, running simulation for the WLCM with a chooser population of ~3 million workers, sampling 10 alternatives, and using a model specification with roughly 30 covariates, you end up with very, very big MergedChoiceTables and patsy dmatrix objects. They aren't so big that they are guaranteed to break the process, but breaks are guaranteed to happen eventually, especially in the context of constrained choice models, where the iterative lottery choices create and re-create these gigantic tables during each iteration. The crash typically occurs during probability generation here.

Anyway, the simplest solution I can think of would be to perform the simulation in batches. This could be done all the way up at the level of the simulation model step for unconstrained choices, but it's not quite so easy for constrained choices like the WLCM, where the simulator needs to be aware of which choices have been taken off the market before the actual tables of alternatives have been updated.
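
For the unconstrained case, the batching idea could look something like the sketch below: slice the chooser table into fixed-size chunks so that each merged choice table stays small. The function and parameter names here (`simulate_in_batches`, `run_batch`, `batch_size`) are illustrative, not part of the choicemodels API.

```python
import pandas as pd


def simulate_in_batches(choosers, batch_size, run_batch):
    """Run a simulation callable over fixed-size chooser batches.

    ``run_batch`` stands in for whatever builds the (now much smaller)
    MergedChoiceTable and returns the simulated choices for one batch.
    """
    results = []
    for start in range(0, len(choosers), batch_size):
        batch = choosers.iloc[start:start + batch_size]
        results.append(run_batch(batch))
    # Stitch the per-batch results back into one table of choices
    return pd.concat(results)
```

For constrained choices this alone isn't enough, for the reason above: the alternatives table would need to be updated between batches so later batches can't pick choices that earlier batches already took.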

I see on line 173 here that there is already some sampling of the chooser table going on, even though a frac value of 1 means you're sampling the entire table. I'm wondering if it might not work to just expose that frac parameter at a higher level in cases where the choice tables are expected to be too large, and allow the iterative choices to proceed on one random sample of the choosers at a time until the criterion for closing the while loop has been met. Another option would be to add an additional loop above line 173 that processes the choosers table in batches, so the merged choice tables would always be much smaller than they would be otherwise.
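
A stripped-down sketch of the sampling idea, assuming for simplicity that everyone in a sampled batch gets placed (the real lottery loop throws ties back in, as discussed below); `lottery_with_sampling`, `assign_batch`, and `sample_frac` are assumed names, not the actual choicemodels API:

```python
import pandas as pd


def lottery_with_sampling(choosers, assign_batch, sample_frac=0.1):
    """Assign choices to a random fraction of the remaining choosers
    per pass, so the merged choice table stays bounded in size."""
    placed = []
    remaining = choosers
    while len(remaining) > 0:
        batch = remaining.sample(frac=sample_frac)
        if len(batch) == 0:
            # frac of a tiny remainder can round down to zero rows;
            # just process everyone who's left
            batch = remaining
        placed.append(assign_batch(batch))
        remaining = remaining.drop(batch.index)
    return pd.concat(placed)
```

The loop terminates because each pass removes at least one chooser, and the table built inside `assign_batch` is roughly `sample_frac` times the size it would otherwise be.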

I haven't completely thought through how this might introduce biases into the simulation process, because households would suddenly only be competing for choices with other households within their batch, as opposed to the entire population. Based on my understanding of how ties get broken currently (they get thrown back into the pot and we try again the next iteration, right?), I don't think either the sampling or the batching approach to choice assignment should do much to bias our results. I'd love to get your opinion, though.

I already implemented the sampling version and it seems to be working, although I haven't validated the results yet: (screenshot attached)

The changes to the code are quite simple: I just added a sample_frac param to the iterative_lottery_choices() function and had the mct_callable() function on line 173 make use of it. You can see them on my branch here.

Anyway, let me know what you think, whether you're skeptical, whether you think the batching would be better, etc.

waddell commented 5 years ago

This might be another argument in favor of batching based on smaller time steps like months or quarters, rather than years. We would take small chunks of households for the HLCM and locate them competitively, and with market clearing in place we would be updating prices for the next chunk, etc. For the WLCM we could do the same thing, and eventually consider using market clearing with wages adjusting.

mxndrwgrdnr commented 5 years ago

Interesting. @waddell are you suggesting just running HLCM and WLCM multiple times within the context of a single simulation year, or changing the overall unit of iteration of an urbansim run?

smmaurer commented 5 years ago

@mxndrwgrdnr This is an interesting problem and clever solution! I can't think of any way this would bias the choices. Seems like useful functionality.

Incidentally, choosers.sample(frac=1) was there because it's an efficient way to shuffle the order of the choosers between rounds. Convenient that it helps with this as well!
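
For anyone following along, a quick demonstration of why `sample(frac=1)` works as a shuffle: it returns every row, just in a random order.

```python
import pandas as pd

choosers = pd.DataFrame({'worker_id': range(5)})

# frac=1 keeps all rows but randomizes their order; random_state is
# fixed here only to make the example reproducible
shuffled = choosers.sample(frac=1, random_state=42)

assert len(shuffled) == len(choosers)              # no rows dropped
assert set(shuffled.index) == set(choosers.index)  # same choosers, new order
```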