maz_to_maz tables potentially cause duplicate index key error in los.py when using zero-based zone_id setting

ActivitySim / activitysim

An Open Platform for Activity-Based Travel Modeling

https://activitysim.github.io

BSD 3-Clause "New" or "Revised" License

189 stars 96 forks source link

maz_to_maz tables potentially cause duplicate index key error in los.py when using zero-based zone_id setting #640

Open stefancoe opened 1 year ago

stefancoe commented 1 year ago

This error can happen here when using MAZ IDs in the maz_to_maz tables that are higher than the maximum zero-based MAZ ID. This code is used to calculate the index of these files: df["i"] = df.OMAZ * self.maz_ceiling + df.DMAZ

For example, our data has zone pairs 48-4903 & 55-997, with a max ceiling of 558, which makes df['i'] = 31687 in both cases.

The reason our MAZ labels are so much higher than the zero-based maximum is because I am using a subset of our region for a testing dataset. However, this still happens in our full sized data presumably because zones without emp/hhs are removed in the psrc_crop script. So we have IDs in maz_to_maz_walk file, for example, that are higher than the zero-based max.

It seems like the maz_to_maz tables need to get the zero-based treatment as well?

jpn-- commented 1 year ago

Yeah, that's a bug. When it's used, zero-based encoding should get applied immediately on load (as soon as possible after pd.read_csv) to any file that has MAZ or TAZ ids in it. I will fix it.

stefancoe commented 1 year ago

@jpn-- Thank you!

stefancoe commented 1 year ago

@jpn-- Any update on this? Thanks!

jpn-- commented 1 year ago

Yes, I've found the fix in #643 is not robust across multiple use cases, including multiprocessing and large-value MAZ ID's. The problem also interacts with / is related to #652. I'm trying to get to a solution that works for all, but it looks like it will require duplicating the TAZ table, so one copy can be sliced for multiprocessing and the other not.

stefancoe commented 1 year ago

@jpn-- Thank you!

stefancoe commented 1 year ago

@jpn-- it looks like self.maz_ceiling can be either a pandas object or plain int. If the latter, the line below fails because the .astype method/function is not available.

https://github.com/camsys/activitysim/blob/generic-whale/activitysim/core/los.py#L313

stefancoe commented 1 year ago

@jpn-- For two-zone systems, it seems like self.maz_ceiling is always going to be an int: https://github.com/camsys/activitysim/blob/generic-whale/activitysim/core/los.py#L286

So self.maz_ceiling.astype() wont work.

jpn-- commented 1 year ago

I'm not sure, but I think the problem with max_ceiling being a plain integer is a "sometimes" problem, maybe platform (win/mac/linux) related, or possibly due to differences in dependency versions (which could also be indirectly platform related).

In any case, I think this should be addressed by https://github.com/camsys/activitysim/commit/4dfbbeb17a69c3adbcc03bdfb49aaba19398cfef which changes from x.astype(np.int64) to np.int64(x), which should now evaluate correctly regardless if max_ceiling is a plain Python integer or a numpy scalar.

stefancoe commented 1 year ago

Thanks! I see now- it does seem like it should work the way it was coded since .max should return a scalar. Anyhow, thanks for the fix.

https://pandas.pydata.org/pandas-docs/version/1.5/reference/api/pandas.Series.max.html