UDST / sanfran_urbansim

An UrbanSim for San Francisco: an example implementation of the new framework

"KeyError: 'building_id'" when running lcm_simulate #19

Open lisalan520 opened 9 years ago

lisalan520 commented 9 years ago

Hi,

I also have a problem running the 'hlcm_simulate' and 'elcm_simulate' models using my own data. Both models raise KeyError: 'building_id'. I've checked my data and found nothing weird. I've also managed to break the model into individual steps and run them one by one. Do you have any idea what could be wrong? Thank you!

Here is my error message:

Running model 'hlcm_simulate'
There are 450501 total available units
    and 359815 total choosers
    but there are 0 overfull buildings
    for a total of 90686 temporarily empty units
    in 81292 buildings total in the region
Assigned 0 choosers to new units

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-012e08452343> in <module>()
----> 1 sim.run(["hlcm_simulate"])

C:\Anaconda\lib\site-packages\urbansim\sim\simulation.pyc in run(models, years, data_out, out_interval)
   1458                 model = get_model(model_name)
   1459                 t2 = time.time()
-> 1460                 model()
   1461                 print("Time to execute model '{}': {:.2f}s".format(
   1462                       model_name, time.time()-t2))

C:\Anaconda\lib\site-packages\urbansim\sim\simulation.pyc in __call__(self)
    670             kwargs = _collect_variables(names=self._argspec.args,
    671                                         expressions=self._argspec.defaults)
--> 672             return self._func(**kwargs)
    673 
    674     def _tables_used(self):

C:\Users\xzhang\Documents\PythonScripts\Marion_urbansim_test_0514_with_building_ids\models.pyc in hlcm_simulate(households, buildings, zones)
     39     return utils.lcm_simulate("hlcm.yaml", households, buildings, zones,
     40                               "building_id", "residential_units",
---> 41                               "vacant_residential_units")
     42 
     43 

C:\Users\xzhang\Documents\PythonScripts\Marion_urbansim_test_0514_with_building_ids\utils.pyc in lcm_simulate(cfg, choosers, buildings, nodes, out_fname, supply_fname, vacant_fname)
    198 
    199     # go from units back to buildings
--> 200     new_buildings = pd.Series(units.ix[new_units.values][out_fname].values,
    201                               index=new_units.index)
    202 

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   1676             return self._getitem_multilevel(key)
   1677         else:
-> 1678             return self._getitem_column(key)
   1679 
   1680     def _getitem_column(self, key):

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in _getitem_column(self, key)
   1683         # get column
   1684         if self.columns.is_unique:
-> 1685             return self._get_item_cache(key)
   1686 
   1687         # duplicate columns & possible reduce dimensionaility

C:\Anaconda\lib\site-packages\pandas\core\generic.pyc in _get_item_cache(self, item)
   1050         res = cache.get(item)
   1051         if res is None:
-> 1052             values = self._data.get(item)
   1053             res = self._box_item_values(item, values)
   1054             cache[item] = res

C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in get(self, item, fastpath)
   2563 
   2564             if not isnull(item):
-> 2565                 loc = self.items.get_loc(item)
   2566             else:
   2567                 indexer = np.arange(len(self.items))[isnull(self.items)]

C:\Anaconda\lib\site-packages\pandas\core\index.pyc in get_loc(self, key)
   1179         loc : int if unique index, possibly slice or mask if not
   1180         """
-> 1181         return self._engine.get_loc(_values_from_object(key))
   1182 
   1183     def get_value(self, series, key):

C:\Anaconda\lib\site-packages\pandas\index.pyd in pandas.index.IndexEngine.get_loc (pandas\index.c:3656)()

C:\Anaconda\lib\site-packages\pandas\index.pyd in pandas.index.IndexEngine.get_loc (pandas\index.c:3534)()

C:\Anaconda\lib\site-packages\pandas\hashtable.pyd in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:11911)()

C:\Anaconda\lib\site-packages\pandas\hashtable.pyd in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:11864)()

KeyError: 'building_id'

fscottfoti commented 9 years ago

I think (I'm not totally sure) that you need a building_id in the households table. Do you have that column? The vacant_residential_units requires there to be a building_id.

lisalan520 commented 9 years ago

Thanks for replying! I have 'building_id' column in both my households & jobs data. When I did the step-by-step test, it has no problem in calculating the vacant_residential_units...

fscottfoti commented 9 years ago

OK - yeah I've definitely seen this before but am having a hard time remembering the problem.

I would try 2 things - first, try naming the index of your buildings table -

buildings.index.name = 'building_id'

then I would double check the index for duplicates

pd.Series(buildings.index).value_counts() and see if the top row has a value > 1.
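
The two checks suggested above can be sketched as follows. This is only an illustration on a made-up `buildings` table (the data and the `residential_units` values are assumptions, not the real dataset), using standard pandas calls.

```python
import pandas as pd

# Hypothetical stand-in for the real buildings table.
buildings = pd.DataFrame(
    {"residential_units": [4, 2, 7]},
    index=[1, 2, 5],
)

# 1. Name the index so that reset_index() produces a 'building_id' column.
buildings.index.name = "building_id"

# 2. Look for duplicated index values. value_counts() sorts in descending
#    order, so the first entry is the largest count and should be 1.
counts = pd.Series(buildings.index).value_counts()
has_duplicates = counts.iloc[0] > 1

# A more direct way to ask the same question.
index_is_unique = buildings.index.is_unique

print(counts.head())
print("duplicates:", has_duplicates, "unique index:", index_is_unique)
```

If `has_duplicates` is true, the duplicated labels are the ones at the top of `counts`.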

lisalan520 commented 9 years ago

I've checked my buildings data with the method you suggested. There are no duplicates in buildings.index. And I got the same error message...

fscottfoti commented 9 years ago

And you tried changing the name of the index?

On Tue, Jun 2, 2015 at 1:55 PM lisalan520 notifications@github.com wrote:

I've checked my buildings data with the method you suggested. There are no duplicates in buildings.index. And I got the same error message...


lisalan520 commented 9 years ago

Yes I did. Could it be related to data type?

fscottfoti commented 9 years ago

Is building_id a float because there are nans? If so, that is very likely it - it should be an int column.
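
For reference, here is a small sketch of how NaNs turn an id column into floats and how to check for it. The `households` table is made up, and filling with -1 is just one possible convention, labeled here as an assumption rather than a requirement of the model code.

```python
import numpy as np
import pandas as pd

# Hypothetical households table; one household has no building yet.
households = pd.DataFrame({"building_id": [1, 2, np.nan, 5]})

# A single NaN forces the whole column to a float dtype.
print(households["building_id"].dtype)

# Count the missing values before deciding how to handle them.
n_missing = households["building_id"].isnull().sum()

# One option (an assumption): mark unplaced households with -1
# and cast the column back to integers.
households["building_id"] = households["building_id"].fillna(-1).astype(int)
print(households["building_id"].dtype)
```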

lisalan520 commented 9 years ago

The building_id is an int column. Should it be consecutive integers? I have something like [1,2,3,5,9,10]. Will this be an issue?

fscottfoti commented 9 years ago

Definitely does NOT have to be consecutive.

jiffyclub commented 9 years ago

Are you sure you're talking about the right table? The error is occurring here:

C:\Users\xzhang\Documents\PythonScripts\Marion_urbansim_test_0514_with_building_ids\utils.pyc in lcm_simulate(cfg, choosers, buildings, nodes, out_fname, supply_fname, vacant_fname)
    198 
    199     # go from units back to buildings
--> 200     new_buildings = pd.Series(units.ix[new_units.values][out_fname].values,
    201                               index=new_units.index)
    202 

And I think the most likely way to get a KeyError there is if the units DataFrame doesn't have a 'building_id' column. Which table is units?

fscottfoti commented 9 years ago

Units comes from here:

https://github.com/synthicity/urbansim_defaults/blob/master/urbansim_defaults/utils.py#L358

It's an expansion of the original buildings table and it needs a building_id to get back to the buildings.

I really think the building_id column comes from the call to .reset_index() right there, and that the index has to be named building_id for that column to appear. If the index is named building_id, I'm not sure why the column wouldn't be there after that.
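
The expansion described here can be sketched roughly like this. The tables and values are made up for illustration; only the `np.repeat` / `.loc` / `.reset_index()` pattern is taken from the code under discussion.

```python
import numpy as np
import pandas as pd

# Hypothetical buildings table with a named index.
locations_df = pd.DataFrame(
    {"zone_id": [10, 10, 20]},
    index=pd.Index([1, 2, 5], name="building_id"),
)
# Hypothetical vacancy counts per building.
vacant_units = pd.Series([2, 0, 3], index=locations_df.index)

# Repeat each building label once per vacant unit, then select those rows.
repeated = np.repeat(vacant_units.index.values,
                     vacant_units.values.astype("int"))
expanded = locations_df.loc[repeated]

# reset_index() moves the index into a column named after the index.
# If the index has no name at this point, the column is called 'index'.
units = expanded.reset_index()
print(units.columns.tolist())
```

Printing `units.columns` right after this step shows whether the index name survived the `.loc` selection in a given pandas version.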

lisalan520 commented 9 years ago

From my step-by-step test, the unit table looks like this:

image

and it does have building_id...

jiffyclub commented 9 years ago

It looks like @lisalan520 is not using the same version of lcm_simulate @fscottfoti linked to. Any reason to think that could be a problem?

fscottfoti commented 9 years ago

It seems like it's a lot different - the line number has gone from 200 to 437. @lisalan520 what version are you using?

I don't know for sure, but it's definitely possible the new version would fix the problem. I have made some small changes in the function in the past 2-3 months. If we know what version @lisalan520 is running maybe we can diff them?

lisalan520 commented 9 years ago

The lcm_simulate I used comes from here:

https://github.com/synthicity/sanfran_urbansim/blob/master/utils.py

I have the same code as in the link. I'm using UrbanSim 1.3. I'll try to update my urbansim to see whether it solves the problem.

lisalan520 commented 9 years ago

It seems the two 2.0 versions both use the discrete choice model, so updating should not solve the problem here. I'll try to run the discrete choice model again and hope my computer can handle it this time. Many thanks!

fscottfoti commented 9 years ago

So just to be clear, when you print out units building_id is there, and we're looking at the expression units.loc[new_units.values][out_fname] where out_fname is equal to building_id so can you print out units.loc[new_units.values]? - somehow building_id is missing from the result? What is the expression equal to?

lisalan520 commented 9 years ago

'units' is a disaggregated table of 'buildings', expanded according to the vacant_units value. 'new_units' comes from the LCM model's predict step. 'new_units.values' is used to pick the rows of 'units' whose index labels appear in 'new_units.values'.

Here is a capture:

image

fscottfoti commented 9 years ago

So can you then run units.loc[new_units.values][out_fname]? What am I missing?

lisalan520 commented 9 years ago

I was able to run units.loc[new_units.values]["building_id"] to get the results. But when I define out_fname= building_id, I cannot run units.loc[new_units.values][out_fname] here.

fscottfoti commented 9 years ago

Can you print units.columns? Grasping at straws here...

lisalan520 commented 9 years ago

Here it is: image

fscottfoti commented 9 years ago

Interesting - and you put building_id in quotes above so that it's a string? Not sure what's going on here, but it's definitely a Pandas issue - there's no UrbanSim happening here that I can see.

jiffyclub commented 9 years ago

Note that the code @lisalan520 is running uses .ix, not .loc. I wonder if that's making a difference.

lisalan520 commented 9 years ago

I tried both .loc and .ix and they have the same problem with using building_id without quotes. image

jiffyclub commented 9 years ago

You're not going to be able to use building_id without quotes, it has to be a string or a variable that refers to a string.
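
A tiny illustration of the difference, on a made-up `units` table. An unquoted name fails with a NameError before pandas is involved at all, which is a different failure from the KeyError in the traceback.

```python
import pandas as pd

# Hypothetical units table.
units = pd.DataFrame({"building_id": [1, 1, 5], "zone_id": [10, 10, 20]})

# A string literal works as a column key.
by_literal = units["building_id"]

# So does a variable that refers to that string.
out_fname = "building_id"
by_variable = units[out_fname]

# A bare, undefined name is looked up as a Python variable and fails.
try:
    units[building_id]  # intentionally undefined name
    error_type = None
except NameError as exc:
    error_type = type(exc).__name__

print(error_type)
```

A KeyError, by contrast, means pandas did receive a key but found no column with that label.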

jiffyclub commented 9 years ago

@lisalan520 You're not in SF, are you? I wish I could debug this in person. We might also be able to use a Google Hangout, I think I can drive your computer from those.

lisalan520 commented 9 years ago

Sorry for some reason I thought it was building_id in my models.py. I just checked it and it was "building_id" when I got the error...

I'm in Indianapolis. I'll check if I can use Google Hangout on this computer. Thanks!

lisalan520 commented 9 years ago

Hi @jiffyclub I think we can try Google Hangouts. So how do I connect with you?

jiffyclub commented 9 years ago

You can join me here: https://plus.google.com/hangouts/_/gvttqyhgmmnclmstprvgwmrhwma

jiffyclub commented 9 years ago

Just had a call with @lisalan520 and for some reason for her the expression

    units = locations_df.loc[np.repeat(vacant_units.index.values,
                             vacant_units.values.astype('int'))].reset_index()

is resulting in the 'building_id' label on locations_df.index being dropped. She's using Pandas 0.14.1 and is going to try updating to 0.16.1 to see if that has been fixed (I suspect it has been fixed, since @fscottfoti hasn't run into the same problem).

lisalan520 commented 9 years ago

Many thanks @jiffyclub !

I could only update my Pandas to 0.16.0 due to our firewall. The problem was still there. At this moment I don't think the error comes from pandas but I will continue to update it to 0.16.1.

Meanwhile, with the problem we've found, I changed out_fname to 'index' in the code:
new_buildings = pd.Series(units.loc[new_units.values]['index'].values, index=new_units.index)

and the model works fine after this change. Though I still don't understand why "building_id" turns into "index" after the .loc[] lookup...

But it seems the problem is solved for now. Thank you very much! I really appreciate your help!

jiffyclub commented 9 years ago

So weird that locations_df.reset_index() preserves the name, but locations_df.loc[].reset_index() doesn't! But glad you have something working.
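
An alternative to hardcoding 'index' is to put the label back on the expanded table before calling .reset_index(), so the result does not depend on whether a given pandas version keeps the index name through .loc. The sketch below assumes made-up inputs (`locations_df`, `vacant_units`, `new_units` and the `household_id` label are hypothetical) and is not the code from utils.py.

```python
import numpy as np
import pandas as pd

out_fname = "building_id"

# Hypothetical inputs, shaped like the ones discussed in this thread.
locations_df = pd.DataFrame(
    {"zone_id": [10, 10, 20]},
    index=pd.Index([1, 2, 5], name=out_fname),
)
vacant_units = pd.Series([2, 0, 3], index=locations_df.index)

expanded = locations_df.loc[
    np.repeat(vacant_units.index.values, vacant_units.values.astype("int"))
]
# Restore the label explicitly in case .loc dropped it.
expanded.index.name = out_fname
units = expanded.reset_index()

# Hypothetical model output: chooser id -> chosen row of `units`.
new_units = pd.Series([3, 0], index=pd.Index([101, 102], name="household_id"))

# Go from units back to buildings.
new_buildings = pd.Series(
    units.loc[new_units.values][out_fname].values, index=new_units.index
)
print(new_buildings)
```

With the name set explicitly, `out_fname` can stay "building_id" in models.py and the lookup no longer relies on the column being called 'index'.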