UDST / orca

Python library for task orchestration
https://udst.github.io/orca/
BSD 3-Clause "New" or "Revised" License

Problem merging tables with overlapping broadcast relationships #38

Open smmaurer opened 6 years ago

smmaurer commented 6 years ago

I'm having trouble merging sets of tables with overlapping broadcast relationships.

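For example, with `a` broadcast onto `b` and `b` broadcast onto `c`, merging all three tables runs.

But once `a` is also broadcast directly onto `c`, the same merge raises an error.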

This came up in real-world use (https://github.com/ual/urbansim_parcel_bayarea/issues/11), but here's a stand-alone demonstration that you can paste into a Python script:

import orca
import pandas as pd

a = pd.DataFrame({'ix': [1,2], 'val_a': ['a1','a2']})
b = pd.DataFrame({'ix': [1,2], 'val_b': ['b1','b2'], 'a': [1,2]})
c = pd.DataFrame({'ix': [1,2], 'val_c': ['c1','c2'], 'a': [1,2], 'b': [1,2]})

orca.add_table('a', a.set_index('ix'))
orca.add_table('b', b.set_index('ix'))
orca.add_table('c', c.set_index('ix'))

orca.broadcast(cast='a', onto='b', cast_index=True, onto_on='a')
orca.broadcast(cast='b', onto='c', cast_index=True, onto_on='b')

df = orca.merge_tables(target='c', tables=['c', 'b', 'a'])

orca.broadcast(cast='a', onto='c', cast_index=True, onto_on='a')

df = orca.merge_tables(target='c', tables=['c', 'b', 'a'])  # throws error

Here is the error:

Twin-Clouds-iMac:Desktop maurer$ python test.py
Traceback (most recent call last):
  File "test.py", line 19, in <module>
    df = orca.merge_tables(target='c', tables=['c', 'b', 'a'])  # throws error
  File "/Users/maurer/Dropbox/Git-imac/udst/orca/orca/orca.py", line 1799, in merge_tables
    cast_table = frames[cast]
KeyError: 'a'
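
For reference, the result one might expect from the failing merge can be sketched with plain pandas, without Orca. This is only an illustration under stated assumptions: it uses inner joins, and it drops `b`'s copy of the `a` key so that it does not clash with the `a` column already on `c`. The names `b_with_a`, `via_b`, and `direct` are made up for this sketch. It builds the merged table along both possible routes for `a` so the two can be compared.

```python
import pandas as pd

# Same sample tables as the demonstration script, indexed by 'ix'.
a = pd.DataFrame({'ix': [1, 2], 'val_a': ['a1', 'a2']}).set_index('ix')
b = pd.DataFrame({'ix': [1, 2], 'val_b': ['b1', 'b2'], 'a': [1, 2]}).set_index('ix')
c = pd.DataFrame({'ix': [1, 2], 'val_c': ['c1', 'c2'], 'a': [1, 2], 'b': [1, 2]}).set_index('ix')

# Route 1: fold 'a' into 'b' first, then fold the result into 'c'.
# Assumption: b's copy of the 'a' key is dropped to avoid a column clash with c.
b_with_a = b.merge(a, left_on='a', right_index=True)
via_b = c.merge(b_with_a.drop(columns='a'), left_on='b', right_index=True)

# Route 2: merge 'b' and 'a' directly onto 'c', using c's own 'b' and 'a' keys.
direct = (c.merge(b.drop(columns='a'), left_on='b', right_index=True)
           .merge(a, left_on='a', right_index=True))

# Compare the two routes after aligning row and column order.
same = via_b.sort_index().equals(direct.sort_index()[via_b.columns])
print(via_b)
print('routes agree:', same)
```

With this sample data the keys on `b` and `c` are consistent, so either route is a plausible answer for the merge. The routes could differ if `c.a` and `b.a` disagreed, which is where the ambiguity comes from.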

This is a bug, right? I can see how it's a potentially ambiguous merge, but if we just resolve it in a consistent way it seems like a supportable use case. Overlapping broadcasts are helpful if you want to do different merge combinations at different times with maximum efficiency.
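
One way to make "resolve it in a consistent way" concrete is sketched below in plain Python. This is not Orca's implementation. `plan_merges` is a hypothetical helper, and the rule it applies is only one possible choice: when a table can be broadcast onto more than one table, prefer the longest chain to the target and break ties by table name. Each cast table then appears in exactly one merge step.

```python
def plan_merges(target, broadcasts):
    """Return an ordered list of (cast, onto) merge steps ending at `target`.

    Hypothetical rule: each cast table is merged exactly once, onto the
    candidate that is farthest from the target (ties broken by name).
    Assumes the broadcast relationships contain no cycles.
    """
    # Map each cast table to every table it can be broadcast onto.
    onto_options = {}
    for cast, onto in broadcasts:
        onto_options.setdefault(cast, []).append(onto)

    depth_cache = {target: 0}

    def depth(table):
        # Longest number of merge steps from `table` to the target,
        # or None if the target cannot be reached from `table`.
        if table not in depth_cache:
            reachable = [d for d in map(depth, onto_options.get(table, []))
                         if d is not None]
            depth_cache[table] = max(reachable) + 1 if reachable else None
        return depth_cache[table]

    plan = []
    for cast, options in onto_options.items():
        if cast == target or depth(cast) is None:
            continue  # nothing to merge, or no path to the target
        candidates = [o for o in options if depth(o) is not None]
        # Pick the deepest candidate; fall back to name order on ties.
        onto = min(candidates, key=lambda o: (-depth(o), o))
        plan.append((cast, onto))

    # Merge the tables farthest from the target first.
    plan.sort(key=lambda step: (-depth(step[0]), step[0]))
    return plan


# The overlapping broadcasts from the demonstration script.
overlapping = [('a', 'b'), ('b', 'c'), ('a', 'c')]
print(plan_merges('c', overlapping))  # [('a', 'b'), ('b', 'c')]
```

Under this rule the extra `a` onto `c` broadcast is simply ignored when `b` is part of the merge, so the plan is the same as in the case that already runs. Preferring the direct broadcast instead would be equally consistent; the point is only that some fixed rule removes the ambiguity.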

I don't see an obvious source for the error, but will dig into it more when I have a chance.

I'm running Orca 1.5.1 and pandas 0.22.0.