UDST / orca

Python library for task orchestration
https://udst.github.io/orca/
BSD 3-Clause "New" or "Revised" License
53 stars 21 forks source link

change suffix on merge #20

Closed fscottfoti closed 7 years ago

fscottfoti commented 7 years ago

This is something we probably should have done a long time ago - when you have columns defined on multiple tables in a merge - e.g. zone_id defined on parcels and buildings, then the result has zone_id_x and zone_id_y. With this change you'll have zone_id and zone_id_y, so as long as the two zone_ids are the same after merging, this is probably what the user expects.

I should note that I actually had code which depended on this - in one or two places I actually used zone_id_x instead of zone_id, so this does have the potential to break people's code. Probably worth it though.

coveralls commented 7 years ago

Coverage Status

Coverage remained the same at 96.523% when pulling b211725db51eb1bba75dc0c6568ea49b5aba72de on change-suffix-for-merge into d8df9f88b8a5b9f93b9168c22303bc5eebc9ef72 on master.

bridwell commented 7 years ago

What if instead of suffixes, the field was prefixed with each source table name so it was clear where the field came from? For example buildings_taz, households_taz.

On Wed, Mar 8, 2017 at 11:46 AM Coveralls notifications@github.com wrote:

[image: Coverage Status] https://coveralls.io/builds/10496160

Coverage remained the same at 96.523% when pulling b211725 https://github.com/UDST/orca/commit/b211725db51eb1bba75dc0c6568ea49b5aba72de on change-suffix-for-merge into d8df9f8 https://github.com/UDST/orca/commit/d8df9f88b8a5b9f93b9168c22303bc5eebc9ef72 on master.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/UDST/orca/pull/20#issuecomment-285130647, or mute the thread https://github.com/notifications/unsubscribe-auth/AExeAEoLENR3e_pmJciE6vEZy1SWVPa7ks5rjveWgaJpZM4MXJ0n .

fscottfoti commented 7 years ago

I hear what you're saying. Let's see - here's my simple use case.

In [2]: buildings = orca.get_table('buildings')

In [3]: parcels = orca.get_table('parcels')

In [4]: buildings_df = orca.merge_tables(
             target='buildings',
             tables=[buildings, parcels],
             columns=['juris', 'zone_id', 'vacant_job_spaces'])

In [6]: buildings_df.columns
Out[6]: 
Index([u'parcel_id', u'vacant_job_spaces', u'zone_id',
       u'zone_id_y', u'juris'],
      dtype='object') 

So I have two tables I'm merging - nothing too complicated, and I'm asking for zone_id, and it's defined on both tables.

Basically on orca master I would get columns zone_id_x and zone_id_y back.

In the code above, which uses this PR I get back zone_id and zone_id_y.

And in your proposal we'd get back zone_id_buildings and zone_id_parcels.

I see what you're saying, but still, when I ask for zone_id and don't get back a zone_id, isn't that confusing? I would then have to know when I have a column defined in multiple places and compensate for that in the code. Also, in my case, they're always the same (though I can imagine cases where they wouldn't be).

What about getting back zone_id and zone_id_parcels?

I could also imagine asking for 'buildings.zone_id' in the columns list and getting back 'zone_id' in that case, but that would take some more work on the orca side.

What do you think?

coveralls commented 7 years ago

Coverage Status

Coverage remained the same at 96.523% when pulling 8788fc4463a5511781f36e0855288d4c93906257 on change-suffix-for-merge into d8df9f88b8a5b9f93b9168c22303bc5eebc9ef72 on master.

coveralls commented 7 years ago

Coverage Status

Coverage remained the same at 96.523% when pulling 8788fc4463a5511781f36e0855288d4c93906257 on change-suffix-for-merge into d8df9f88b8a5b9f93b9168c22303bc5eebc9ef72 on master.

HakimOuarasForcity commented 7 years ago

Hi, I noticed that if we call many times orca.merge_tables we replicate the columns with suffix _y indefinitely. Is it possible to avoid a such duplication? For example keep only the last one.

bridwell commented 7 years ago

This definitely gets tricky when merging more than 2 tables together. Consider the case that is provided in the orca docs, where we merge tables a, b, c, and d (https://udst.github.io/orca/core.html#automated-merges).

If the same column exists on all the tables (for example a column name col), we end up with the following fields:

However, if the same column exists on a, b, c but not on d:

And, has been previously noted, if the column exists on a and b, but not c and d:

I guess my point here is that is difficult to predict what column names you are going to get, based on which tables have which fields and their join ordering.

@fscottfoti, I think your idea of leaving the column name as is on the target table, and suffixing the other columns with their source tables, make sense. But any tests that get written probably need to include cases where 2, 3 and 4 tables share column names. I could also see adding some options to give users more control of what they're getting:

janowicz commented 7 years ago

Agreed that maintaining target table column name, and suffixing identically-named columns from other tables with their source table name, would be useful.

fscottfoti commented 7 years ago

A new try at this merge behavior. From the docstring...

drop_intersection : bool (default True) If True, keep the left most occurrence of any column name if it occurs on more than one table. This prevents getting back the same column with suffixes applied by pd.merge. If false, columns names will be suffixed with the table names - e.g. zone_id_buildings and zone_id_parcels.

I'll need to write tests, but there's no point writing tests unless we approve of the behavior so what do you think? Basically the default will give you back one column with the name you ask for which is from the left most table, or you can set to False and get table name suffixes.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.009%) to 96.532% when pulling e52fe6ccc914870dafe6779fb2f98f06544a92a5 on change-suffix-for-merge into d8df9f88b8a5b9f93b9168c22303bc5eebc9ef72 on master.

janowicz commented 7 years ago

@fscottfoti I like this new merge behavior

pksohn commented 7 years ago

@fscottfoti I think this looks good too and that you should go ahead and write tests. Let me know if you want some help. It would be great to squeeze this into the next release.

fscottfoti commented 7 years ago

I hear you on getting this in to the next release. I don't have time to work on it at the moment, but happy if you take it over to make the release.

pksohn commented 7 years ago

Sounds good. I'll start working on unit tests.

fscottfoti commented 7 years ago

BTW, I did eventually realize there's a problem here for an odd number of tables. Let's say we have households, buildings, and parcels, all of which have zone_id defined. For the first merge of households and buildings, the columns will becomes zone_id_households and zone_id_buildings, but then neither of these columns will have a name clash with the third, so instead of becoming zone_id_parcels, it just stays as zone_id. That will probably need to be fixed (funny how simple fixes get bigger and bigger).

Anyway, I'm back from my trip so can work on this if you haven't gotten too deep into it yet.

pksohn commented 7 years ago

I haven't done any work on this so go ahead!

fscottfoti commented 7 years ago

The new commit should do it - tests have been added

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.07%) to 96.589% when pulling 614b346261caca89df52f7fb19ca0abb671fb838 on change-suffix-for-merge into d8df9f88b8a5b9f93b9168c22303bc5eebc9ef72 on master.

pksohn commented 7 years ago

Looks good to me.

fscottfoti commented 7 years ago

We ready to merge? Everyone like this behavior ok?

pksohn commented 7 years ago

I think we're good. Merging this.