SEL-Columbia / bamboo

Dynamic data analysis over the web. The logic to your data dashboards.
http://bamboo.io
BSD 3-Clause "New" or "Revised" License
157 stars 20 forks source link

Column join between existing datasets on key #248

Closed pld closed 11 years ago

prabhasp commented 11 years ago

From pandas paper, df1.join(df2, on='key') looks pretty promising (where key is a shared column name, exactly as we need).

pld commented 11 years ago

Join requires that the df1 and df2 have disjoint columns, I think we'll want merge(df1, df2, on='key'), actually maybe not since this duplicates rows.

General merge, etc docs: http://pandas.pydata.org/pandas-docs/dev/merging.html#brief-primer-on-merge-methods-relational-algebra

Join definition: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html?highlight=join#pandas.DataFrame.join

Merge definition: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge

pld commented 11 years ago

If we assume the join column has unique entries, here is a way to merge,

Assuming we have a population_df with the columns lga and population, where column lga is unique, and a water_df with a bunch of columns including the column lga (which need not be unique) and without the column population, in pandas:

population_df = population_df.set_index('lga')
merged_df = water_df.join(population_df, on='lga')

We can wrap this so your API call is something like (since it's a post I'm showing the query params as post data):

POST /datasets/[water_df ID]/merge

DATA:
* other_dataset_id=[population_df ID]
* on=[join column name, e.g. 'lga']

@prabhasp Does this functionality sound like a good first start for implementation? @mejymejy We might want to make the API more distinct from the row merge

prabhasp commented 11 years ago

On Mon, Oct 29, 2012 at 6:14 PM, Peter Lubell-Doughtie < notifications@github.com> wrote:

If we assume the join column has unique entries, here a way to merge,

Assuming we have a population_df with the columns lga and population, where column lga is unique, and a water_df with a bunch of columns including the column lga (which need not be unique) and without the column population, in pandas:

population_df = population_df.set_index('lga')merged_df = water_df.join(population_df, on='lga')

We can wrap this so your API call is something like (since it's a post I'm showing the query params as post data):

POST /datasets/[water_df ID]/merge

DATA:

  • other_dataset_id=[population_df ID]
  • on=[join column name, e.g. 'lga']

@prabhasp https://github.com/prabhasp Does this functionality sound like a good first start for implementation?

"Does this functionality sound like a good first start for implementation" --> Great! We should have a good error message for when population_df's lga column isn't unique, but otherwise, this is exactly it.

@mejymejy https://github.com/mejymejy We might want to make the API more district from the row merge