Closed pld closed 11 years ago
Join requires that the df1
and df2
have disjoint columns, I think we'll want merge(df1, df2, on='key')
, actually maybe not since this duplicates rows.
General merge, etc docs: http://pandas.pydata.org/pandas-docs/dev/merging.html#brief-primer-on-merge-methods-relational-algebra
Join definition: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html?highlight=join#pandas.DataFrame.join
Merge definition: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge
If we assume the join column has unique entries, here is a way to merge,
Assuming we have a population_df
with the columns lga
and population
, where column lga
is unique, and a water_df
with a bunch of columns including the column lga
(which need not be unique) and without the column population
, in pandas:
population_df = population_df.set_index('lga')
merged_df = water_df.join(population_df, on='lga')
We can wrap this so your API call is something like (since it's a post I'm showing the query params as post data):
POST /datasets/[water_df ID]/merge
DATA:
* other_dataset_id=[population_df ID]
* on=[join column name, e.g. 'lga']
@prabhasp Does this functionality sound like a good first start for implementation? @mejymejy We might want to make the API more distinct from the row merge
On Mon, Oct 29, 2012 at 6:14 PM, Peter Lubell-Doughtie < notifications@github.com> wrote:
If we assume the join column has unique entries, here a way to merge,
Assuming we have a population_df with the columns lga and population, where column lga is unique, and a water_df with a bunch of columns including the column lga (which need not be unique) and without the column population, in pandas:
population_df = population_df.set_index('lga')merged_df = water_df.join(population_df, on='lga')
We can wrap this so your API call is something like (since it's a post I'm showing the query params as post data):
POST /datasets/[water_df ID]/merge
DATA:
- other_dataset_id=[population_df ID]
- on=[join column name, e.g. 'lga']
@prabhasp https://github.com/prabhasp Does this functionality sound like a good first start for implementation?
"Does this functionality sound like a good first start for implementation" --> Great! We should have a good error message for when population_df's
lga
column isn't unique, but otherwise, this is exactly it.@mejymejy https://github.com/mejymejy We might want to make the API more district from the row merge
From pandas paper,
df1.join(df2, on='key')
looks pretty promising (where key is a shared column name, exactly as we need).