joins - Githubissues

shriram commented 1 year ago

@MithiJ will see what joins Pandas has and suggest what makes sense for us.

Mithi, please tag Joe and Kathi once they have joined the repo.

MithiJ commented 1 year ago

Tagging @kfisler here too.

concat() - Instead of calling .addrow() and .addcolumn() repeatedly, you can concatenate whole tables. I think this would be easy to implement, but it seems unnecessary as long as we can recursively add rows or columns from one table to another. In this simple form there is nothing further to decide about joining on indexes or columns, preserving old values or new ones, preserving old or new indices, etc.
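To make the idea concrete, here is a minimal sketch of what concatenation could look like. It is written in Python rather than Pyret, it models a table as a list of dict rows, and the names concat_rows and concat_columns are hypothetical (they are not pandas or Pyret functions).

```python
# Hypothetical sketch: a table is a list of rows, each row a dict keyed by
# column name. None of these function names come from pandas or Pyret.

def concat_rows(top, bottom):
    """Return a new table: the rows of `top` followed by the rows of `bottom`."""
    if top and bottom and set(top[0]) != set(bottom[0]):
        raise ValueError("tables must have the same columns")
    # Copy each row so the result shares no row objects with the inputs.
    return [dict(row) for row in top] + [dict(row) for row in bottom]


def concat_columns(left, right):
    """Return a new table that pairs row i of `left` with row i of `right`."""
    if len(left) != len(right):
        raise ValueError("tables must have the same number of rows")
    if left and set(left[0]) & set(right[0]):
        raise ValueError("column names must not overlap")
    return [{**l_row, **r_row} for l_row, r_row in zip(left, right)]


spring = [{"name": "Ada", "grade": 90}, {"name": "Bo", "grade": 85}]
fall = [{"name": "Cy", "grade": 70}]
combined = concat_rows(spring, fall)  # three rows; spring and fall are unchanged

ages = [{"age": 19}, {"age": 20}]
widened = concat_columns(spring, ages)  # two rows with name, grade, and age
```

Both functions build a new table and leave their inputs alone, and neither needs recursion from the caller.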

merge() - There are five options: inner, outer, cross, left, and right. .join() offers the same ones. In merge, the "how" parameter selects among them. An inner join keeps only the rows whose keys appear in both tables (the intersection). An outer join keeps the rows from both tables (the union) and fills in NaN wherever a row has no match in the other table. A left or right join keeps every row of the left or right table, respectively, along with the corresponding values from the other table, filling in NaN where no corresponding value is present. A cross join pairs every row of the left table with every row of the right table.
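The join types are easier to compare side by side. The following is a rough Python sketch, not pandas code: tables are lists of dict rows, the merge function here is a hypothetical stand-in, None plays the role of NaN, the cross join is left out, and the sketch assumes the key column is the only column the two tables share. Row order in the result is not meant to match pandas.

```python
# Hypothetical sketch of merge(left, right, on, how) over lists of dict rows.
# Assumes `on` is the only column name the two tables have in common.

def merge(left, right, on, how="inner"):
    """Join two tables on column `on` and return a new table."""
    if how not in ("inner", "outer", "left", "right"):
        raise ValueError("unsupported how: " + how)

    left_cols = [c for c in (left[0] if left else []) if c != on]
    right_cols = [c for c in (right[0] if right else []) if c != on]

    def combine(l_row, r_row):
        # Either side may be None, meaning "no matching row on that side".
        out = {on: l_row[on] if l_row is not None else r_row[on]}
        for c in left_cols:
            out[c] = l_row[c] if l_row is not None else None
        for c in right_cols:
            out[c] = r_row[c] if r_row is not None else None
        return out

    result = []
    if how == "right":
        # Keep every right row, matched or not.
        for r_row in right:
            partners = [l_row for l_row in left if l_row[on] == r_row[on]]
            if partners:
                result.extend(combine(l_row, r_row) for l_row in partners)
            else:
                result.append(combine(None, r_row))
        return result

    matched = set()  # positions of right rows that found a partner
    for l_row in left:
        partners = [i for i, r_row in enumerate(right) if r_row[on] == l_row[on]]
        matched.update(partners)
        if partners:
            result.extend(combine(l_row, right[i]) for i in partners)
        elif how in ("left", "outer"):
            result.append(combine(l_row, None))
    if how == "outer":
        # Add the right rows that never matched a left row.
        result.extend(combine(None, r_row)
                      for i, r_row in enumerate(right) if i not in matched)
    return result


students = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bo"}, {"id": 3, "name": "Cy"}]
grades = [{"id": 2, "grade": 85}, {"id": 3, "grade": 70}, {"id": 4, "grade": 60}]

inner = merge(students, grades, on="id", how="inner")
left_join = merge(students, grades, on="id", how="left")
right_join = merge(students, grades, on="id", how="right")
outer = merge(students, grades, on="id", how="outer")
```

With these two tables, the inner join keeps ids 2 and 3, the left join keeps ids 1 to 3 (with no grade for id 1), the right join keeps ids 2 to 4 (with no name for id 4), and the outer join keeps all four ids.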

Some notable design considerations:

left vs right indices -> Do we want to use the left table's index or the right table's? This seems to be useful in pandas because it lets you label the joined tables with unique identifiers so that you can separate or extract them again after concatenating. You can also choose to preserve the original indices (a table with 3 rows joined with another table with 3 rows gets indices [0, 1, 2, 0, 1, 2] unless otherwise specified). I found this needlessly complicated, especially for educational table programming contexts. Further, other functions in Pyret (such as .addrow()) create a new table with the necessary changes made. Thus, we can create a new table with fresh indexing (like [0, 1, 2, 3, 4, 5]) and not allow lookups based on the indices stored before the join.
There are options to allow one-to-many, many-to-many, many-to-one, and one-to-one joins. The validate keyword of the merge function checks that the key values are unique in the tables declared as "one". merge allows many-to-one joins where columns with the same values are collapsed into one. This could be an add-on feature.
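If we did want the uniqueness check as an add-on, it could be a small separate step run before the merge. Below is a Python sketch under the same list-of-dicts assumption. check_cardinality is a hypothetical name, and the relationship strings simply mirror the spellings pandas accepts for validate.

```python
from collections import Counter

# Hypothetical sketch: verify that the key column is unique on every side
# declared as "one". Tables are lists of dict rows.

def check_cardinality(left, right, on, validate):
    """Raise ValueError if key column `on` breaks the declared relationship."""
    sides = {
        "one_to_one": (True, True),
        "one_to_many": (True, False),
        "many_to_one": (False, True),
        "many_to_many": (False, False),
    }
    if validate not in sides:
        raise ValueError("unknown relationship: " + validate)
    left_unique, right_unique = sides[validate]
    for name, table, must_be_unique in (("left", left, left_unique),
                                        ("right", right, right_unique)):
        if must_be_unique:
            counts = Counter(row[on] for row in table)
            repeated = [key for key, n in counts.items() if n > 1]
            if repeated:
                raise ValueError(
                    f"{name} table repeats keys in column {on!r}: {repeated}")


orders = [{"customer": 1, "item": "pen"},
          {"customer": 1, "item": "ink"},
          {"customer": 2, "item": "pad"}]
customers = [{"customer": 1, "name": "Ada"}, {"customer": 2, "name": "Bo"}]

# Many orders per customer is fine when declared as many-to-one.
check_cardinality(orders, customers, on="customer", validate="many_to_one")
```

Declaring the same pair of tables as "one_to_one" would raise an error, because customer 1 appears twice in orders.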

Thus, it seems useful to have a merge function that takes the two tables to be merged, a "how" parameter specifying the type of merge, and an "on" parameter specifying the column or index to merge on. This function would create and return a new table containing elements from both tables as specified.

kfisler commented 1 year ago

Concat would be useful. Yes, one could implement that with add-row and recursion, but we want students to be able to do interesting table manipulation before they get to recursion.

As for merge: indexing indeed adds complexity. I'd rather we first agree on what operations we want to let students do with indices; then we can decide what index support makes sense. Pandas uses indices to let programmers access rows, for use in either lookup or update operations.

I definitely see value to offering an operation that lets people locate a row without explicitly filtering on a predicate. We've talked about operations like "find me the row in which column c has value x". To do more than that, we need a notion of tagging that is not part of the row data itself. Would we make those indices visible? I see confusion being induced either way.
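As a rough illustration of the "find me the row in which column c has value x" operation, here is a Python sketch over a list-of-dicts table. find_row is a hypothetical name, not an existing Pyret or pandas function.

```python
# Hypothetical sketch: locate a row by a column value, with no predicate
# written by the caller. Tables are lists of dict rows.

def find_row(table, column, value):
    """Return the first row whose `column` equals `value`, or None."""
    for row in table:
        if row[column] == value:
            return row
    return None


roster = [{"id": 101, "name": "Ada"}, {"id": 102, "name": "Bo"}]
found = find_row(roster, "id", 102)    # the row for Bo
missing = find_row(roster, "id", 999)  # None: no such row
```

When the chosen column holds unique values, it acts as the index and no separate, hidden tagging is needed.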

From the perspective of CS111, I like keeping things clean, which means not having indices other than a column with unique values. Shriram's mileage may differ based on 19. @shriram ?

shriram commented 1 year ago

I think our needs are sufficiently similar. We can always add more things later.

MithiJ / pyret-lang

joins #4