MithiJ / pyret-lang

The Pyret language.

joins #4

Open shriram opened 1 year ago

shriram commented 1 year ago

@MithiJ will see what joins Pandas has and suggest what makes sense for us.

Mithi, please tag Joe and Kathi once they have joined the repo.

MithiJ commented 1 year ago

Tagging @kfisler here too.

concat() - Instead of calling .addrow() or .addcolumn() repeatedly, you can concatenate two tables. I think this will be easy to implement, but it seems unnecessary as long as we can recursively add rows or columns from one table into another. Concatenation involves no further consideration of joining based on indexes or columns, preserving old values or new ones, preserving old or new indices, etc.
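To make the row-wise case concrete, here is a minimal sketch in plain Python. It is not Pyret or Pandas API: the name `concat_rows` and the representation of a table as a list of dict rows are assumptions made only for illustration.

```python
# Hypothetical sketch: a table is a list of rows, each row a dict keyed by column name.
def concat_rows(top, bottom):
    """Return a new table with the rows of `top` followed by the rows of `bottom`."""
    if top and bottom and set(top[0]) != set(bottom[0]):
        raise ValueError("tables must have the same columns")
    # Copy each row so the result shares no row objects with the inputs.
    return [dict(row) for row in top + bottom]


fall = [{"name": "Ada", "grade": 90}]
spring = [{"name": "Bo", "grade": 85}, {"name": "Cy", "grade": 70}]
combined = concat_rows(fall, spring)
```

The caller never writes a loop or a recursive function; the two tables are simply stacked, and no keys or indices are compared.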

merge() - There are 5 options here: inner, outer, cross, left, and right. These are the same options as for .join(). In merge(), a "how" parameter selects one of these 5 options. An inner join keeps only the rows whose key appears in both tables (the intersection). An outer join takes the union of the rows and fills in NaN for any values that are missing because a row appears in only the left or only the right table. Left and right joins output every row from the left or right table, respectively, with the corresponding values from the other table, using NaN where the other table has no corresponding value. A cross join pairs every row of one table with every row of the other.
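As a quick illustration of which rows survive under each option, the sketch below works only with sets of key values. The key values are made up, and it assumes each key appears at most once per table.

```python
# Illustration only: which key values appear in the result for each join kind.
left_keys = {1, 2, 3}
right_keys = {2, 3, 4}

kept = {
    "inner": left_keys & right_keys,  # keys present in both tables
    "outer": left_keys | right_keys,  # keys present in either table
    "left": left_keys,                # every key from the left table
    "right": right_keys,              # every key from the right table
}

# A cross join ignores keys: it pairs every left row with every right row.
cross_row_count = len(left_keys) * len(right_keys)
```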

Some notable design considerations:

Thus, it seems useful to have a merge function that takes the 2 tables to be merged, a "how" parameter specifying the type of merge, and an "on" parameter specifying the column or index to merge on. This function would create a new table with elements from both tables as specified and return it.
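One possible shape for such a function, sketched in plain Python rather than Pyret. Everything here is an assumption for illustration: tables are lists of dict rows, missing values are `None` instead of NaN, `on` names a single column that is the only column the two tables share, and the cross option is left out.

```python
def merge(left, right, on, how="inner"):
    """Hypothetical sketch: join two tables (lists of dict rows) on column `on`.

    Returns a new table and leaves both inputs unchanged. Missing values
    are filled with None.
    """
    if how not in ("inner", "outer", "left", "right"):
        raise ValueError("unsupported how: " + how)
    left_cols = [c for c in (left[0] if left else {}) if c != on]
    right_cols = [c for c in (right[0] if right else {}) if c != on]
    result = []
    matched_right = set()  # positions of right rows that found a partner
    for lrow in left:
        partners = [i for i, rrow in enumerate(right) if rrow[on] == lrow[on]]
        matched_right.update(partners)
        for i in partners:
            result.append({**lrow, **right[i]})
        # Unmatched left rows are kept only by left and outer joins.
        if not partners and how in ("left", "outer"):
            result.append({**lrow, **{c: None for c in right_cols}})
    # Unmatched right rows are kept only by right and outer joins.
    if how in ("right", "outer"):
        for i, rrow in enumerate(right):
            if i not in matched_right:
                result.append({**{c: None for c in left_cols}, **rrow})
    return result


people = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bo"}]
scores = [{"id": 2, "score": 85}, {"id": 3, "score": 70}]
inner = merge(people, scores, on="id")
outer = merge(people, scores, on="id", how="outer")
```

Here `inner` holds only the row for id 2, while `outer` holds three rows, with `None` for Ada's score and for the name that goes with id 3.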

kfisler commented 1 year ago

Concat would be useful. Yes, one could implement that with add-row and recursion, but we want students to be able to do interesting table manipulation before they get to recursion.

As for merge: indexing indeed adds complexity. I'd rather we first agree on what operations we want to let students do with indices, then we can decide what index support makes sense. Pandas uses indices to let programmers access rows, for use in either lookup or update operations.

I definitely see value in offering an operation that lets people locate a row without explicitly filtering on a predicate. We've talked about operations like "find me the row in which column c has value x". To do more than that, we need a notion of tagging that is not part of the row data itself. Would we make those indices visible? I see confusion being induced either way.
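For concreteness, the "find me the row in which column c has value x" operation could look roughly like the sketch below. It is plain Python, not a proposed Pyret signature; `find_row` and the list-of-dicts table representation are hypothetical.

```python
def find_row(table, column, value):
    """Return the first row whose `column` equals `value`, or None if there is none."""
    for row in table:
        if row[column] == value:
            return row
    return None


roster = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Bo"},
]
bo = find_row(roster, "id", 2)
missing = find_row(roster, "id", 99)
```

This needs no index beyond the data itself: a column with unique values plays that role.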

From the perspective of CS111, I like keeping things clean, which means not having indices other than a column with unique values. Shriram's mileage may differ based on 19. @shriram ?

shriram commented 1 year ago

I think our needs are sufficiently similar. We can always add more things later.