VinylRecords / Vinyl

Extensible Records for Haskell. Pull requests welcome! Come visit us on #vinyl on freenode.
http://hackage.haskell.org/package/vinyl
MIT License

how does Vinyl fare on B2T2? #172

Open shriram opened 3 months ago

shriram commented 3 months ago

Hi folks — this may not be of interest to you, but just popping this in here in case it is.

Vinyl looks like it could be a good fit for properly typing B2T2, a benchmark for typed tabular programming:

https://github.com/brownplt/b2t2/

We'd certainly be very curious to see the result if you're interested in showing how far you get on the benchmark. In turn, because it's an independently-defined benchmark, it may also help you make a case for the strength and flexibility of Vinyl's design (as opposed to a benchmark you design yourself). Finally, it would show that one can have a fully typed, and hence statically safe, solution to the kinds of programs people write in dynamic languages like Python and R.

acowley commented 3 months ago

This is an excellent effort, @shriram, thank you for tagging this repo! I think this would be relevant to Frames, where we augment Vinyl with various functionality to aid with data frame work. That said, both this repo and that one are fairly low traffic at the moment, as I have been a poor maintainer of late due to time constraints. But see this issue for some reflections on API ergonomics.

To set expectations, Vinyl was an investigation into manipulating records more fluidly in Haskell. Frames was begun to demonstrate that one could have a workflow where Vinyl records are defined based on a CSV data file: field names are drawn from a CSV header row, and the column types are heuristically inferred from the data file itself. The idea is that the compiler helps keep one-off data analysis programs in sync with the data.

In order to make the data analysis part more efficient, Frames will help with switching between a columnar representation (struct of arrays) and a more naive row-based array of structs representation. That said, the flexibility of filtering and grouping columns in a data frame in Frames does not compete with what is available in the R ecosystem in particular.
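A minimal sketch of the two layouts mentioned above, using only base Haskell. This is not the Frames API: the Person and PersonColumns types and the toColumns, toRows, and meanAge functions are hypothetical names chosen for illustration, and plain lists stand in for the packed arrays a real columnar store would use.

```haskell
-- Row-oriented layout ("array of structs"): one record per row.
-- Person is a hypothetical example type, not something Frames defines.
data Person = Person
  { name :: String
  , age  :: Int
  } deriving (Eq, Show)

-- Column-oriented layout ("struct of arrays"): one sequence per field.
-- Lists are used here only to keep the sketch dependency-free.
data PersonColumns = PersonColumns
  { names :: [String]
  , ages  :: [Int]
  } deriving (Eq, Show)

-- Convert rows to columns by projecting each field out of every row.
toColumns :: [Person] -> PersonColumns
toColumns rows = PersonColumns (map name rows) (map age rows)

-- Convert columns back to rows by zipping the fields together.
-- If the columns have different lengths, zipWith truncates to the shortest.
toRows :: PersonColumns -> [Person]
toRows (PersonColumns nameCol ageCol) = zipWith Person nameCol ageCol

-- A column-wise aggregate only touches the one column it needs.
-- For an empty table this evaluates 0 / 0, which is NaN.
meanAge :: PersonColumns -> Double
meanAge cols =
  fromIntegral (sum (ages cols)) / fromIntegral (length (ages cols))
```

Loaded in GHCi, toRows (toColumns rows) gives back the original rows, and meanAge reads only the ages column, which is the access pattern a columnar representation is meant to make cheap.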

I'm interested in seeing how B2T2 can help us gain some insight into which underlying algebraic operations the useful manipulations are built upon. For instance, we added melt to Frames, but, while we might dress it up, it was built to match what R did rather than derived from any mathematical definition.

Setting aside limitations with type level programming in Haskell, an enduring impression I developed over the years of working with Frames is that it can be hard to avoid clunky types. For instance, beginning with a set of (String, Type) pairs feels like a comfortable way to think about a record or table (accepting String as an identifier). You might then give that set of fields a name as you would a table, e.g. Person. But now you manipulate the structure of the data by removing a column, say, and you have a type that gets written in Frames as something like RDel Person Age, indicating a structure isomorphic to Person with its Age column removed. But seeing, much less writing, that type is very high friction. You might come up with a name for the new type, but naming is famously hard.
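To make the shape of such a type concrete, here is a minimal sketch that uses only GHC extensions and base, not Vinyl or Frames. The Rec type, the RDel type family, and the dropAge and personName functions are hypothetical stand-ins written for this illustration; the real libraries define their own, richer versions.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}

import Data.Kind (Type)
import Data.Type.Equality ((:~:) (Refl))
import GHC.TypeLits (Symbol)

-- A record indexed by a type-level list of (name, type) pairs.
-- Hypothetical stand-in for a Vinyl-style record.
data Rec (fields :: [(Symbol, Type)]) where
  RNil :: Rec '[]
  (:&) :: t -> Rec fs -> Rec ('(name, t) ': fs)

infixr 5 :&

-- Remove the first field with the given name from a field list.
-- The repeated variable "name" in the second equation only matches
-- when the head field's name equals the name being deleted.
type family RDel (fs :: [(Symbol, Type)]) (name :: Symbol) :: [(Symbol, Type)] where
  RDel '[] name = '[]
  RDel ('(name, t) ': fs) name = fs
  RDel (f ': fs) name = f ': RDel fs name

-- Naming the full field list is comfortable...
type Person = '[ '("name", String), '("age", Int), '("city", String) ]

-- ...but the derived table has no name of its own, so this type
-- shows up wherever the trimmed record is mentioned.
dropAge :: Rec Person -> Rec (RDel Person "age")
dropAge (n :& _ :& c :& RNil) = n :& c :& RNil

-- Compile-time evidence that RDel reduces to the expected field list.
rdelCheck :: RDel Person "age" :~: '[ '("name", String), '("city", String) ]
rdelCheck = Refl

-- Read the leading "name" field from any record that starts with one.
personName :: Rec ('("name", String) ': fs) -> String
personName (n :& _) = n
```

In GHCi, :kind! RDel Person "age" shows the reduced field list. The friction described above is visible in the signature of dropAge: the result is only ever spelled as RDel Person "age" unless someone invents a new name for it.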

In data analysis scripts in R or Python, you can have similar challenges with variable naming, where you might load your data as people <- read.csv('my_data.csv') and then change the people variable to point to a table with the age column removed because you don't care about the old table anymore. If you want to retain a reference to the original data, maybe you name this new thing people2 or people_no_age, but you're rarely happy.

Much like how in Haskell we are able to avoid naming temporary values by leaning heavily on composition aided by the type checker, I found in looking at my own work with R, Python, and SQL, that I tended to either avoid naming the temporary thing by compressing things into a single expression, or I'd have a kind of linearity of naming where once data was consumed, the old name was free to refer to a value of a different type. I haven't gone back to try to demonstrate that kind of fluidity in Haskell, but I'd like to.
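As a small illustration of the "single expression" style described above, here is a sketch in plain Haskell over tuples. It is not a demonstration of the fluidity with Vinyl or Frames records that the comment says is still to be attempted; Person, dropAge, and namesInCity are hypothetical names, and a tuple stands in for a table row.

```haskell
import Data.List (sort)

-- A row as a plain tuple: (name, age, city). Hypothetical example type.
type Person = (String, Int, String)

-- Drop the age column, keep rows for one city, and return sorted names.
-- The intermediate table of (name, city) pairs is never bound to a
-- variable, so neither it nor its type has to be given a name.
namesInCity :: String -> [Person] -> [String]
namesInCity city =
  sort . map fst . filter ((== city) . snd) . map dropAge
  where
    -- Project away the age field.
    dropAge (n, _, c) = (n, c)
```

The type checker still verifies every stage of the pipeline; only the programmer is spared from naming the table with the age column removed.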