data-8 / textbook

The textbook Computational and Inferential Thinking: The Foundations of Data Science
http://www.inferentialthinking.com

Consider alternative frameworks for data structures? #187

Open cboettig opened 4 months ago

cboettig commented 4 months ago

For lots of good reasons, data-8 has, as I understand it, always relied on Berkeley's own datascience package, which offers far more intuitive/pythonic syntax than pandas. While I agree with all the pedagogical justification there relative to pandas, as you all probably know there are now much more performant and pythonic alternatives displacing pandas' dominance, specifically polars and ibis.

I think these provide a syntax that is closer to datascience than pandas, and is more nicely aligned with and informed by database theory (and indeed can be translated directly to SQL). I know this wouldn't be a small overhaul, but I think it could be a substantial improvement.

Maybe it would make more sense to migrate data100 from pandas to polars first?

davidwagner commented 2 months ago

That does sound intriguing and promising! I don't know polars and ibis well enough to judge whether this would be an improvement, nor do I have the resources to take on this major shift, but it does sound like the sort of change that might be a big improvement. It would be great to be able to run on existing standard libraries rather than relying on the datascience package, if the existing libraries are easy enough to learn and meet the pedagogical goals.

cboettig commented 2 months ago

@davidwagner very cool! @jegonzal and @fperez were discussing this a bit in the context of data-100 too and may have more insight. From what I understand, it sounds like Wes created ibis to address these issues they had in pandas in the first place [1].

fperez commented 2 months ago

Yes! I haven't had time to dig into the details of polars vs ibis, and I'm not even sure if they occupy quite the same space. But polars is definitely rapidly rising as a viable alternative to pandas, and I think we'd gain a ton from exploring this.

I also think that a combination of one or two GSIs + AI-assisted translation could make the porting of at least the base material a reasonable lift, with the faculty/textbook authors only having to do a final review of the resulting product.

It's not trivial, but it could be done in parallel over a semester if DSUS assigns one or two GSIs to the job.