datacarpentry / python-ecology-lesson

Data Analysis and Visualization in Python for Ecologists
https://datacarpentry.org/python-ecology-lesson

Bug BBQ: ggpy or plotnine when introducing grammar of graphics - Discussion #246

Closed stijnvanhoey closed 5 years ago

stijnvanhoey commented 6 years ago

The current lesson Making Plots With ggplot is using the package ggplot/ggpy. However, when checking the repository for recent developments, it says Latest commit b6d23c2 on Nov 20, 2016. At the same time, the plotnine package is still under active development and provides the same port of ggplot (grammar of graphics) for Python users. I've had good experience with plotnine.

Would it be a good idea to switch and change the content towards the implementation of the plotnine package? I'm willing to support the conversion (cfr. we already did something similar during a recent course), but want to check if this is worthwhile?

stijnvanhoey commented 6 years ago

Just as a test, I adapted the first section plotting-with-plotnine and the syntax does not seem to be very different. Although, as mentioned in #226, the next sections may require some adaptations anyway?

wrightaprilm commented 6 years ago

This is a really good question that I'd like some broader opinions on. We have the bug bonanza coming up, so now is a good time to have it. For context: the plotting lesson has always been a bit of a beast, and a lot of that is due to the fact that there's more diversity in what people do for plotting in python, compared to R. There's a fork of this repo that gets used in a semester-long course, doing plotting in Seaborn, for instance.

We switched over to ggplot a bit ago from MatPlotLib, for greater continuity with the R lesson. And that's something I had mixed feelings on at the time. Having the continuity is nice, since learners might use multiple lesson sources. But MatPlot is part of the NumFocus ecosystem, and has a sustainability plan, etc. Basically, heading off the issue we see now with ggplot, where there is not active development happening, and a change to Pandas or python that breaks the interface might not be handled in a timely manner.

There's pros and cons here. I prefer the idea of active maintenance, for responsiveness to change. But I'm also hesitant to have the community sink the time into switching if we end up in the same place - all set up with a new package that drops out of development.

I'd like to throw out a few questions for discussion as we prep for new lesson releases:

stijnvanhoey commented 6 years ago

Good point. I actually would split the discussion in two:

  1. Is there added value in introducing the concepts of Grammar of Graphics?

I would definitely say yes. It provides a very powerful way of creating graphs for typical data.frame data.

  2. If we want to introduce the Grammar of Graphics (GoG) concept, we have to go beyond matplotlib and need to make a choice in the package to introduce.

I'll try to give a small overview (notice, this is just a small selection of the general landscape)

I would opt for plotnine:

Altair provides a more Pythonic route to GoG, but it will feel less familiar to matplotlib users, and its syntax differs from both ggpy and ggplot2.

darencard commented 6 years ago

I used ggpy/ggplot a while back for a Python Genomics workshop before it got co-opted into core Carpentry lessons. It did the job of teaching GoG well enough, but even some of the basic plots had little visual bugs. Given this was roughly when support was discontinued, I'm sure these persist in some form.

Just last week I used plotnine (anyone know what the name means???) for a workshop, and though I didn't do that lesson and therefore didn't play with it a ton, it did seem a lot more stable and is really a direct drop-in replacement for ggplot2 in R. I vote for plotnine moving forward.

Regarding whether it is better to use matplotlib, the extensive, widely-used Python default, or GoG libraries, I lean towards adopting GoG. I think it is far more intuitive to students from the beginning, when taught correctly, and the default plots are very aesthetically pleasing right out of the box. They make even simple datasets look really nice, and I think that is important for getting students excited about continuing to explore and learn after workshops. My only reservation is that matplotlib is the basis of so much in Python and has been the gold standard for so much longer than most other good plotting libraries, so it is a bit of a disservice to not expose students to it even a little.

ethanwhite commented 6 years ago

I agree with @stijnvanhoey's assessment and recommendation for switching to plotnine. I like the parallel with the R lessons and I also think that GoG makes it easier for beginners to create complicated plots which helps satisfy the "early wins for motivation" goal of the workshops.

It might also be worth noting that we don't have to switch over to the recommended plotnine notation and if we keep the import * and +\ line breaking we probably wouldn't need to touch as much code to make the switch.

stijnvanhoey commented 6 years ago

@ethanwhite the alias is indeed not technically required when importing the entire namespace of plotnine. However, considering the good practice of handling package namespaces (basically, do not import *) and the ease of exploring a library with an alias (p9. + TAB button really helps people to get to know a package), I would recommend using the alias. Feel free to review #248 to see what it looks like when converted.

ethanwhite commented 6 years ago

@stijnvanhoey I don't disagree on the best practice. My point was mostly that it is independent of the switch to plotnine. If that switch is, on its own, basically as simple as changing the import, that means there's not much cost to switching to the more full-featured, better-maintained library.

stijnvanhoey commented 6 years ago

Indeed, good point.

wrightaprilm commented 6 years ago

Right, and that's where I'm mostly hanging up. We currently do GoG, and ideally, I think that should continue to be the paradigm. Most of the tools in the lessons (Pandas, and Matplot in the SWC lessons, are all under NumFocus) have some financial support for their development, and a roadmap for development. I don't really want to be doing this dance again in a year and changing things up on our instructors yet again. In the process of deciding, what I'd like to hear more about is the plans for ongoing library development and maintenance. If there's significant interest in plotnine (vs. other GoG-like packages), then I'll tag in the maintainer (who, it looks like, commented on #248) and ask some of those questions.

stijnvanhoey commented 6 years ago

Considering the similarity with the currently used package, the current state of the plotnine package and the need for a cleanup of the GoG lesson anyway, I would propose to make the switch (of course, after reviewing the renewed content of PR #248). Keeping the current lesson, with a package not actively maintained anymore, is maybe worse than making the switch to plotnine, independent of the further development of the plotnine package? I've had positive experiences with teaching plotnine. Still, this is a suggestion and developing an alternative lesson would be fine as well.

With respect to the further development and maintenance of plotnine, I think @has2k1 can maybe comment directly?

has2k1 commented 6 years ago

Plotnine is actively maintained, and though the bug frequency is low I am still the only main developer.

The mid to long-term future of the project largely depends on the scientific python ecosystem. Depending on how Matplotlib evolves, plotnine should become more extensible; on the grammar level it has the same extension capabilities as ggplot2, and it is only held back by the plotting backend (Matplotlib). Other scipy packages will continue to sporadically influence new features.

On the issue of migrating from ggplot/ggpy to plotnine, you should consider that ggpy is just grammar-like syntax around Matplotlib, so users get some of the declarative clarity of a plotting grammar and little of the flexibility, e.g. limited to no "composability" of geoms and stats, and a lack of integration with a proper scaling backend.

wrightaprilm commented 6 years ago

That sounds reasonable to me. We have a Bug BBQ coming up in two weeks, and the last time we had one, we had really nice buy-in from both the Carpentries community and the larger Python community. So I'll leave this discussion open so people can see the discussion.

Next week, I'll merge the lesson, and create a couple issues, one for cross-platform testing of the lesson, and one for general reading over and finessing. I think this is the best way to tackle such a big change - it would be great to have additional eyes on before merge, but a lot of Carpentries contributors aren't that familiar with Git and GitHub, and we run into issues working across forks and branches.

katrintirok commented 6 years ago

I use neither ggplot nor plotnine when working with python, but instead more or less basic matplotlib.pyplot, which still seems to be standard for scientific computing in python. When I recently taught the DC ecology lesson with python together with a colleague who mainly uses python, he was surprised to see 'an R library' within the material and was not comfortable teaching ggplot. I am familiar with ggplot from working with R, and I really like its features to easily build complex graphs. However, maybe it would not be a bad idea to teach a more 'pythonian' way (if there is any), i.e. matplotlib for plotting, in the python carpentry workshops so people get the basics to build upon. When using matplotlib together with for loops one can go from simple to complex graphs quickly ... We could also have episodes with more or less the same content using different libraries (since there are always many different possible ways) and instructors could choose which library to teach. As for the plotting, basic plotting from the pandas dataframes is currently introduced in the 'starting-with-data' episode and the plotting lesson should build upon that.

stijnvanhoey commented 6 years ago

Maybe, instead of focusing on looking for a ggplot2 alternative, we could put the focus of the learning objectives of the Making plots... episode on the introduction of the Grammar of Graphics (GoG), a valuable skill in general. And with respect to this objective, plotnine and altair are the 2 main candidates (as far as I'm aware) of packages that fully support the idea of GoG (I would say altair is the more pythonic of the two, but it does not link to matplotlib).

As plotnine is built upon matplotlib (just as the pandas plotting), we have common ground there: 'make your plot with the Pandas plotting or with plotnine and further customize with the power of matplotlib.' (kind of best of both worlds). As such, all the visualization material links with each other.

katrintirok commented 6 years ago

That sounds good, introducing GoG as a general concept and then introducing how to realise it in python. The fact that plotnine is built upon matplotlib and can be customised with matplotlib is a plus. I have not used plotnine so far, but will have a better look at the lesson this week.

Nevertheless, we should still think about how to introduce the basics of matplotlib with fig and ax objects for learners new to python. When looking online for help with specific problems, solutions using matplotlib with fig and ax come up a lot.
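
For reference, a minimal sketch of the fig/ax pattern that those online answers tend to use, combined with a for loop over groups. The `surveys` table and its column names are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the lesson's survey table (values are made up).
surveys = pd.DataFrame({
    "weight": [40.0, 48.0, 29.0, 46.0, 36.0],
    "hindfoot_length": [35.0, 37.0, 33.0, 36.0, 32.0],
    "species_id": ["DM", "DM", "DS", "DM", "DS"],
})

# fig is the whole figure, ax is the single plotting area inside it.
fig, ax = plt.subplots(figsize=(6, 4))

# One scatter call per species gives each group its own colour and legend entry.
for species, group in surveys.groupby("species_id"):
    ax.scatter(group["weight"], group["hindfoot_length"], label=species)

ax.set_xlabel("Weight (g)")
ax.set_ylabel("Hindfoot length (mm)")
ax.legend(title="species_id")
fig.tight_layout()
```

The loop does by hand what a `color` mapping does in a GoG package, which may be a useful comparison to show learners.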

Also, not all python instructors may be familiar with plotnine (or ggplot/ggpy) - any experience on that, since the switch from teaching matplotlib?

From the R lessons, participants are usually impressed and have fun with the ggplot lesson, so getting a similar effect in the python lessons would be cool I guess.

wrightaprilm commented 6 years ago

The problem with this lesson is always going to be that the Python plotting ecosystem is going to be fragmented in a way that R isn't. So there will always be instructors who aren't familiar with different aspects of the way Python plots are built. I'm the only person I know who ever uses MatPlotLib, for example, beyond simply making more axes of a figure. We haven't had any specific feedback from instructors that they do or don't like the use of ggplot vs. matplot, but I do know of one fork that gets taught in a semester-long class using seaborn. Because I was very involved in the initial drafting of the lessons, I try to be a little more hands-off when it comes to maintaining them, so the lessons aren't too much of "my" thing. Instructors tend not to contribute back much unless they have really strong feelings. So the net result is that it's pretty easy for one or a couple contributors with strong opinions to drive the materials towards or away from certain tools.

I would really prefer to stick with a GoG approach, and like you noted, @katrintirok, the wow factor to the ggplot lesson is nice. At the same time, I do think you're right - I interact with matplotlib's ax and fig options quite a bit. I've been trying to overhaul the putting it all together lesson for a bit (I'll probably get to that this afternoon). Perhaps it can go there?

katrintirok commented 6 years ago

Yes, let's stick with GoG. Having some advanced features in the putting it all together lesson sounds good. I agree, python kind of appears more diversified than R, or let's say there are not as clear preferences for certain libraries.

stijnvanhoey commented 6 years ago

+1 for the 'GoG first' approach and I agree on the wow factor. If we do the fig/ax of matplotlib interaction, this could indeed be good for the all together lesson.

has2k1 commented 6 years ago

> The problem with this lesson is always going to be that the Python plotting ecosystem is going to be fragmented in a way that R isn't.

The issue is slightly deeper, i.e. the python tools are not coherent. R has the tidyverse, whereby the cleaning, manipulation, modelling and plotting of data is done with tools built around the tidy data concept. In that environment GoG plotting is not out of place. In python, users sometimes may feel like they are manipulating data just so it can be plotted with a GoG package, so using a grammar may seem odd/unnatural.

Though the python ecosystem is lacking, the solution is to introduce the tidy data concept early so that tools that work with tidy data are never out of place. Plus, as tidy data makes thinking about data clearer, having users acquire the sensibilities to view it as a best practice should help nudge the python ecosystem further along that direction.
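
As a rough illustration of what 'tidy' means in pandas terms, the sketch below reshapes a wide table (one column per year) into one row per observation with `DataFrame.melt`. The table, its column names and its values are toy numbers made up for this example.

```python
import pandas as pd

# Hypothetical wide table: one row per species, one column per year (toy values).
wide = pd.DataFrame({
    "species_id": ["DM", "DS"],
    "1977": [10, 20],
    "1978": [30, 40],
})

# Tidy form: each row is one observation (species, year, count),
# so every variable lives in its own column.
tidy = wide.melt(id_vars="species_id", var_name="year", value_name="count")
```

In the tidy form, `year` and `count` are ordinary columns, so they can be mapped directly to aesthetics in a GoG package rather than reshaped ad hoc just before plotting.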

wrightaprilm commented 5 years ago

I was looking back, and I think we've resolved this issue. I'm going to close it.