GiovineItalia / Gadfly.jl

Crafty statistical graphics for Julia.
http://gadflyjl.org/stable/
Other
1.9k stars 250 forks source link

Support Tables or TableTraits or both #1292

Open davidanthoff opened 5 years ago

davidanthoff commented 5 years ago

IterableTables.jl used to have Gadfly integration pre julia 1.0. What that enabled was that one could pass any tabular source, not just a DataFrame to Gadfly's plot and things would just work. That included support for lots of different table types (TypedTables, IndexedTables, JuliaDB etc.), file IO (CSV, Excel, Feather, lots of stats files) and Query results. That support unfortunately didn't survive the transition to julia 1.0. I just looked into reenabling it, but for various reasons that would be more complicated with the recent changes to Gadfly.

But, the whole idea of having the integration live in IterableTables.jl was a pretty ugly hack in any case, and so I'm wondering whether it would make more sense to just make Gadfly.jl natively support TableTraits.jl/Tables.jl out-of-the box. Is that something the maintainers of Gadfly would consider?

Broadly speaking I believe the strategy in such a case would be to completely remove the DataFrames integration code here in the package, and instead just support any TableTraits/Tables source (which includes DataFrame, so no functionality would be lost). Just as an example, VegaLite.jl has used that strategy for more than a year now, without a single hiccup or problem and great integration with lots of different data sources.

I think this also has the potential to greatly help with load times of the package, at least medium term. I think a consensus is emerging that Requires.jl increases load times of packages quite a bit. In the scenario I'm suggesting, Requires.jl would no longer be needed at all for the data source handling. Instead, Gadfly.jl would take a normal dependency on a very small, fast loading table interface package (which I'm 99% certain would load much faster than Requires.jl). Of course at that point there is still the TerminalExtensions use-case for Requires.jl, but I'm fairly confident that one might be able to get rid of this as well, at least medium term (with e.g. https://github.com/JuliaLang/julia/pull/30922). So I think there is a path that would allow Gadfly.jl to drop the Requires.jl dependency, and that should help with package load times.

There are currently two strategies for getting this kind of integration going:

  1. You use Tables.jl. On the upside you get full interop with any TableTraits.jl and any Tables.jl source that way. On the downside, the package currently has a dependency on Requires.jl and loads somewhat slowly (it is pretty fast still, but slower than maybe needed), so at least in the short run it actually wouldn't allow you to get rid of the Requires.jl dependency. I believe that this will change shortly, though, so probably not a big issue. Generally that package has more churn and more stuff in it than option 2, and that is probably something you just need to evaluate, whether you want to take a dependency on that or not.
  2. You directly use just TableTraits.jl (this is the strategy in VegaLite.jl). You get full interop with any TableTraits.jl source that way, plus in theory most Tables.jl sources should also be compatible (but I think in practice that might not be the case right now). On the upside, the packages you need to depend on for that probably load faster than Tables.jl right now (but I think that benefit will disappear once Tables.jl has shed its dependency on Requires.jl), and it is probably fair to say that the packages you would depend on have a lot less churn than Tables.jl (most of the code there has been stable for 1-2 years). On the downside, you would probably lose integration with a couple of Tables.jl sources that don't currently provide the TableTraits.jl interface.

I should also say that I think @quinnj and I might be able to further shrink/unify the TableTraits.jl/Tables.jl story at juliacon, so hopefully those kind of choices will disappear in the near future.

My gut feeling is that maybe the best strategy would be to start a PR that adds Tables.jl integration, but then hold off merging until Tables.jl gets rid of the Requires.jl dependency.

bjarthur commented 5 years ago

if a unification of a generic tables API is in the works, i'm inclined to hold off on any changes to Gadfly until then.

also, what was the problem with updating support for Gadfly in IterableTables?

cmcaine commented 5 years ago

Tables.jl no longer depends on Requires.jl, by the way.

bjarthur commented 5 years ago

what's the latest on unifying the TableTraits / Tables API?

quinnj commented 5 years ago

David and I didn't get a chance to talk about it at JuliaCon, which we had planned.

The reality though, is that Tables.jl is already a superset of TableTraits.jl in terms of a pure table interface. i.e. if you implement the Tables.jl interface, you get TableTraits "for free" by then defining IteratorInterfaceExtensions.getiterator(x::MyTable) = Tables.datavaluerows(x), so that's what I've been encouraging people to do, as it gives you the best of both worlds, especially given the current reliance on DataValues of TableTraits, whereas the rest of the ecosystem just relies on the builtin missing, i.e. normally it's more natural and simpler to implement Tables.jl than jump through DataValue-hoops for TableTraits. If someone spells out how exactly they want to use Tables.jl in this package, I'm happy to put a quick PR together to show the basics, the nice thing is it's usually not more than a 10-15 minute exercise.

davidanthoff commented 5 years ago

Yeah, I think if Gadfly would simply use either Tables.rows or Tables.columns, it would work with all sources that exist today (i.e. both Tables.jl and TableTraits.jl sources), and I would also think that any unification of table interfaces would not break a package that simply used one of those two functions.

tlnagy commented 5 years ago

I'm very much in favor of replacing our lazy loading of DataFrames into Gadfly via Requires with the standard table interface. Ideally we would then remove our dependency on Requires completely afterwords.

bjarthur commented 5 years ago

we also use Requires for TerminalExtensions

OkonSamuel commented 4 years ago

Sorry wanted to find out the status of this issue

bjarthur commented 4 years ago

nothing has changed in Gadfly. what's the latest with a unified table interface?

OkonSamuel commented 4 years ago

I guess it's still distinct as we have TableTraits.jl and Tables.jl. There has been no unification as to the best of my knowledge but Tables.jl no longer has a dependency on Requires. @quinnj am i correct?

quinnj commented 4 years ago

I think David's comment from last year is still accurate and the best way forward:

Yeah, I think if Gadfly would simply use either Tables.rows or Tables.columns

Tables.jl will automatically work with any TableTraits.jl source, and do the DataValue unwrapping automatically. Tables.jl has also cleaned up it's dependencies a lot over the last 1.5 years, and doesn't use Requires.jl anymore.

So all-in-all, I think if someone is interested in doing the work here, they could write a single generic table plotting interface that just iterated over rows from Tables.rows(x) and a single dependency on Tables.jl will do the job.