JuliaData / Tables.jl

An interface for tables in Julia
https://juliadata.github.io/Tables.jl/stable/
MIT License

TableTraits.jl/Query.jl integration issues #7

Closed davidanthoff closed 5 years ago

davidanthoff commented 5 years ago

I just had a quick look through the integration, and here are some problems I spotted:

  1. This line will break if there is a case of EltypeUnknown, I believe? This is one of the new changes for the 0.7/1.0 version of TableTraits.jl: an iterable table can return EltypeUnknown, and a sink needs to handle that.
  2. In general I think this package should not assume anything about Enumerable or depend on it; that is really an implementation detail. The proper way to consume a query result is to just follow the iterable tables interface for consuming it. The current version might work right now for some queries, but it won't work for any other Query.jl backend, and it doesn't work for all the other TableTraits.jl sources there are (all the file IO etc.). In general one should treat all the Enumerable types as non-public and not rely on their existence (i.e. all details about that part of Query.jl might break at any moment).
  3. How would this turn a Tables.jl source into an iterable table? Would folks have to manually define getiterator etc.? But then they have to take a dependency on the Queryverse stuff, which I think you wanted to avoid? I didn't look very long, so maybe I missed something there?

I haven't really reviewed this carefully, so there might be more stuff coming :) I also have very little time the next 1-2 weeks, semester started and that is always a tricky phase.

quinnj commented 5 years ago
  1. This is a fair point and one that I've definitely pondered over; the conclusion I came to for the moment is that I'm not actually aware of any cases where sources will return EltypeUnknown, particularly in the table world. I certainly understand the use-case in the more general Query world where you might be dealing w/ all sorts of kinds of iterators, but at least in the table world, I wonder if there actually would be a case where a table source didn't know its own schema? I remember seeing this change in TableTraits, but how is it actually planned to be handled? Does each sink have to account for that case specifically? Would you provide some kind of helper function to "collect" that case? It's certainly something I'd like to keep discussing/pondering, but also not an urgent issue as far as I'm aware.
  2. Hmmm, I didn't realize Enumerable was such an implementation detail. The main idea here was just to find some kind of "post-hook" for tables coming out of Query/QueryOperators where we could do the automatic DataValue-unwrapping. For now, this works quite well for those cases specifically, but we'll have to think through a more robust solution here.
  3. Yes, the idea is that all a Tables.jl source would need to do is define IteratorInterfaceExtensions.getiterator(x::MyTable) = Tables.getdatavaluerows(x) and that's it. Luckily, IteratorInterfaceExtensions is about the lightest dependency there is, so I don't think it will be too hard convincing folks to add a one-liner for easy queryverse integration.
quinnj commented 5 years ago

Thinking more about number 2. I think I'm actually ok moving forward w/ the current implementation based on Enumerable, here's my thinking:

If anything, it makes me realize that the datavalues.jl code and the assumption that Tables.jl packages will do IteratorInterfaceExtensions.getiterator(x::MyTable) = Tables.getdatavaluerows(x) might not be the best long-term solution. As you mention, there might be other Query backends at some point, and they might deal perfectly well w/ Union{T, Missing} and not require DataValues at all. The key for me here is that I think it's only the Enumerable(Map) implementation that really requires DataValues, so the question is how we best keep that requirement close to the implementation.

davidanthoff commented 5 years ago

On 1: That is going to hit very soon (essentially the next Query.jl release, so maybe not in two weeks, but certainly in something like 3-4 weeks). Whether one will get an EltypeUnknown or not is unrelated to the table/not table question, i.e. one will get that for tables as well. Essentially this will happen whenever type inference gives up on a query because it is too complex, and that can happen with queries that produce a table at the end. That whole scenario is actually quite common, and right now things just stop working in that case, so moving to this new model is a high priority for me. Yes, sinks will have to deal with this, in the same way that any generic iterator consumer today should probably deal with this. For sinks that use TableTraitsUtils.jl to implement the sink behavior this will all happen transparently and automatically (essentially I plan to just use @piever's collection code from IterableTables.jl in that place).
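
To make that concrete, here is a minimal sketch, using only Base's iterator traits, of the kind of branch a sink ends up needing (illustrative only, not the TableTraits.jl API):

# branch on whether the element type of an iterator is known up front
function consume(itr)
    if Base.IteratorEltype(itr) isa Base.HasEltype
        # schema-like information is available before iterating
        return collect(eltype(itr), itr)
    else
        # Base.EltypeUnknown(): element types must be discovered while
        # iterating, widening as needed -- Base's collect already does this
        return collect(itr)
    end
end

consume([1, 2, 3])            # eltype known up front
consume(i * 2 for i in 1:3)   # a Generator reports EltypeUnknown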

I don't think we should put off a solution for this; in my mind this is the biggest mismatch right now between Tables.jl and Queryverse.jl. I think your plan is that essentially sinks can implement the Tables.jl interface and become a TableTraits.jl sink "for free", but here is the tricky aspect of that: I don't see how that can work if the Tables.jl API fundamentally "promises" sinks that they can get hold of a schema before they start to fetch the data. That is at the core an assumption about sources that doesn't hold for Queryverse sources.

I think the only option there is to make the schema function optional in Tables.jl, i.e. make it clear that a sink needs to work even if there is no schema. I know that is not nice, and I was super reluctant to make that change in TableTraits.jl (I tried to avoid that for a long time and looked into a LOT of possible solutions), but I think at the end there is no way around that if an interface wants to cover all possible table sources.

Here is why I think this should not be put off: if you release a version of Tables.jl that promises sinks that they can get a schema, and then you try to change that later, you will have to break sinks, and I think that is just too disruptive for something that you hope will be very deep in the dependency graph.

davidanthoff commented 5 years ago

On 2: Again, I really don't think you should take a dependency on Enumerable. I can guarantee you that this will break and lead to lots of instability, it is an implementation detail and we will break that interface as we iterate Query.jl (not just when we add new backends). And because the integration code lives behind a Requires.jl story, it would be super hard for you to deal with that (you essentially can't use the package manager version resolution for this at all). The current Tables.jl implementation also only works for Query.jl, but none of the other Queryverse sources (CSVFiles, FeatherFiles, ExcelFiles, StatFiles, ParquetFiles etc.).

Also note that the reason one should consume a Queryverse.jl table via the getiterator story is not the DataValue story (it is a nice side effect that we can solve the bridging problem this way), but really the core of the monadic design of LINQ. There is one core, brilliant idea at the center of LINQ (and I just copied that for the Queryverse), and it is that one can get crazy composition if one sticks with an iterator monad as the core bridge between everything. If we start to now do things like this dispatch here on Enumerable, we really, really break that core idea in a bad way. There are some really brilliant papers by folks like Erik Meijer on this, and Microsoft's Channel 9 has a lot of cool videos that explain this design, I think with Brian Beckman, Bart De Smet etc.

Having said all this, I think there might be a way out of this now that you added the istable function, but I'll have to think a bit more about that...

quinnj commented 5 years ago

Thanks for the comments @davidanthoff; there are certainly some things to noodle on there. I'm going to study the collect code in IndexedTables and think a little more about how missing schemas affect the Tables.jl story. In a purist vein, I have to wonder: if an object can't give a reliable schema, is it really a table? I know we like to get fancy about transforms and things and letting their types get "pulled" through collection, but I think we do have to ask ourselves if those should really be considered tables. Projection tables, if you will. You have to remember that I come from the database/file-formats world of tabular data, so I don't always think about these "transform" kinds of use-cases. W/ databases, files, and in-memory structures, there's always a schema, period.

But realistically, I don't think that's convenient for users. They want to do lazy projections & filtering over tables and still send those results to any sink.

One thing I've been contemplating is whether "schema" should only be required on the result of Tables.rows(x) or Tables.columns(x); those are, after all, the objects (sometimes very different from the original object) actually producing the values, and so it seems would have easier access to their schemas. But I still don't think that helps the projection case when inference fails; in that case, you really have to just start iterating and see what you get.

Kk, let me noodle on this a bit more.

piever commented 5 years ago

I think David makes a very good point that this should allow for tables with unknown eltype. Using the same trick as map does, one can collect one of these row iterators without loss of performance (meaning: if the eltype is constant there is no overhead; if it isn't, then of course the algorithm stops in the middle and copies the offending column to a larger type): the code currently lives here.

One plan would be:

piever commented 5 years ago

In general I feel that a few methods of IndexedTables (collect_columns, to get a NamedTuple of vectors from an iterator of NamedTuples) as well as the iteration method (to iterate rows efficiently from a NamedTuple of vectors) should actually live here and serve as fallbacks for columns and rows respectively.
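
For concreteness, the columns-to-rows direction in its simplest form could be just a lazy generator over a NamedTuple of vectors (an illustrative sketch, not the IndexedTables code; the rows-to-columns direction is sketched further down in this thread):

# assumes `cols` is a NamedTuple of equal-length vectors
rows_from_columns(cols::NamedTuple) =
    (map(c -> c[i], cols) for i in 1:length(first(cols)))

collect(rows_from_columns((a = [1, 2, 3], b = ["x", "y", "z"])))
# yields the NamedTuple rows (a = 1, b = "x"), (a = 2, b = "y"), (a = 3, b = "z")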

quinnj commented 5 years ago

See the code starting here for the current rows and columns fallbacks. I feel like they're in pretty good shape, but obviously as we're discussing here, they don't explicitly handle the unknown eltype case. For the RowIterator, it would be pretty straightforward, but yes, we essentially would want a version of buildcolumns that takes the collect strategy.

piever commented 5 years ago

The RowIterator approach is very interesting and different from IndexedTables. IndexedTables uses generated functions to materialize the whole NamedTuple efficiently, whereas you use a "lazy version" where the field only materializes when you call getproperty (you could even define a setproperty! there that would modify the column in place, but that's a separate discussion). I'm curious whether IndexedTables should be ported to using this approach. To get the best of both worlds, there could maybe be a constructor here to create the NamedTuple from your "lazy row" (but I'm afraid it needs generated functions to be efficient, and I'm never sure whether having an actual NamedTuple gives you an advantage).

Should I make a PR to use the technique from IndexedTables to generalize buildcolumns to the non constant case?

quinnj commented 5 years ago

Yes, the lazy row is actually pretty crucial to the optimizations of the Tables.jl interface; it allows only reading/materializing the columns that are desired, in, for example, a Query.jl @map call. It's also much more efficient in the case of DataFrames, because it already defines its own DataFrameRow type which is a view into the original DataFrame.
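
For readers following along, the lazy-row idea boils down to something like this (an illustrative sketch; the actual Tables.jl and DataFrames row types are more involved):

# a row is just a reference to the columns plus a row index
struct LazyRow{T <: NamedTuple}
    columns::T
    i::Int
end

Base.propertynames(r::LazyRow) = propertynames(getfield(r, :columns))
# a field is only materialized when it is actually accessed
Base.getproperty(r::LazyRow, nm::Symbol) =
    getproperty(getfield(r, :columns), nm)[getfield(r, :i)]

cols = (a = [1, 2, 3], b = ["x", "y", "z"])
r = LazyRow(cols, 2)
r.a, r.b   # (2, "y"); only the accessed columns are ever read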

I'd certainly welcome a PR to collaborate on the collect stuff!

piever commented 5 years ago

On it! I think it should not be such a large change, as the setup in Tables is already quite nice. I only have one doubt: buildcolumns of an iterator could sometimes "strictify" things (as it relies on the types it finds, not on the declared ones). This is analogous to collect in Base:

julia> v = Any[1, 2, 3];

julia> collect(v)
3-element Array{Any,1}:
 1
 2
 3

julia> collect(i for i in v)
3-element Array{Int64,1}:
 1
 2
 3

Are we OK with that behavior?

piever commented 5 years ago

I actually think there are two distinct behaviors that can be handled with two different dispatches.

One is buildcolumns(sch, rowitr), where the schema is known and rowitr doesn't need to iterate NamedTuples, because the NamedTuple structure is known thanks to sch; the other is buildcolumns(rowitr), where the schema is not known, so we demand that rowitr iterates actual NamedTuples, which are used instead of sch. If the "row type" is constant, this will work just as well. I'm unsure whether we should support a non-NamedTuple iterator without a schema.
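
For the schema-known entry point, the shape of the code could be roughly this (hypothetical signature, not the actual Tables.jl implementation; the schema-less method instead has to take its structure from the rows themselves, e.g. from the first NamedTuple it sees):

# schema known: allocate correctly-typed columns up front, rows only need getproperty
function buildcolumns_sketch(names::NTuple{N, Symbol}, types::Tuple, rowitr) where {N}
    cols = NamedTuple{names}(ntuple(i -> Vector{types[i]}(), N))
    for row in rowitr
        for nm in names
            push!(cols[nm], getproperty(row, nm))
        end
    end
    return cols
end

buildcolumns_sketch((:a, :b), (Int, Union{Float64, Missing}),
                    ((a = i, b = i == 2 ? missing : i / 2) for i in 1:3))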

quinnj commented 5 years ago

Hmmm, that seems less than ideal. I'm not quite sure why we need to require a NamedTuple iterator in the UnknownEltype case? We should be able to do something very "collect"-like by using propertynames(row) and getproperty(row, nm) for accessing values? Does the IndexedTables code currently rely on at least having the NamedTuple types when collecting?
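
Concretely, a collect-like step over the property interface could materialize any row type, not just NamedTuples (an illustrative sketch, not actual Tables.jl code):

# turn any "row" (NamedTuple, struct, ...) into a NamedTuple using only
# propertynames and getproperty
function materialize_row(row)
    nms = propertynames(row)
    return NamedTuple{nms}(map(nm -> getproperty(row, nm), nms))
end

struct MyRow
    a::Int
    b::String
end

materialize_row(MyRow(1, "x"))      # (a = 1, b = "x")
materialize_row((a = 1, b = "x"))   # NamedTuples work the same way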

piever commented 5 years ago

IndexedTables relies on having the NamedTuple type, though this requirement can probably be relaxed a bit. So for you a row is basically any struct (which I like a lot!) and propertynames determines what the column names are? The key point IMO is that if row is a NamedTuple and nt is a NamedTuple of Vectors holding the columns, I can create a function that, based only on the type of row, tells me whether all the types are compatible (so in the type-stable case this gets optimized away) and:

With a general lazy row type, I can't tell from the type of the row what the types of the fields are, and I'm afraid this can cause problems (at least it prevents translating the plan sketched above literally). I see two possible ways forward:

  1. Add some trait fieldtypes(::Type{T}) where {T} that for a row type T returns the tuple of types of the fields.
  2. Try to remove the requirement of knowing the types of elements from the type T of the row.

2) is a bit more challenging than the current implementation, but is probably the principled way forward (and I think is much more robust to the issue discussed on Discourse of the combinatorial explosion of possible missing combinations in each row).

I think a way to do 2) would be to add a fallback method to the function add!(...) such that if the element's type is not a subtype of the column's element type, it doesn't do anything and returns nothing. Then the function unroll would also be such that if a call to f returns nothing, the whole function exits and returns some way of signaling that adding the row failed (and where it failed: for example we could return nothing in case of success and the first field where it fails in case of failure). All of this should be optimized away in the type-stable case if my intuition is correct. Then the main function, if it gets a nothing from unroll, would know that things failed, so it would widen the columns (the NamedTuple of Vectors) and try again.

What do you think? If you are in favor of 2), I could actually use some help in changing unroll to do the desired thing (return the field that failed - meaning where f returned nothing) as I don't completely understand that code (in particular the runlength part). If this does not have a performance impact in the type stable case, it could be a useful change anyway.
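
A stripped-down version of that widen-and-retry idea, for a single column only, might look like the following (the real code would unroll over a NamedTuple of columns, and here the failure signal is a Bool rather than nothing):

# try to push `v` into `col`; signal failure instead of erroring
trypush!(col::Vector{T}, v) where {T} = v isa T ? (push!(col, v); true) : false

function collect_widening(itr)
    y = iterate(itr)
    y === nothing && return []          # empty case: nothing to infer
    v, state = y
    col = [v]                           # the first element fixes the initial eltype
    while (y = iterate(itr, state)) !== nothing
        v, state = y
        if !trypush!(col, v)
            # widen the column to also accept typeof(v), then retry the push
            col = convert(Vector{Union{eltype(col), typeof(v)}}, col)
            push!(col, v)
        end
    end
    return col
end

collect_widening(x for x in Any[1, 2, 3.5, missing])
# the element type is widened as needed to hold Int, Float64 and Missing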

quinnj commented 5 years ago

I made a start at handling missing schemas; I didn't do any fancy unrolling yet, but just wanted to get a first draft up of what I've been thinking: https://github.com/JuliaData/Tables.jl/pull/10

davidanthoff commented 5 years ago

I just took a very brief look at the current version of the code (I still don't have much time, I'm afraid). The biggest issue I still see is the dependency on Enumerable; I really don't want us to go forward with that and release a version that does this. One core design goal in the Queryverse is to never have breaking releases anywhere (except when there is a breaking Julia release). The current design in TableTraits.jl was carefully chosen to enable that, and it has worked out well over the past two years (I think I literally haven't had a single breaking release anywhere, ever). With the current design here, though, that will no longer work, and even just the next release of Query.jl that we are working on will break this current version here.

So from my point of view I think we should either a) try to fix this before Tables.jl gets released widely or b) hold off on the integration for now and go with something like this until we have figured this out properly. I think this mostly comes down to a timeline decision. b) could be done very quickly and you could release Tables.jl soon, a) is a bit less clear to me.

quinnj commented 5 years ago

@davidanthoff, can you comment further on the upcoming Query redesign? Is there a branch somewhere I could look at? Any design notes or public information on how things are being redesigned? Are you actually changing the Enumerable definition? If we can avoid that, then I still think we should be ok; the code is now set up so that if the current definition of Enumerable exists, then the code will load, otherwise it's ignored. I would definitely be interested in the planned redesign and how we can hook Tables.jl in.

quinnj commented 5 years ago

It would also be helpful to discuss timeline: is the Query redesign coming in the next few weeks, or are we talking a few months? If the former, I can understand the hesitation, but for the latter, I feel like that gives us enough time to adapt appropriately while releasing Tables.jl sooner.

davidanthoff commented 5 years ago

No, there is nothing public on our plans available right now. The rough story is that we will have many more types than just the Enumerable* story, and they won't derive from Enumerable. We might also change the story around Enumerable, i.e. that type might go away, but that is more speculative right now. I don't have a concrete timeline, but we gained some new team members here on campus for this, so I expect things to move faster than in the past.

The problem shows itself today already, though; I just realized that this is not restricted to our future plans. The current design right now works with Query.jl, but produces incorrect results with any of the other Queryverse sources. Essentially what seems to be happening is that the DataValues conversion in this package only kicks in for Enumerable* types. That won't work: there are many other source types in the Queryverse, in other packages, that also iterate rows that use DataValue, and the conversion will need to apply to them as well. In general, the conversion should really apply to any source that you call getiterator on. Now, once that is fixed and the conversion kicks in for all those sources, I don't think there is any need for taking a dependency on Enumerable anymore in any case: once the conversion works for anything that you start to iterate with getiterator, there is no need to special-case Enumerable anymore, right?

If I look at the current DataFrame constructor that integrates with Tables.jl, I think generally the case of a TableTraits.jl source will have to be handled outside of the istable(x) conditional block: that predicate is not well defined for TableTraits.jl sources, because it fundamentally is not decidable. So I think the proper strategy there is to just handle all that roughly in the block that is here (although the call to Base.isiterable should really be replaced with the construction based on TableTraits.isiterabletable that I showed in that PR at some point).

I feel like that gives us enough time to adapt appropriately while releasing Tables.jl sooner.

So this is a road I don't want to go down. I totally understand if you feel that in general you would want to play with the Tables.jl design over the next couple of months, but the Queryverse is not in that state. TableTraits.jl core is done and has been done for a long time, and I care a great deal about stability and a non-breaking evolution of the packages in the Queryverse. If we can pull off the integration properly at this stage, without hacks or things we know we must adjust in the future, I'm game. But if not, I would prefer that we wait with the integration until things on the Tables.jl side have settled down.

quinnj commented 5 years ago

Ah, excellent thoughts/points. Great to hear you have more resources to help push things forward.

Thanks for taking the time to describe the Enumerable issue; I understand better now, and you're right that that solution obviously wouldn't work for a number of cases. I've improved the integration w/ the PR here. The idea there is that a source type will still "hook in" to the Queryverse by just defining:

IteratorInterfaceExtensions.getiterator(x::MyTable) = Tables.datavaluerows(x)

And similarly, any sink can accept Queryverse sources by using logic similar to what I've done in the DataFrames PR: basically, they just call Tables.DataValueUnwrapper(x) on the result of getiterator. That way we don't have to rely on any kind of "automatic" unwrapping via Enumerable, and each sink can easily adjust by adding that line.

davidanthoff commented 5 years ago

Yeah, I think that makes sense. I also think we are getting close! I left a couple more comments on the DataFrames.jl PR, and I'll run some more tests, and I probably should take a little more time to review it all once more.

davidanthoff commented 5 years ago

I think I need to find another hour to really try to understand the code here, but so far what I'm seeing looks good. But, I'm not fully understanding everything yet, so I would like to make sure I do :) I should have time tomorrow evening to do that.

One thing, though: are @pure annotations like this one here correct? They strike me as wrong. Isn't one of the rules that one is not allowed to call generic functions from a pure function, and especially not if a method might be added to that function? And that seems to be precisely what is happening here: as soon as someone loads DataValues.jl, this method will be added, and I think that makes that @pure annotation incorrect. I'm not super firm on this @pure stuff, though, but that is my understanding.

quinnj commented 5 years ago

Hmmm, I think you're right. Before, the datavaluetype and nondatavaluetype definitions were statically defined in the same place (and no other methods would be added). Let me look into this some more, but we might need a generated function instead.

davidanthoff commented 5 years ago

Alright, let me know when you have a workaround for the @pure issue and then I'll try to review the whole thing in one go!

quinnj commented 5 years ago

I actually think we might be legal w/ our use of @pure still. Here's my line of thinking:

This also comes from my investigation that inference seems to have no actual issues w/ compiling/inferring these methods before & after DataValues.jl is loaded.

So in my mind, I think we should be good; it would be good to have @vtjnash chime in if some of my understanding is incorrect here.

quinnj commented 5 years ago

Closing as fixed for now; as mentioned and as @vtjnash has mentioned in other places, I think our use of @pure is fine for this limited use-case.

davidanthoff commented 5 years ago

That is not how I understood @vtjnash. I understood him to say that you can decorate a function as @pure if you only call functions from it where you can reasonably assume that no methods will be added that would be called by your @pure function. The situation here is quite different, though: you are actually expecting that methods will be added to functions that are called inside your @pure function, and you then want your pure function to call these new methods. That seems a very different situation from what @vtjnash described on Slack.

Now, maybe it does currently work, but that to me looks like relying on very specific compiler behavior in a situation that essentially one should never be in... Can't we solve the same issue with a generated function? It seems really brittle to me for such a low level package to rely on such specific compiler behavior...

vtjnash commented 5 years ago

I talked to Jacob offline, and confirmed that this usage is probably OK with the compiler. Your understanding is correct (even the bit about depending on specific compiler behavior), I just think it'll work out alright for awhile.

quinnj commented 5 years ago

@vtjnash also mentioned that compiler improvements in the 1.x timeframe will make @pure unnecessary, so at that point, we can remove the annotations. The problem with using a generated function, as far as I understand it, is that this use-case is specifically disallowed.

In the body of the generated function you only have access to the types of the arguments – not their values – and any function that was defined before the definition of the generated function.

So while @pure might rely on 1.0-specific compiler behavior, it at least works, and we have assurances from Jameson that we won't even need it in the future.

davidanthoff commented 5 years ago

The problem with using a generated function, as far as I understand it, is that this use-case is specifically disallowed.

I don't see how the language you quote applies to the situation here. What would happen is that the code behind the @require clause would add a method (not a function) to a function that is used inside the body of the generated function. As far as I can tell that works, and I don't see any language in the documentation that would discourage that. The language you quote here says that you can't use a function that wasn't defined earlier, but that doesn't seem to be the case here.