JuliaData / DataTables.jl

(DEPRECATED) A rewrite of DataFrames.jl based on Nullable
Other
29 stars 11 forks source link

Drop NullableArrays in favor of Nulls #66

Open quinnj opened 7 years ago

quinnj commented 7 years ago

Travis won't pass here yet because Nulls.jl isn't registered and we'll need a CategoricalArrays tag as well, but they pass for me locally. Wanted to put this up to start getting eyes on it and feedback on design.

quinnj commented 7 years ago

Oh, other piece I still need to work on is DataStreams; I'm doing some other improvements along with porting to Nulls, so I'll try to get that piece ready soon.

cjprybol commented 7 years ago

This looks great so far!

quinnj commented 7 years ago

Alright, did another pass to cleanup the uses of Vector as well as isequal => ==. It certainly helps readability a ton to just be able to do == everywhere. @nalimilan, what are your thoughts on CategoricalArrays? I registered Nulls.jl, so it can be included in REQUIRE now. Are we ready to start pulling the trigger here? Here's my thoughts:

Then I can work on porting the rest of the DataStreams packages over.

nalimilan commented 7 years ago

CategoricalArrays will probably need some more work, in particular to remove NullableCategoricalArray. I need to have another look. Then I'm OK with your plan. We should also tag versions right before dropping Nullable support.

How confident are you about the replacement of NullableArray calls in tests?

@cjprybol Could you check that the changes to the tests you added recently are correct? That will be a good example of possible issues, which is still fresh in our minds.

rofinn commented 7 years ago

Should we try and get this working so we can backport it to DataFrames.jl? I'm guessing @quinnj is pretty busy right now, but I'm not sure what the protocol is for other people fixing up old PRs. I'm guessing #83 still needs to be finished up and merged into this branch first?

cjprybol commented 7 years ago

I think #83 is done unless anyone else has comments they'd like to make on it. #83 won't pass tests until the two PRs in https://github.com/JuliaData/CategoricalArrays.jl/pull/70 and https://github.com/JuliaData/DataStreams.jl/pull/36 are merged and tagged as versions but it does pass for me locally when I have those branches checked out. We could help out with tracking down the remaining test failures in https://github.com/JuliaData/DataStreams.jl/pull/36 and https://github.com/JuliaData/CategoricalArrays.jl/pull/70 to speed the process along. It looks like https://github.com/JuliaData/DataStreams.jl/pull/36 is waiting for NamedTuples to get merged into Base in https://github.com/JuliaLang/julia/pull/22194, and that still has one remaining test failure as well (but otherwise looks very close to completion).

rofinn commented 7 years ago

Hmmm, it'd be nice if the DataStreams.jl PR supported 0.6. I'll look into that.

quinnj commented 7 years ago

I'm aiming to do a 0.6 DataStreams release this week (optimistically)

nalimilan commented 7 years ago

Cool! So what remains to do here? There are a few unadressed comments above, but except that, are we OK (once dependencies will be merged and REQUIRES updated)?

cjprybol commented 7 years ago

I just noted everything else I think needs to be checked or changed, I'll push updates to address those. After that, yes, I think it's just waiting for METADATA.jl commits to merge on the dependencies that have switched to Nulls.jl and then setting those versions as the minimum versions here in REQUIRE. WeakRefStrings has been tagged with a version, DataStreams is in progress, and I'll take care of CSV as mentioned in https://github.com/JuliaData/CSV.jl/pull/94. @nalimilan have you tagged a new version of CategoricalArrays?

nalimilan commented 7 years ago

Great, thanks! I'm waiting for https://github.com/JuliaLang/METADATA.jl/pull/10915 before tagging CategoricalArrays.jl, to avoid conflicts when adding upper bounds to DataTables.

cjprybol commented 7 years ago

That should take care of all of the comments so far.

Still todo:

quinnj commented 7 years ago

I'm planning on doing the first point this week (if we can ever get the new DataStreams tag out 😲 ).

quinnj commented 7 years ago

Ok, pushed the updated DataStreams code to this branch, along with a commit addressing @nalimilan's comments from the other PR.

nalimilan commented 7 years ago

Thanks. I think we're good now, just need to wait until the dependencies are tagged.

quinnj commented 7 years ago

Ok, so we're waiting on some tags here, but is the idea to merge this, then merge/port it over to DataFrames directly?

nalimilan commented 7 years ago

Better not merge this in DataTables at all, that would break users' code for no good reason. Let's backport it directly to DataFrames.

quinnj commented 7 years ago

Ah, cool. Sounds good.

quinnj commented 7 years ago

Hmmm, so if we cherry-pick this branch over to DataFrames, I'm wondering how tagging will go.

It seems like there'd be a circular dependency between CSV (having DataFrame as the "default" sink to stream to) and DataFrames (using CSV for IO). Any suggestions? Should we just plan on tagging them together?

ararslan commented 7 years ago

Has it been long enough that we can rip the Band-Aid off in DataFrames and remove the CSV I/O functionality in favor of just using the CSV package on its own? Otherwise I'm cool with tagging them together.

quinnj commented 7 years ago

I'm fine w/ keeping CSV in DataFrames since I have plans of removing DataFrame as the default in CSV (in favor of a bare-bones NamedTuple table type).

ararslan commented 7 years ago

I have plans of removing DataFrame as the default in CSV (in favor of a bare-bones NamedTuple table type).

I was hoping that DataFrame would become that bare-bones NamedTuple-based table type. That's really all it needs to be; the rest of the stuff in the DataFrames package can be separated into other, more specific packages.

rofinn commented 7 years ago

@quinnj Does CSV even need a "table" type or would it be able to get away with just returning an iterable of NamedTuples? FWIW, I'm fine with keeping the IO code in DataFrames.jl for now, we can always move it into DataFramesIO.jl package when we're ready to break things up.

davidanthoff commented 7 years ago

@rofinn CSVFiles.jl returns an iterator of named tuples.

nalimilan commented 7 years ago

I agree with @ararslan: better define DataFrame in a small DataFramesBase package if you're worried about dependencies, rather than creating a new competing "lightweight" table type. We badly need standardization of the ecosystem, not adding even more new competing types.

Concretely, is tagging both CSV.jl and DataFrames in the same PR an issue?

quinnj commented 7 years ago

the only issue is that neither tests will pass on travis because each needs the latest code. but I have both passing tests locally and we can do some more rigorous local testing before we do an official tag of both. unless that will be a non-starter for @tkelman

ararslan commented 7 years ago

Each needing the latest code and tagging both at once is a pretty common scenario on METADATA. That should be fine.

quinnj commented 7 years ago

cool. @rofinn, @nalimilan, @cjprybol , are any of you planning on pulling this over to the big DataFrame PR? Or would you like me to do it?

tkelman commented 7 years ago

Is one a test dep of the other? How would the requirements on one another be expressed?

quinnj commented 7 years ago

CSV:

julia 0.6
...
DataFrames 0.11.0 # or whatever the new tag will be

DataFrames

julia 0.6
...
CSV 0.2.0 # or whatever the new tag will be

not test deps, actual deps in the main REQUIRES files.

tkelman commented 7 years ago

They both import each other, it's a circular dependency? That's a little strange.

quinnj commented 7 years ago

Yeah,

Maybe there's some way to hack around the "CSV needs DataFrames" requirement. As has been mentioned, one solution would be to split the actual DataFrame type definition into it's own DataFrameType package, which both CSV & DataFrames could then depend on. Seems a bit fiddly, but would get the job done if it's too much of an issue to have circular package dependencies.

nalimilan commented 7 years ago

To be clear, we only need CSV.jl so that the readtable deprecation provides a working replacement for a few releases, right? That's not supposed to be permanent?

EDIT: Actually, we don't even provide a working deprecation for readtable. So why do we need CSV.jl for (except tests)?

quinnj commented 7 years ago

I know at some point we discussed breaking up DataFrames into subpackages and that DataFrames would become a sort of meta-package that included a bunch of sub-related things (DataFrameIO, StatsModels, TableOperations, DataFrameType, etc.). As mentioned above, that would be a way to solve the circular dependency problem w/ CSV, but DataFrames the meta package could still rely on CSV for IO (since most people would expect to be able to do IO out of the box).

nalimilan commented 7 years ago

Yeah, but for now we don't seem to use CSV.jl, so let's just go ahead without it as a first step?

rofinn commented 7 years ago

I'd also support leaving CSV.jl dependency out of DataFrames for now. It would help us get things merged into master with passing tests (at least on 0.6) and might make the breakup process slightly easier.