JuliaCollections / DataStructures.jl

Julia implementation of Data structures
https://juliacollections.github.io/DataStructures.jl/latest/
MIT License
691 stars 246 forks source link

Split up the package? #310

Open ararslan opened 7 years ago

ararslan commented 7 years ago

This package has become quite large and is something of a heavyweight dependency for packages that just need a single type. An example of that is StatsBase, which only uses DataStructures for the arrays as heaps stuff. Indeed, just going through and trying to fix the deprecation warnings in this package for more recent Julia versions has proved quite cumbersome. It seems to me that for the sake of maintainability, the easily separable pieces of this package could be split off into separate packages, say like Heaps.jl, Deques.jl, etc. That way packages can just pull in the functionality they need. Thoughts?

yuyichao commented 7 years ago

just going through and trying to fix the deprecation warnings in this package for more recent Julia versions has proved quite cumbersome

What about it is cumbersome? Splitting part that doesn't depend on anything else might be ok though I don't think depwarn fix is a good reason for it and I don't see how splitting will help there. The depwarn fix now needs to be down on multiple packages rather than one.

ararslan commented 7 years ago

That wasn't really an argument for splitting up so much as just a comment on the fact that it's big enough that fixing depwarns is annoying to do all at once.

TotalVerb commented 7 years ago

I also think this package is unnecessarily large. It can reasonably be changed to simply re-export the exports of smaller packages which become dependencies, to maintain the current convenience while allowing more modular development

ararslan commented 7 years ago

Exactly what I was thinking, @TotalVerb.

oxinabox commented 7 years ago

Yes, please lets split it up. It is so large. I keep wanting to fix things, or refactor them, etc. But the weight of all the files gets me down. There is a lot of cruft in the code. It has good test coverage, so refactoring and breaking down should be safe. (Can keep a copy of all the tests in DataStructures.jl and then run them using the reexported ones until happy)

Breaking it into separate repos, will make it easier to decruft it. One thing that can be noted is that it really does break up quiet distinctly Some parts were clearly written by different people, who (reasonably enough) did not touch the other parts.

Like All the Sorted stuff (see breakdown in next post) was written by @StephenVavasis . All the Heap stuff was extracted from Base.

There are basically 5 packages living in here, all with there own core author list. (and then the good work of later people maintaining them all)

I spent an hour or so checking and breaking down a potential split of the package. I'ma put the list in a separate post.

oxinabox commented 7 years ago

This is just the files in /src/

Heaps.jl

Anything to do with heaps. Lightweight package for people who just want the old priorityqueue from base, or other heap related functions.

SortedDataStructures.jl

Its basically data structures that have optimised performance because they built on top of search-trees. They should be in there own package as they take their own set of expertise to maintain. Even though they overlap with the other packages I propose that are interface driven.

SetDataStructures.jl

They all act like sets. (Fast element of-checking) Many of the packages in this say that they will be better once AbstractSet is in base. https://github.com/JuliaLang/julia/issues/5533 So they all probably need a fair bit of cleanup now

AssociativeDataStructures.jl

LinearDataStructures.jl

These 5 packages are basically not interdependent.

ararslan commented 7 years ago

Good ideas. I'm not sold on the specific names but otherwise +1.

kmsquire commented 7 years ago

These 5 packages are basically not interdependent.

Mostly true, but note that an OrderedSet (listed in SetDataStructures.jl) uses the same underlying implementation as an OrderedDict (listed in AssociativeDataStructures.jl)

oxinabox commented 7 years ago

Mostly true, but note that an OrderedSet (listed in SetDataStructures.jl) uses the same underlying implementation as an OrderedDict (listed in AssociativeDataStructures.jl)

Ah yes, I missed that one. Maybe merge AssociativeDataStructures.jl and SetDataStructures.jl, and call it "MembershipDataStructures.jl"

@ararslan and I had a discussion on specific names the other day. and concluder better names were *Collections.jl

PallHaraldsson commented 7 years ago

Pro for splitting may be maintainablility, but since not in Base, maybe for discorverability, having all the main data-structures in one place is good.

Just one question on it. For using (e.g. precompiling) is the size slowing down? I mean does Julia need to compile more than you actually use of a package? If it does now, can it be deferred in a future version of Julia?

oxinabox commented 7 years ago

Just one question on it. For using (e.g. precompiling) is the size slowing down? I mean does Julia need to compile more than you actually use of a package? If it dos now, can it be deferred in a future version of Julia?

Yes, and probably? I mean turing complete languages can do anything. But likely/easy? Less clear.

oxinabox commented 7 years ago

Ok, here is my splitting plan:

Split according to the splits covered by https://github.com/JuliaCollections/DataStructures.jl/issues/310#issuecomment-338131309 except merge Associative and Set into a single Membership package. using the names *Collections.jl

DataStructures.jl will remain as a MetaPackage that does all of them.

So the actual concrete plan: Make 5 branches on this Remote, under split/subname (and split/combined for the new metapackage). Use git history rewriting to delete all the files not (and prune the empty commits) related to the the subpackage in question. That will speed up installs, even using Pkg2 of the new subpackages. Not doing that will lead the the suffering as per https://github.com/JuliaDiffEq/DifferentialEquations.jl/issues/107. It can't be done easily for the combined package under that current name (see aforementioned suffering thread) The subpackage branchs are easy enough, they get there tests poked at and made sure they are passing CI here.

The branch with the combined meta package is a bit harder. while setting stuff up (i.e. when ever-thing is just branches) it will need a custom build script to run the existing tests and by doing a Pkg.clone.("DataStructures", ["split/...", "split/...",...]). Still shouldn't be too hard.

Then once that is all set up, everything can cruise along for a little while. PRs can be merged both into current (real) master, and also merged off into the branchs.

Once we are happy all is well, we make one last tag of DataStructures.jl 0.x.y,

Then we push all the split branches of to their own remotes, and register them. Once they are registered, we push the split/combined to DataStructures.jl master, and we tag DataStructures.jl 1.0.0

We close all issues and PRs here. and open links for the ones we care about in the new repos. ( @ararslan do you have a script for that? Something similar was done for SpecialFunctions.jl)

We then delete all the tests in DataStructures.jl, and maybe tag 1.0.1. Which might make any shallow cloning Pkg3 based method able to install DataStructures.jl lightning fast, sine it will then be a <1kb repo.

Also somewhere in this plan needs to be sorting out docs. I guess they should be split and DataStructures.jl's docs should basically (or even literally) just link to the subpackages docs.

So that is the plan. Thoughts?

(@vanbujm this was the complex git game I was talking to you about a few weeks back)

oxinabox commented 6 years ago

This is waiting on #348 and #347 as those are the big PRs that are open at the moment that affect files that would be in more than one package after the split.

s-celles commented 6 years ago

I wonder if splitting package shouldn't be done after ensuring that doctests are really run and are passing. See #345 and #361

oxinabox commented 6 years ago

No. I disagree. There will always be "one last thing". Right now there are no active PRs that are being worked on that hit files in multiple splits.

If it comes to it we can always move commits around later via the magic of git. It is particular easy to do that if all the PRs are such that they actuallychan just be replayed onto one of the other split repos.

oxinabox commented 6 years ago

The first split package Heaps.jl is up at https://github.com/JuliaCollections/Heaps.jl (locally for me it is a branch of DataStructures.jl) I've not made a branch of DataStructures that actually depends on it anywhere yet, but as a stand-alone package it is functional.

I am not porting anything deprecated across.

It is now clear to me we will either need a DataStructuresBase.jl or to do manual method merging shenanigans in DataStructures.jl. However either of those can happen later.

For my own reference the commands I need to remove things out of git history and compact it are:

git filter-branch --prune-empty --force --index-filter 'git rm --cached --ignore-unmatch  LISTFILESHERE' 

git filter-branch --prune-empty --parent-filter 'sed "s/-p //g" | xargs -r git show-branch --independent | sed "s/\</-p /g"'
rofinn commented 6 years ago

FWIW, I'd like to use a Trie for Memento.jl, but I don't really want to depend on DataStructures.jl. Would someone be willing to split that out into a separate package for me to work on (I'm guessing folks would like to keep it in the JuliaCollections org)?

iamed2 commented 6 years ago

@rofinn I can create that and add you to it

rofinn commented 6 years ago

Thanks

oxinabox commented 6 years ago

@rofinn sorry that I didn't get on with and finish the split. Got back to uni and have been in solid deadlines ever since.

Do you think Tries are good separate from other MembershipCollections? They might be, they do have other close relatives like suffex trees, and finite state acceptors.

rofinn commented 6 years ago

Would those implementations likely need to share code or define some common abstract parent type? IMHO, folks usually just want a specific data structure for their use case, so I'd only be in favour of that if it promoted some common API or avoided code duplication.

milesfrain commented 5 years ago

Another motivating reason for splitting packages is to save time on benchmarking. For example, SparseIntSet #533 should not need to re-run Heap benchmarks. These wasted cycles will compound as more packages adopt benchmarking. @oxinabox I'm happy to help with this reorganization effort. Is a good next step to create the necessary PRs to bring Heaps.jl up to date?

oxinabox commented 5 years ago

No Heaps.jl should just be recreated. It is easier to just redo the history slicing. I will delete that repo.

However, I have been thinking about this a bit more and I am no longer sure it is worth splitting up. Pkg3 makes this much faster than insrallin used to be, which was part of the original goal. Another part of the original goal was to have smaller chunks to try and document/update old cruft, but that is mostly done now.

milesfrain commented 5 years ago

That sounds reasonable. Shall we close this issue?

Regarding the benchmarking concern, we can do the following:

It would be really cool if PkgBenchmark could select the benchmarks to run based on code changes, but I don't believe that capability is on the roadmap.