JuliaStats / RDatasets.jl

Julia package for loading many of the data sets available in R
GNU General Public License v3.0
160 stars 56 forks source link

Merge in 90 new datasets #41

Closed randyzwitch closed 6 years ago

randyzwitch commented 8 years ago

(Reverts johnmyleswhite/RDatasets.jl#40. Sorry for the noise of reverting a reverted pull request.)

This effectively re-opens the pull request, so that it can be commented on before merging to master.

randyzwitch commented 8 years ago

@johnmyleswhite was there something else you wanted me to do before merging this? You had some concerns when I merged it the first time, but you didn't say what they were

johnmyleswhite commented 8 years ago

My concern with this change was that it's a substantive design change and that it adds a lot more content to a package that already takes a very long time to download. I think it would be good to get more feedback on that kind of design decision before merging.

In this case, one unique kind of problem is that reverting the changes in Git doesn't actually purge the history of the extra data load, so the package becomes heavier even if you don't gain any functionality. That's a special reason for a change like this to be more vetted: it's actually completely permanent and version control exacerbates the problem rather than improves it.

The other concern I had was that we generally only merge squashed changes. I'm going to use GitHub's new merge control functionality to ensure that we only bring in squashed changes.

randyzwitch commented 8 years ago

My apologies if you consider this repo to be irrevocably ruined. It was certainly not my intention to do anything other than make more datasets available.

If you'd like, please feel free to remove me as a contributor, as it seems like due to the package weight concerns there really shouldn't be any additional need for updates/maintenance. And certainly feel free to reject the PR as well, though from your description, it sounds like it's already a permanent part of history.

On Apr 30, 2016, at 11:32 AM, John Myles White notifications@github.com wrote:

My concern with this change was that it's a substantive design change and that it adds a lot more content to a package that already takes a very long time to download. I think it would be good to get more feedback on that kind of design decision before merging.

In this case, one unique kind of problem is that reverting the changes in Git doesn't actually purge the history of the extra data load, so the package becomes heavier even if you don't gain any functionality. That's a special reason for a change like this to be more vetted: it's actually completely permanent and version control exacerbates the problem rather than improves it.

The other concern I had was that we generally only merge squashed changes. I'm going to use GitHub's new merge control functionality to ensure that we only bring in squashed changes.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub

tkelman commented 8 years ago

If this hasn't been tagged yet, it is safe to rebase out the offending content, and I believe only people who've pulled master would notice.

StoneCypher commented 8 years ago

for the record, i don't find the increased size to be all that problematic. then again my use cases are strange

shashi commented 7 years ago

Well now that the commits containing the PR was reverted and both those commits have been tagged, the files will be in git history forever. Now if this PR contains the exact same files, Git they will be blobs with the same content and hence name (SHA1) so merging this won't bloat the package anymore. Might as well make the new datasets available. I also think that download size is generally not a big problem. If I am guessing correctly, having more datasets should not affect package load time, right?

randyzwitch commented 6 years ago

Going to close/re-open to trigger CI

randyzwitch commented 6 years ago

Still passes 0.4 and 0.5, fails on 0.6 and nightly as expected.

Since this has been committed to the repo anyway, it feels like these datasets should be merged in as a patch version. If anyone has strong concerns, please voice them, otherwise I will use votes on this message as my guiding principle (positive votes merge, negative don't merge and close)

StoneCypher commented 6 years ago

seven months later, a positive vote has emerged