We need the origin story for the CRANmetadata object

jennybc commented 9 years ago

I can't find the code that creates the CRANmetadata object that ships with the package. I think we should have a script in inst/ with that code. Ditto for CRANpractices, I imagine.

I'm puzzled because the CRANmetadata that I got from GitHub is an unnamed list, whereas download_packages_metadata() seems to return a named list. Small local experiments confirm this. Did CRANmetadata perhaps get created with an earlier version of this function?

Shall I work on this? Am I right that CRANmetadata.rda should be created with update_CRAN_metadata()?

dgrtwo commented 9 years ago

You are right, this must have been from an earlier version! It deserved to be updated in any case and I've done so.

I'm updating CRANpractices as well, though that may take a little longer since I have to download all sources. I'll improve the documentation to make the origin story clearer as well: I think the ideal, rather than a script, is to have each object have the code that creates it in the "example". Perhaps also a function along the lines of "update_all".

jennybc commented 9 years ago

Thanks @dgrtwo for updating the CRANmetadata object!

Re: where to put the origin story. This package is an odd hybrid of R package w/ functions and a data package. My suggestion to save a script somewhere was based on the latter mindset (though I mis-remembered the recommended "where"). I was thinking of Hadley's advice from R packages | Data | Exported data:

Often, the data you include in data/ is a cleaned up version of raw data you’ve gathered from elsewhere. I highly recommend taking the time to include the code used to do this in the source version of your package. This will make it easy for you to update or reproduce your version of the data. I suggest that you put this code in data-raw/. You don’t need it in the bundled version of your package, so also add it to .Rbuildignore. Do all this in one step with:
devtools::use_data_raw()
You can see this approach in practice in some of my recent data packages. I’ve been creating these as packages because the data will rarely change, and because multiple packages can then use them for examples ….

I don't have a strong objection to your suggestion to put the code as an example, but also submit the above for consideration. We could put the code in both places, because it won't be huge.

In the help file for update_CRAN_practices(), there's advice to "make sure you've re-built the package if you've updated that dataset". I think we should give the actual code to do that while we're at it.

Was also thinking it might be nice to create an S3 class for the CRANmetadata object, so we can create a decent print method.

I'm happy to work on any of this. Or can channel my efforts on data analysis and visualization. You both have a head start on the package-y parts ...

Ironholds commented 9 years ago

Hum. I have actually no preference. My efforts are mostly going to be focused on the prose over the next few days - getting the introduction, methodology and background bits up to scratch.

Given the first blocker is the useR presentation, I'd suggest focusing efforts on analysis and visualisation? We can hammer on the package design itself prior to the JSS/R Journal submission (this may also be before useR, but this way we know we have everything we need for useR done)

jennybc commented 9 years ago

Yeah I'm starting to analyze and visualize. It's like CHRISTMAS.

But I want to look at things over time, so I think I'm going to have to go beyond the two datasets already sitting there. What do you say we augment CRANpractices? I'm thinking date of first submission, date of most recent, and total number of versions.

Ironholds commented 9 years ago

I like this plan! I thought we'd grabbed first and last; my bad :(

jennybc commented 9 years ago

The version I'm working with appears to have no info on release dates or total:

> glimpse(CRANpractices)
Observations: 6551
Variables:
$ package          (chr) "A3", "abc", "ABCanalysis", "abcdeFBA", "ABCExtremes",...
$ casing           (chr) "Upper", "Lower", "Mixed", "Mixed", "Mixed", "Mixed", ...
$ alphanumeric     (lgl) TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ...
$ license          (chr) "GPL (>= 2)", "GPL (>= 3)", "GPL-3", "GPL-2", "GPL-2",...
$ links_to         (list) R, xtable, pbapply, randomForest, e1071, R, nnet, qua...
$ upstream_repo    (chr) "None/Other", "None/Other", "None/Other", "None/Other"...
$ versioning       (lgl) TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, F...
$ testing          (chr) "None", "None", "None", "None", "None", "None", "None"...
$ roxygen          (lgl) FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
$ changelog        (lgl) TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
$ vignette_format  (chr) "None", "LaTeX", "None", "None", "None", "None", "None...
$ vignette_builder (chr) "None", "Sweave", "None", "None", "None", "None", "Non...
$ downloads        (int) 921, 1002, 595, 561, 571, 549, 585, 611, 940, 547, 232...
$ links_from       (dbl) 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 66, 0, 0, 0, 1, 0, 0, 0,…

Ironholds commented 9 years ago

Oh damn, that's weird. Is it in the raw metadata? If so we don't have to re-query again, we can just patch the code and regenerate off the existing object.

jennybc commented 9 years ago

Oh sure it can be recovered from CRANmetadata. Here's the component for the first package:

> str(CRANmetadata[[1]], max.level = 2)
List of 19
 $ Package         : chr "A3"
 $ Type            : chr "Package"
 $ Title           : chr "A3: Accurate, Adaptable, and Accessible Error Metrics for<U+000a>Predictive Models"
 $ Version         : chr "0.9.2"
 $ Date            : chr "2013-03-24"
 $ Author          : chr "Scott Fortmann-Roe"
 $ Maintainer      : chr "Scott Fortmann-Roe <scottfr@berkeley.edu>"
 $ Description     : chr "This package supplies tools for tabulating and analyzing<U+000a>the results of predictive models. The methods employed are<U+00"| __truncated__
 $ License         : chr "GPL (>= 2)"
 $ Depends         :List of 3
  ..$ R      : chr ">= 2.15.0"
  ..$ xtable : chr "*"
  ..$ pbapply: chr "*"
 $ Suggests        :List of 2
  ..$ randomForest: chr "*"
  ..$ e1071       : chr "*"
 $ Packaged        : chr "2013-03-26 18:58:12 UTC; scott"
 $ Repository      : chr "CRAN"
 $ Date/Publication: chr "2013-03-26 19:58:40"
 $ crandb_file_date: chr "2013-03-26 14:58:40"
 $ NeedsCompilation: chr "no"
 $ date            : chr "2013-03-26T19:58:40+00:00"
 $ releases        :List of 6
  ..$ : chr "3.0.0"
  ..$ : chr "3.0.1"
  ..$ : chr "3.0.2"
  ..$ : chr "3.0.3"
  ..$ : chr "3.1.0"
  ..$ : chr "3.1.1"
 $ retrieved       : POSIXct[1:1], format: "2015-06-02 11:19:02"

I guess this doesn't contain the original CRAN publication date, but it provides most recent and the number of releases.

Ironholds commented 9 years ago

Oh darn :(

jennybc commented 9 years ago

@dgrtwo Are you going to regenerate CRANpractices anyway? If so, can we add some of this temporal information?

dgrtwo commented 9 years ago

Yes, absolutely, setting up the whole thing tonight. Actually I have code somewhere for grabbing the dates that I never pushed....

I had missed Hadley's Exported Data advice! WWHD applies as usual, so the use_data_raw approach sounds good.

jennybc commented 9 years ago

Also, can you add the date of download as an attribute or something? I.e. embed it in the object?

UPDATED: To be clear, I'm talking about annotating CRANmetadata and CRANpractices with the date the data was obtained from the CRAN API. If we do create a class for these, which I still think a good idea at some point, this is good info to be able to query, print, etc.

jennybc commented 9 years ago

I can't find any prose about this but from code, I gather that our download data is for a 30 day period, extending backward from the date of package metadata and source code download:

https://github.com/Ironholds/practice/blob/3fed3e828334fc2424c2abb82ea17ea5247519e6/R/check_practices.R#L77-L78

I wonder if we should capture a different time period -- a calendar year? -- to be less vulnerable to the rhythms of the R release schedule? A single month seems subject to idiosyncrasies due to where it falls relative to a release.

jennybc commented 9 years ago

I just downloaded metadata for all version of all packages:

https://github.com/metacran/crandb/issues/35

It is not onerous and should probably be the default method of creating CRANmetadata. Does anyone care to say something about how I should integrate this into the package? Otherwise I'll do something that seems smart to me and just make a PR.

I'm still not touching the source download and re-construction of CRANpractices, as I've got plenty to do and it sounded like @dgrtwo might be working on that.

Ironholds commented 9 years ago

Awesome!

...I seem to have lost a lot of time today with some sort of anti-harassment nonsense. I'll try to get to the test hand-coding tomorrow; sorry about this :/

Ironholds commented 9 years ago

Hand-coding complete!

as this shows, of the 50 "testless" packages randomly sampled, only 3 had any form of tests, all using completely different forms - one manually triggered, one requiring a custom rebuild of the package, and one existing outside of the distributed package itself.

Additionally, several packages (due to the need for examples to run quickly on CRAN) simply wrapped all examples in dontrun, undermining the ability to use examples entries as backup.

karthik commented 9 years ago

Additionally, several packages (due to the need for examples to run quickly on CRAN) simply wrapped all examples in dontrun, undermining the ability to use examples entries as backup.

Ah, the curse of CRAN!

Ironholds commented 9 years ago

We complain about BDR when we should be complaining about CPU!

jennybc commented 9 years ago

@Ironholds thanks for taking one for the team there. That will be a useful part of the story. But not a happy part.

karthik commented 9 years ago

It is such a shame that examples on CRAN cannot run for more than 5 seconds. In this day and age that seems unreasonably short, and not quite the bottleneck (much of this can be distributed) for CRAN. All it does is force packages to come up with useless examples or wrap everything in dontrun.

One idea: It would be good to see how many packages have httr/RCurl etc in Imports, and also make soem call to an API that have all their examples wrapped in dontrun. That would provide some insight into the challenges of having API packages on CRAN. It's not that they are entirely devoid of tests, they are just unable to run because of arbitrary CRAN rules. Happy to do this in a PR if it's ok with you folks.

jennybc commented 9 years ago

@karthik I'd be interested in having functions and/or data added around that question.

Ironically this package itself has a lot of \dontrun{}.

Ironholds commented 9 years ago

+1; hella-interested.

And yeah, but this package takes literally 30 minutes to run examples, so.. ;p

Ironholds commented 9 years ago

@jennybc @dgrtwo I've added some functions to grab dates and integrated them with check_practices; regenerating the metadata now!

Ironholds commented 9 years ago

It's amazing how much faster this goes when I'm not hogging the wifi at a hackathon

Ironholds commented 9 years ago

@dgrtwo need to reload the metadata to incorporate the date extraction, but I don't understand how it's generated; get_package_metadata produces a different-looking object. Help!

dgrtwo commented 9 years ago

@Ironholds Sorry I somehow missed this!

I'm not sure I can reproduce your problem: get_package_metadata produces an object that seems to me to have the same structure:

> new_metadata <- get_package_metadata()
> head(names(CRANmetadata))
[1] "A3"          "aaMI"        "abc"         "abc.data"    "ABCanalysis" "abcdeFBA"   
> head(names(new_metadata))
[1] "A3"          "aaMI"        "abbyyR"      "abc"         "abc.data"    "ABCanalysis"
> head(names(CRANmetadata[[1]]))
[1] "_id"      "_rev"     "name"     "versions" "timeline" "latest"  
> names(CRANmetadata[[1]])
[1] "_id"      "_rev"     "name"     "versions" "timeline" "latest"   "title"    "archived"
> names(new_metadata[[1]])
[1] "_id"      "_rev"     "name"     "versions" "timeline" "latest"   "title"    "archived"
> names(new_metadata[[1]]$versions)
[1] "0.9.1" "0.9.2"
> names(CRANmetadata[[1]]$versions)
[1] "0.9.1" "0.9.2"

update_CRAN_metadata works for me as well, saving the new version. Are you getting different behavior, or am I missing something?

Ironholds commented 9 years ago

Well, this is weird; now it works for me. Previously it didn't.

...what?! Goddammit brain ;p

Ironholds / practice

We need the origin story for the CRANmetadata object #3