Questions and Roadmap for covidcast on CRAN

statsmaths commented 3 years ago

Following discussion last week, I took a close look at the current R covidcast package. I tried to review it the way I would a submission for the RJournal, JOSS, ROpenSci or equivalent, with the caveat that I know it's still in the works. Overall it seems to run well, be well documented, and has the core functionality I would expect.

Here are some suggestions for how to streamline the package for getting it onto CRAN. I am sure some of these are bad ideas, but though I would err or the side of too many ideas rather than too few:

Standardize all of the = assignments to <-.
The dplyr/tidyr/purrr functions are used in a non-trival way, so I think they are fine even though they introduce a lot of dependencies. But, I recommend removing pipes. Great for interactive work but hard to debug, particularly deep within a package. This should not be a heavy lift.
Clean up the format and number of global variables. Some all ALL_CAPS, others are not. Most seem like they are only used once and could just be defined within the scope of a function.
Try to see if some of the Imports that we are only using one or two function from are really needed. A few seemed easily replacable.
The metadata files are a bit too big and contain a lot of variables. Can we limit these to just a few for the package? Only a few are documented in the package anyway. I don't think they will pass CRAN with the current sizes.
There should be a more complete README.md for the package with a simple worked out example.
The vignettes are nice and very informative, but take far too long to build. Should we replicate them with static data, remove the time-consuming bits, or provide pre-built HTML files to CRAN?
Is there any possibility of keeping the R package in its own repository? This one is huge (as you know) and one-repo-per-package is defintiely the norm. This isn't needed at all for CRAN, and might have other drawback, but just thought I would ask as it would certainly simplify some things, particularly as the project becomes more maintainence rather than active development.
I think the biggest challenge is going to be supporting the plotting, particularly the spatial maps. What's the thought on the extent to which this code and data needs to be in the package? .... My usually preference is to have a package provide well-formated data and produce plots myself (full control and no need to learn a bunch of other plotting APIs), but I know that's not everyones preference. At any rate, the shape files are just too big for CRAN and we need some alternative.

That's all I have for now. Once I have a sense of how to address these, happy to take them on as soon as possible.

capnrefsmmat commented 3 years ago

The metadata files are a bit too big and contain a lot of variables. Can we limit these to just a few for the package? Only a few are documented in the package anyway. I don't think they will pass CRAN with the current sizes.

I suspect we could subset out a lot of the variables in the shapefiles, particularly if there's a third-party Census package that's easy to use, so people could get any other data by using it. It may also be possible to store shapefiles compressed, or convert to another more compact format supported by sf.

The vignettes are nice and very informative, but take far too long to build. Should we replicate them with static data, remove the time-consuming bits, or provide pre-built HTML files to CRAN?

In #266 (only partially merged) I explored some ways to make them faster. 7f7fd89 and 0ebe6ed definitely made the vignettes much faster, at the cost of requiring even more data to be stored in the GitHub repo (for the second one), and may be worth exploring again. The first commit (do state-level correlation analyses instead of county) would be a big benefit.

I think the biggest challenge is going to be supporting the plotting, particularly the spatial maps. What's the thought on the extent to which this code and data needs to be in the package? .... My usually preference is to have a package provide well-formated data and produce plots myself (full control and no need to learn a bunch of other plotting APIs), but I know that's not everyones preference. At any rate, the shape files are just too big for CRAN and we need some alternative.

Ordinarily I'd agree, but I think the process of making choropleths is fiddly enough that including these functions allowed us to make a lot of interesting plots we otherwise wouldn't have. Had I designed the plotting functions initially, I'd have made them much more bare-bones and let the user customize via ggplot instead of having a bunch of arguments, though. The complex structure needed to support many different options has made it very easy to accidentally break the plots.

statsmaths commented 3 years ago

Thanks for the quick reply!

For the vignettes, how about we use the approach that jsonlite uses here of pre-building the vignettes. That avoids needing to include extra data and lets the vignettes continue to serve as an extra layer of tests. It will show up correctly on CRAN and doesn't constrain anyone in the future from adding more functionality, but still runs very fast. I think that would be an easy change if you're okay with it.

I won't push on changing the plotting code, but keeping it does mean that we'll need to figure out how to include the shapefiles in the package. CRAN recommends (and as you know, CRAN recommendations are more like requirements) that the entire package be under 5MB. The current tarball is 15MB. How would you suggest going forward? I can convert the shapefiles to compressed GeoJSON; those are a bit smaller, but unfortunately need to be uncompressed to read into R and collectively the four shape files are still over 5MB just on their own. We could aggresively downsample the resolution, or perhaps provide something like a "covidcast_download_metadata()" function to externally get all of the shape files (benefit being we could include lots of other data there too, but it is an extra step).

capnrefsmmat commented 3 years ago

Pre-building the vignettes sounds fine to me, as long as the size of the generated HTML + PNGs isn't excessive. Probably we should also simplify the vignettes (like doing state-level correlations instead of county) just as a quality-of-life improvement so we're not waiting a long time every time we have to rebuild the vignettes.

I assume the largest shapefiles are the county shapefiles of the US, right? I think there are several things we can try, depending on the reason they're large:

Downsampling is definitely reasonable, since the county shapefiles likely include county boundaries in detail, whereas the closest we ever zoom in is the state level. According to this we can simplify the polygons with a tolerance, and we can try a few tolerances to see how much file size reduction and whether the differences are visible in generated plots.
The shapefile also includes dozens of Census data columns per county, and we could subset that to only two or three (such as the most recent population data) and let people pull and merge other Census data if they want it.

If that cuts down the file size dramatically, I'd be happy there. Evidently usmap manages to include all county shapes in only 2-3MB, so it must be possible.

(And if you're wondering, we don't use usmap because they don't support Puerto Rico, which we have data on)

krivard commented 3 years ago

Is there any convention in R packages for downloading datasets (or shapefiles) separately from installing the package, a la python nltk?

statsmaths commented 3 years ago

I don't know of a standard. The two approaches I am familar with are (i) downloading into the location where the package is installed or (ii) downloading to a location specified by the user, or in the active directory.

(i) is really convenient, but can run into permission issues; it might also technically be frowned on.

bnaras commented 3 years ago

One can create a data-only package (even if it is big) and CRAN will accept it if it is not updated often.

krivard commented 3 years ago

oh interesting, so we could have covidcast with covidcast-maps as a dependency, and just stuff all the shapefiles in there

statsmaths commented 3 years ago

I just wrote a pull request that address the major issues here (see mention above). It passes CRAN checks on my machine. Other than the shapefiles being compressed and the metadata tables having fewer columns, it has the same functionality as the previous version.

Please let me know if there is something I need to do other than posting to get the branch reviewed (I can't remember if there is another process for requesting that).

statsmaths commented 3 years ago

Thanks for all of the feedback. My changes are merged now into r-pkg-devel. Should be possible to upload to CRAN (though always good to check in WinBuilder for any Windows issues) if there is nothing major that should be added/updated first.

capnrefsmmat commented 3 years ago

I just created a milestone for the changes I think would be good to get before we submit to CRAN, and if we can get through those, I think I'd be happy to push an initial CRAN version and go from there.

statsmaths commented 3 years ago

@capnrefsmmat ~~The new branch that was merged into r-pkg-devel breaks R CMD CHECK.~~ Actually, that might have been my fault (it breaks if you actually run the "make.R" file to create the data, but not in its current state).

But still, would it make sense to add R CMD CHECK into the CI tests so that something does not break it? I think the check currently only runs the testthat functions.

capnrefsmmat commented 3 years ago

I think the CI should be doing R CMD CHECK:

https://github.com/cmu-delphi/covidcast/blob/edf101b4beb23b9a29acda0e1c76a49b8c679bd5/.github/workflows/r_covidcast_ci.yml#L51

If there's a case where it's not working right, or maybe if we want to convert warnings to errors so we don't reintroduce CRAN issues, we can try adjusting that line.

statsmaths commented 3 years ago

Oh yes, that makes more sense why I was still seeing local issues despite passing the tests. Yes, maybe elevate warnings to errors? I will also try to make a habit of checking the testing logs to NOTES (which can be banal like "New package", or actually quite difficult to fix, like committing large files).

capnrefsmmat commented 3 years ago

I think a PR to set error_on = "warning" would be welcome. We couldn't do it before because we had lingering warnings, but now that they're supposed to be gone, we can try to prevent them.

capnrefsmmat commented 3 years ago

@statsmaths Are you aware of any other issues we should resolve before submitting to CRAN, or shall we go ahead and do it? It looks like all the issues we listed as blockers are fixed. I'm not sure how the submission process works -- can multiple people have access to submit and make new versions, or should one person be responsible for all CRAN stuff?

statsmaths commented 3 years ago

No other blocking issues that I am aware of. I think it's ready for CRAN unless you know of anything I don't.

There needs to a single maintainer that CRAN can contact, though there can be any number of authors and others can switch out as maintainer with each release if needed.

cmu-delphi / covidcast

Questions and Roadmap for covidcast on CRAN #470