gesistsa / rang

🐶 (Re)constructing R computational environments
https://gesistsa.github.io/rang/
GNU General Public License v3.0

Add native support for Github packages #22

Closed chainsawriot closed 1 year ago

chainsawriot commented 1 year ago
resolve(pkgs = gh("schochastics/netUtils@5e2f3ab534"), snapshot_date = '2020-08-26')

Or, even better, to find the closest commit to snapshot_date. But gran needs to be able to emit a gran object for a GitHub package.

schochastics commented 1 year ago

I think it is best to find the closest commit to snapshot_date automatically, because not everybody will know what these random letters/numbers mean and where to get them. Here is a suggestion to obtain the closest commit sha via gh:

get_sha <- function(repo, date) {
  k <- 1
  repeat {
    commits <- gh::gh(paste0("GET /repos/", repo, "/commits"), per_page = 100, page = k)
    dates <- sapply(commits, function(x) x$commit$committer$date)
    idx <- which(dates <= date)[1]  # commits are newest first; NA if none on this page
    if (!is.na(idx)) break
    k <- k + 1  # no commit old enough on this page; fetch the next one
  }
  commits[[idx]]$sha
}
repo <- "schochastics/netUtils"
date <- as.Date("2020-08-26")
get_sha(repo,date)
#> [1] "5e2f3ab53452f140312689da02d871ad58a96867"

Created on 2023-02-07 with reprex v2.0.2

The paging is probably still error-prone, though.
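Alternatively, the client-side paging could be avoided entirely: the GitHub commits API accepts an `until` parameter, so the newest commit at or before the snapshot date can be requested directly. A sketch, not existing gran code; the helper names are made up and the `until` behaviour should be double-checked against the API docs:

```r
## Sketch: let the GitHub API do the date filtering via `until`
## (helper names are hypothetical)
build_commit_query <- function(repo, date) {
  list(endpoint = paste0("GET /repos/", repo, "/commits"),
       until = paste0(as.character(as.Date(date)), "T23:59:59Z"),
       per_page = 1)
}

get_sha_until <- function(repo, date) {
  q <- build_commit_query(repo, date)
  commits <- gh::gh(q$endpoint, until = q$until, per_page = q$per_page)
  commits[[1]]$sha  # newest commit not later than `date`
}
```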

schochastics commented 1 year ago

Just found pkgdepends. It might be helpful to get dependencies of GitHub-only packages?

library(pkgdepends)
pd <- new_pkg_deps("schochastics/levelnet@775cf5e")
pd$solve()
#> ! Using bundled GitHub PAT. Please add your own PAT using `gitcreds::gitcreds_set()`.
#> ℹ Loading metadata database
#> ✔ Loading metadata database ... done
#> 
pd$draw()
#> schochastics/levelnet@775cf5e 0.5.0 [new][bld][cmp][dl] (unknown size)
#> ├─igraph 1.3.5 [new][bld][cmp][dl] (2.50 MB)
#> │ ├─magrittr 2.0.3 [new][bld][cmp][dl] (267.07 kB)
#> │ ├─Matrix 1.5-1 < 1.5-3 [old]
#> │ │ └─lattice 0.20-45 
#> │ ├─pkgconfig 2.0.3 [new][bld][dl] (6.08 kB)
#> │ └─rlang 1.0.6 [new][bld][cmp][dl] (742.51 kB)
#> ├─Matrix
#> └─Rcpp 1.0.10 [new][bld][cmp][dl] (2.94 MB)
#> 
#> Key:  [new] new | [old] outdated | [dl] download | [bld] build | [cmp] compile

Created on 2023-02-07 with reprex v2.0.2

Apologies if this is irrelevant, but I am still not that familiar with the code base of gran :)

schochastics commented 1 year ago

A way without pkgdepends could be this one:

get_sha <- function(repo, date) {
  k <- 1
  repeat {
    commits <- gh::gh(paste0("GET /repos/", repo, "/commits"), per_page = 100, page = k)
    dates <- sapply(commits, function(x) x$commit$committer$date)
    idx <- which(dates <= date)[1]  # commits are newest first; NA if none on this page
    if (!is.na(idx)) break
    k <- k + 1  # no commit old enough on this page; fetch the next one
  }
  list(sha = commits[[idx]]$sha, x_pubdate = dates[[idx]])
}

repo <- "schochastics/netUtils"
snapshot_date <- "2020-08-26"
snapshot_date <- anytime::anytime(snapshot_date, tz = "UTC", asUTC = TRUE)
sha <- get_sha(repo,snapshot_date)  
repo_descr <- gh::gh(paste0("GET /repos/",repo,"/contents/DESCRIPTION"),ref=sha$sha)
descr_df <- as.data.frame(read.dcf(url(repo_descr$download_url)))
descr_df
#>       Package                                      Title    Version
#> 1 igraphUtils A Collection of Network Analytic Functions 0.1.0.9000
#>                                                                                                        Authors@R
#> 1 person(given = "David",\nfamily = "Schoch",\nrole = c("aut", "cre"),\nemail = "david.schoch@manchester.ac.uk")
#>                                                                                        Description
#> 1 Provides a collection of network analytic functions that may not deserve a package on their own.
#>              License Encoding LazyData               Roxygen RoxygenNote
#> 1 MIT + file LICENSE    UTF-8     true list(markdown = TRUE)       7.1.0
#>              LinkingTo       Imports
#> 1 Rcpp,\nRcppArmadillo Rcpp,\nigraph

No additional dependencies except gh, which one probably needs anyway, but we need to parse the DESCRIPTION fields ourselves.

Created on 2023-02-08 with reprex v2.0.2
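Parsing a dependency field (Imports/Depends/LinkingTo) into package names could be a minimal sketch like this; `parse_deps` is a hypothetical helper, not existing gran code:

```r
## Sketch: turn a raw DESCRIPTION field such as "Rcpp,\nigraph" or
## "R (>= 3.5), Rcpp (>= 1.0.0)" into a vector of package names
parse_deps <- function(field) {
  if (is.null(field) || is.na(field)) return(character(0))
  items <- strsplit(field, ",")[[1]]
  items <- gsub("\\(.*\\)", "", items)  # drop version constraints like "(>= 1.0.0)"
  items <- trimws(items)
  items[items != "R" & nzchar(items)]   # "R (>= x)" is not a package
}

parse_deps("Rcpp,\nigraph")  # c("Rcpp", "igraph")
```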

chainsawriot commented 1 year ago

@schochastics So now you are a CTB.

I thought about using pkgdepends previously (#1), but decided not to use it because pkgsearch::cran_package_history provides enough information (for CRAN packages).

In the long run, I think we might be better off using pkgdepends (because it supports Bioconductor etc.). Opening up gran to GitHub also means opening up to DESCRIPTION fields such as Remotes, and pkgdepends supports these.

For now, I will take your get_sha and read.dcf approach.

chainsawriot commented 1 year ago

Tagging as v0.1 for now. Dunno if it can make it.

schochastics commented 1 year ago

> @schochastics So now you are a CTB.

:)

> For now, I will take your get_sha and read.dcf approach.

Do you want to take over integrating this into the package? Otherwise, I'll give it a shot.

chainsawriot commented 1 year ago

@schochastics Please give it a shot (and be AUT)!

schochastics commented 1 year ago

I have a working version in my fork in the gh branch. The problem is system requirements. Not sure we can get these reliably from the DESCRIPTION. Example: igraph's DESCRIPTION:

SystemRequirements:
    gmp (optional),
    libxml2 (optional),
    glpk (>= 4.57, optional)
R> remotes::system_requirements(package = "igraph", os = "ubuntu", os_release = "20.04")
[1] "apt-get install -y libglpk-dev" "apt-get install -y libgmp3-dev"
[3] "apt-get install -y libxml2-dev"

chainsawriot commented 1 year ago

@schochastics For now, an interim solution is to put the names of non-CRAN packages in a special slot inside the granlist object (e.g. output$noncran_pkgs; I don't want to call it gh_pkgs because we might need to include Bioconductor or even local packages in the future). And probably those non-CRAN packages would only be in output$grans[[x]]$original (but not output$grans[[x]]$deps, if we don't support those nonstandard DESCRIPTION fields for now). Those non-CRAN packages need special treatment anyway for export_granlist (they will probably need to be installed last, preferably being cached).

When getting Sysreqs, the packages in noncran_pkgs need to be separated from CRAN packages. For CRAN packages, do the usual remotes::system_requirements thing.

For gh packages, we need to get their DESCRIPTION again (or if we can, cache the DESCRIPTION file from the previous step) and do this:

https://github.com/r-lib/remotes/blob/88fdc4eb6e64a02528d7289e1cdda6948027c301/R/system_requirements.R#L66-L88

schochastics commented 1 year ago

Thanks I'll try to get this done.

Different question: how would you indicate a GitHub package when providing a list of packages to resolve? My current implementation is to interpret everything with a "/" as coming from GitHub:

resolve(c("rtoot","schochastics/rtoot"))

calls .get_snapshot_dependencies_cran() for rtoot and .get_snapshot_dependencies_gh() for schochastics/rtoot. Not sure if this is the best way, but it is certainly the simplest?
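That dispatch rule is easy to keep in one place; a sketch (the function names here are made up, not the actual implementation):

```r
## Sketch: route package references by the presence of "/"
is_github <- function(pkgs) grepl("/", pkgs, fixed = TRUE)

dispatch_source <- function(pkgs) {
  split(pkgs, ifelse(is_github(pkgs), "gh", "cran"))
}

dispatch_source(c("rtoot", "schochastics/rtoot"))
# list(cran = "rtoot", gh = "schochastics/rtoot")
```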

chainsawriot commented 1 year ago

@schochastics

Slash is fine.

chainsawriot commented 1 year ago

devtools <= 1.5 should use repo and username separately (older versions, e.g. version < 1, only support username), while > 1.5 (i.e. 1.6.1 onwards, snapshot_date >= 2014-10-07) should use username/repo, because username is deprecated there.
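A sketch of generating the matching installation call; `install_github_call` is a hypothetical helper, with the version cutoff taken from the comment above:

```r
## Sketch: emit the install_github() call appropriate for the devtools
## version available at the snapshot date (helper name is made up)
install_github_call <- function(handle, devtools_version) {
  parts <- strsplit(handle, "/", fixed = TRUE)[[1]]
  if (utils::compareVersion(devtools_version, "1.5") <= 0) {
    ## devtools <= 1.5: username and repo as separate arguments
    sprintf('install_github(repo = "%s", username = "%s")', parts[2], parts[1])
  } else {
    ## devtools >= 1.6.1: "username/repo"; the username argument is deprecated
    sprintf('install_github("%s")', handle)
  }
}
```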

schochastics commented 1 year ago

ah I remember that change! I can fix that.

chainsawriot commented 1 year ago

Archiving GH packages

https://api.github.com/repos/schochastics/rtoot/tarball/50420ed

And then R CMD build it?

schochastics commented 1 year ago

I guess that way we could get around the devtools issue?

schochastics commented 1 year ago

One can install that tarball directly?!?!

R> install.packages("~/Downloads/schochastics-rtoot-v0.2.0-11-g50420ed.tar.gz")
Installing package into ‘/home/david/R/x86_64-pc-linux-gnu-library/4.2’
(as ‘lib’ is unspecified)
inferring 'repos = NULL' from 'pkgs'
Warning in untar2(tarfile, files, list, exdir, restore_times) :
  skipping pax global extended headers
* installing *source* package ‘rtoot’ ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
*** copying figures
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (rtoot)

Something tells me that this is probably not a good idea

chainsawriot commented 1 year ago

@schochastics Yes. The R CMD build step afterwards is simply for getting rid of the unnecessary files specified in .Rbuildignore, building vignettes, checking, and all those sundries. It is not really necessary for many (well-developed) packages such as rtoot.

My proposal above was mainly for dealing with the cache option of dockerize. But if it can be generalized and avoids the need for devtools/remotes just for install_github, that would be tremendously helpful.

And this is the super hacky version of devtools::install_github without any dependency, for use inside the container.

(R has a function for untarring, but it was super buggy before R 4. I have direct bad experience with it.)

(This also inspires me that the limitation of R > 2.1 #14 can be eliminated by doing a stupid thing like system(command = paste("R CMD INSTALL", tarball_path)). Again, system2 is nicer but is a recent phenomenon.)

pkg <- "schochastics/rtoot"
sha <- "50420ed"
x <- tempfile(fileext = ".tar.gz")
y <- tempdir(check = TRUE)
## one concern is that Woody can't do proper https authentication; but actually http works as well
download.file(paste("https://api.github.com/repos/", pkg, "/tarball/", sha, sep = ""), destfile = x)
system(command = paste("tar", "-zxf", x, "-C", y))
system(command = paste("R", "CMD", "build", list.dirs(path = y, recursive = FALSE))) # there can be multiple directories if y is reused
## TODO: Need a way to generate `tarball_path`
tarball_path <- "rtoot_0.2.0.9000.tar.gz"
install.packages(tarball_path, repos = NULL)
unlink(tarball_path)

It also brings us to another issue: should we store x and x_version as usual for GH packages, i.e. package name and version as per DESCRIPTION, so that we can generate tarball_path as usual? It would also be beneficial for cases such as igraphUtils / netUtils.

We can store x, "schochastics/rtoot" and sha somewhere else, e.g. my suggestion: cranlist$noncran_pkgs as a vector/dataframe.

## `type` can be extended to "bioc", "local"
## handle can be github path, local path, or bioc package name
## local probably doesn't need ref, bioc might store version as ref.

data.frame(x = c("rtoot", "igraphUtils"), type = c("github", "github"), handle = c("schochastics/rtoot", "schochastics/netUtils"), ref = c("50420ed", "5e2f3ab"))

Another way is to stay as it is now and look at the DESCRIPTION once again in y to get the ACTUAL name and version during the container building time.
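Since R CMD build names the tarball Package_Version.tar.gz, reading the extracted DESCRIPTION would give tarball_path directly; a sketch (`tarball_name` is a made-up helper):

```r
## Sketch: derive the tarball file name from a DESCRIPTION file,
## mirroring the "Package_Version.tar.gz" convention of R CMD build
tarball_name <- function(description_path) {
  descr <- read.dcf(description_path, fields = c("Package", "Version"))
  paste0(descr[1, "Package"], "_", descr[1, "Version"], ".tar.gz")
}
```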

schochastics commented 1 year ago

Just finished almost exactly the same hack-ish solution. If we can avoid certain issues with system, why not use it? The "R CMD" stuff is probably the most stable thing we have?

One could get the tar file name like this, but no idea how stable this really is:

res <- system(command = paste("R", "CMD", "build", list.dirs(path = y, recursive = FALSE)), intern = TRUE)
tar_file_line <- res[grepl("\\.tar\\.gz", res)]
tar_file_line
# still needs a regex to extract the tar.gz file name from that line

I will work on this a bit more. devtools is a bit of a pain with its dependencies

schochastics commented 1 year ago

> We can store x, "schochastics/rtoot" and sha somewhere else, e.g. my suggestion: cranlist$noncran_pkgs as a vector/dataframe.

I will toy around with this, but I noticed that it becomes complicated quickly to drag along everything. The sha is enough to recreate what we need, though maybe in a cumbersome way.

Creating that dataframe might however be helpful just as a reference

schochastics commented 1 year ago

This can deal with pkg renaming. Obviously it still needs some error handling:

pkg <- "schochastics/igraphUtils"
sha <- "1b601a3"

x <- tempfile(fileext = ".tar.gz")
y <- tempdir(check = TRUE)
download.file(paste("https://api.github.com/repos/", pkg, "/tarball/", sha, sep = ""), destfile = x)
system(command = paste("tar", "-zxf", x, "-C", y))
dlist <- list.dirs(path = y, recursive = FALSE)
pkg_dir <- dlist[grepl(sha, dlist)] # the sha allows us to identify the dir uniquely
res <- system(command = paste("cd", y, "&& R CMD build", pkg_dir), intern = TRUE)
flist <- list.files(y, pattern = "\\.tar\\.gz$", recursive = FALSE)
tarball_path <- paste0(y, "/", flist[vapply(flist, function(x) any(grepl(x, res, fixed = TRUE)), logical(1))])
install.packages(tarball_path, repos = NULL)
unlink(tarball_path)

chainsawriot commented 1 year ago

So, let's make it like that in header.R for now.

We don't need a lot of error handling in the container building part. It's better to err when things go wrong there.

chainsawriot commented 1 year ago

5291eae

chainsawriot commented 1 year ago

This is the "for the sake of argument" test case:

x <- resolve("cran/sna", "2005-05-01")

It will generate the earliest supported version of R (2.1.0) but with a GitHub package.

schochastics commented 1 year ago

This happens in the v0.1 branch when dockerizing the above:

FROM debian/eol:
ENV TZ UTC
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone && apt-get update -qq && apt-get install wget locales build-essential r-base-dev  -y
COPY rang.R ./rang.R
COPY compile_r.sh ./compile_r.sh
RUN apt-get update -qq && apt-get install -y libfreetype6-dev libgl1-mesa-dev libglu1-mesa-dev libicu-dev libpng-dev make pandoc zlib1g-dev
RUN bash compile_r.sh 2.1.0
CMD ["R"]
Sending build context to Docker daemon  10.75kB
Step 1/8 : FROM debian/eol:
invalid reference format

Looks like debian_version is missing.
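For reference, with the release codename filled in, the first line would presumably read something like this (the woody tag is an assumption, based on the Woody mentions elsewhere in this thread; debian/eol images are tagged by release codename):

```dockerfile
FROM debian/eol:woody
```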

chainsawriot commented 1 year ago

Yes https://github.com/chainsawriot/rang/tree/fixrang

chainsawriot commented 1 year ago

The GitHub download is not possible inside Woody. We need to warn the users and ask them to use cache instead.

chainsawriot commented 1 year ago

And really old packages can't be built on modern R, e.g. cran/sna.

We need to download them now, transfer them into the container, and build them there instead. (So complicated...)
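The flow sketched above would amount to Dockerfile lines like these (file names and paths are hypothetical): fetch the tarball on the host at resolve time, COPY it into the image, and install it with the era-appropriate R inside the container:

```dockerfile
# Sketch (hypothetical paths): tarball fetched on the host beforehand
COPY rtoot_0.2.0.9000.tar.gz /tmp/rtoot_0.2.0.9000.tar.gz
RUN R CMD INSTALL /tmp/rtoot_0.2.0.9000.tar.gz
```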

chainsawriot commented 1 year ago

I think this is done.