cran-task-views / ctv

CRAN Task View Initiative

main vs. master branch #21

Closed zeileis closed 1 year ago

zeileis commented 2 years ago

Is there a simple way to find out what the main branch in a GitHub repository is called? I'm asking because the CRAN script we're planning to use downloads all .md files from

https://raw.githubusercontent.com/cran-task-views/%s/main/%s.md

where %s is substituted by the task view name.
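As a minimal sketch of that substitution (the helper name `task_view_url` and its `branch` argument are made up for illustration and are not taken from the CRAN script), the template can be filled with `sprintf`. A repository whose default branch is not called main produces a URL that does not resolve.

```r
# Hypothetical helper: fill the task view name into both "%s" slots.
# `branch` defaults to "main", the name the template above hard-codes.
task_view_url <- function(view, branch = "main") {
  sprintf("https://raw.githubusercontent.com/cran-task-views/%s/%s/%s.md",
          view, branch, view)
}

task_view_url("Spatial")
```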

I have used the main branch because some groups find master offensive (due to the connection to slavery) and I find main the better description anyway. The only task view maintainer who objected to this is you, Dirk @eddelbuettel. Would you please reconsider your decision or provide some way for us to determine the name of the main/master branch?

rsbivand commented 2 years ago

https://davidwalsh.name/get-default-branch-name suggests using git remote show and extracting the value from the output. https://usethis.r-lib.org/reference/git-default-branch.html suggests a usethis approach. Untried.

zeileis commented 2 years ago

Thanks. The first requires git to be available and the second that the repositories are actually cloned, I think. But maybe I'm missing something. A unified name would still be preferable, I think.
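For reference, git can report a remote's default branch without a clone through `git ls-remote --symref <url> HEAD`, although it still requires git to be installed. This is a different command from the `git remote show` mentioned above. The sketch below parses that output; the function names `parse_symref` and `remote_default_branch` are hypothetical.

```r
# Parse the output of `git ls-remote --symref <url> HEAD`.
# Its first line has the form "ref: refs/heads/<branch>\tHEAD".
parse_symref <- function(lines) {
  ref <- grep("^ref: refs/heads/", lines, value = TRUE)
  if (length(ref) == 0L) return(NA_character_)
  sub("^ref: refs/heads/([^\t]+)\tHEAD$", "\\1", ref[1])
}

# Hypothetical wrapper; needs git on the PATH and network access.
remote_default_branch <- function(url) {
  out <- system2("git", c("ls-remote", "--symref", url, "HEAD"), stdout = TRUE)
  parse_symref(out)
}
```

Calling `remote_default_branch("https://github.com/cran-task-views/Finance.git")` would then return the branch name as a string, or `NA` if no symbolic ref line is found.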

rsbivand commented 2 years ago

From the prompt:

$ curl https://api.github.com/repos/cran-task-views/Spatial/branches
[
  {
    "name": "jnrev",
    "commit": {
      "sha": "259c62603e8bbe5358211a3006b8ae040dded1ad",
      "url": "https://api.github.com/repos/cran-task-views/Spatial/commits/259c62603e8bbe5358211a3006b8ae040dded1ad"
    },
    "protected": false
  },
  {
    "name": "main",
    "commit": {
      "sha": "bf431b6e04395e818673b2467c750da268e735d0",
      "url": "https://api.github.com/repos/cran-task-views/Spatial/commits/bf431b6e04395e818673b2467c750da268e735d0"
    },
    "protected": false
  }
]

In R with curl:

> library(curl)
> req <- curl_fetch_memory("https://api.github.com/repos/cran-task-views/Spatial/branches")
> jsonlite::prettify(rawToChar(req$content))
[
    {
        "name": "jnrev",
        "commit": {
            "sha": "259c62603e8bbe5358211a3006b8ae040dded1ad",
            "url": "https://api.github.com/repos/cran-task-views/Spatial/commits/259c62603e8bbe5358211a3006b8ae040dded1ad"
        },
        "protected": false
    },
    {
        "name": "main",
        "commit": {
            "sha": "bf431b6e04395e818673b2467c750da268e735d0",
            "url": "https://api.github.com/repos/cran-task-views/Spatial/commits/bf431b6e04395e818673b2467c750da268e735d0"
        },
        "protected": false
    }
]
> grep("main", rawToChar(req$content))
[1] 1
> grep("master", rawToChar(req$content))
integer(0)
> req <- curl_fetch_memory("https://api.github.com/repos/cran-task-views/HighPerformanceComputing/branches")
> grep("main", rawToChar(req$content))
integer(0)
> grep("master", rawToChar(req$content))
[1] 1
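Note that `grep` on the raw JSON text matches substrings anywhere in the response, so a branch named, say, "maintenance" would also count as "main". A slightly more robust sketch parses the JSON with jsonlite and compares branch names exactly; `branch_names` and `has_branch` are hypothetical helper names.

```r
library(jsonlite)

# Return the exact branch names from the JSON text of the /branches endpoint.
branch_names <- function(json_text) {
  fromJSON(json_text)$name
}

# Exact comparison, so "maintenance" does not count as "main".
has_branch <- function(json_text, branch) {
  branch %in% branch_names(json_text)
}
```

With the session above this would be used as `has_branch(rawToChar(req$content), "main")`. It only says whether a branch of that name exists, not whether it is the default branch.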

I tried with RCurl, but do not know how to set the User-Agent header there, so the request fails with: Request forbidden by administrative rules. Please make sure your request has a User-Agent header (http://developer.github.com/v3/#user-agent-required). Check https://developer.github.com for other possible causes.
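The error message asks for a User-Agent header. A minimal, untested sketch of one way to send it with RCurl is to pass the libcurl `useragent` option to `getURL`. The function name `fetch_with_agent` and the agent string are arbitrary placeholders.

```r
# Sketch: fetch a GitHub API URL with RCurl while sending a User-Agent header.
# The default agent string "ctv-branch-check" is an arbitrary placeholder.
fetch_with_agent <- function(url, agent = "ctv-branch-check") {
  RCurl::getURL(url, useragent = agent)
}
```

For example, `fetch_with_agent("https://api.github.com/repos/cran-task-views/Spatial/branches")` should return the JSON as a single string, assuming RCurl is installed and the network is reachable.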

eddelbuettel commented 2 years ago

You likely want to look into setting a GITHUB_PAT with proper credentials; see the GitHub docs on creating a PAT.

After that the gh package works. I keep helper scripts around to 'walk' through my course org to find my student projects; starting from that gives a fairly straightforward solution _given that I have a GITHUB_PAT in my environment variables_:

res <- gh::gh("GET /orgs/:org/repos", org="cran-task-views", .limit=200)    # ask for info on repos in org
# browse via `str(res[[1]])`, say
do.call(rbind, lapply(res, \(x) data.frame(name=x$name, default_branch=x$default_branch)))

and that gets us

> res <- gh::gh("GET /orgs/:org/repos", org="cran-task-views", .limit=200)
> do.call(rbind, lapply(res, \(x) data.frame(name=x$name, default_branch=x$default_branch)))
                        name default_branch
1            WebTechnologies           main
2            ModelDeployment           main
3                  Hydrology           main
4                  Databases           main
5                        ctv           main
6               Econometrics           main
7       ctv-from-svn-2021-09         master
8               ctv-from-svn         master
9                   Bayesian           main
10                  ChemPhys           main
11                   Cluster           main
12            Environmetrics           main
13        ExperimentalDesign           main
14                   Finance         master
15            FunctionalData           main
16           GraphicalModels           main
17  HighPerformanceComputing         master
18         Hydrology-R-Forge           main
19           MachineLearning           main
20            MedicalImaging           main
21              MetaAnalysis           main
22               MissingData           main
23 NaturalLanguageProcessing           main
24        OfficialStatistics           main
25          Pharmacokinetics           main
26             Psychometrics           main
27      ReproducibleResearch           main
28                   Spatial           main
29                  Survival           main
30        TeachingStatistics           main
31                TimeSeries           main
32                  Tracking           main
33            ClinicalTrials           main
34     DifferentialEquations           main
35             Distributions           main
36              ExtremeValue           main
37                  Genetics           main
38      NumericalMathematics           main
39              Optimization           main
40             Phylogenetics           main
41                    Robust           main
42            SpatioTemporal           main
43   WebTechnologies-R-Forge           main
> 
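Given such a table, the per-repository URL of the .md file can be assembled from the `default_branch` column instead of a hard-coded branch name. The following is only a sketch: `md_urls` is a hypothetical helper, and filtering out repositories that are not task views (such as ctv or ctv-from-svn) is left out.

```r
# `repos` is assumed to be a data frame with columns `name` and
# `default_branch`, like the one printed above.
md_urls <- function(repos) {
  sprintf("https://raw.githubusercontent.com/cran-task-views/%s/%s/%s.md",
          repos$name, repos$default_branch, repos$name)
}

# Two rows copied from the table above.
repos <- data.frame(name = c("Spatial", "Finance"),
                    default_branch = c("main", "master"))
md_urls(repos)
```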

Now, to just get that one file from each repo, if it were me, I might just fetch the repo zip archive and extract. No PAT, no pain:

> tf <- tempfile()
> download.file("https://github.com/cran-task-views/WebTechnologies/archive/refs/heads/main.zip", tf)
trying URL 'https://github.com/cran-task-views/WebTechnologies/archive/refs/heads/main.zip'
downloaded 17 KB

> unzip(tf, files="WebTechnologies-main/WebTechnologies.md")
> head(readLines("WebTechnologies-main/WebTechnologies.md"))
[1] "---"                                  
[2] "name: WebTechnologies"                
[3] "topic: Web Technologies and Services" 
[4] "maintainer: Mauricio Vargas Sepulveda"
[5] "email: mavargas11@uc.cl"              
[6] "version: 2022-01-23"                  
> 

That still has the branch name in the directory but that is readable locally quite easily.
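One way to avoid depending on the branch name in that directory is to look the file up in the archive listing, which `unzip(tf, list = TRUE)$Name` returns. A sketch follows, with a hypothetical helper `find_view_md` and a made-up listing for illustration:

```r
# Given the file names inside an archive, return the path of <view>.md
# regardless of what the top-level directory is called.
find_view_md <- function(paths, view) {
  hit <- paths[basename(paths) == paste0(view, ".md")]
  if (length(hit) == 0L) NA_character_ else hit[1]
}

# Illustrative listing, not copied from a real archive.
paths <- c("WebTechnologies-main/",
           "WebTechnologies-main/README.md",
           "WebTechnologies-main/WebTechnologies.md")
find_view_md(paths, "WebTechnologies")
```

The returned path could then be passed as the `files` argument of `unzip()`.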

eddelbuettel commented 2 years ago

Or, of course, git clone. We are having a discussion here about reinventing the protocol. I use something like the gh call above to find the list of (student) repos, then filter out admin ones, and then in a first pass (== no target dir exists) run git clone and in all subsequent runs run git pull per repo. Perfect to update all repos prior to marking etc. Same here: the list of repos will not change often, nor will the content. So asking git to get us changed content is more or less what git was invented for and there is probably no good reason to reinvent it :wink:
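A rough sketch of that clone-on-first-pass, pull-afterwards logic in R is shown below. The helper names `sync_args` and `sync_repo` are made up here, and this is not the script described above.

```r
# Build the git arguments for one repository: clone on the first pass
# (target directory missing), pull on every later run.
sync_args <- function(name, root = ".") {
  dest <- file.path(root, name)
  if (dir.exists(dest)) {
    c("-C", dest, "pull", "--quiet")
  } else {
    url <- sprintf("https://github.com/cran-task-views/%s.git", name)
    c("clone", "--quiet", url, dest)
  }
}

# Hypothetical driver; needs git on the PATH and network access.
# Paths containing spaces would additionally need shQuote().
sync_repo <- function(name, root = ".") {
  system2("git", sync_args(name, root))
}
```

Neither branch of `sync_args` mentions a branch name, so the default branch of each repository does not matter.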

zeileis commented 2 years ago

Or you could just change from master to main. All of the solutions above are much more involved than the solution we have set up so far. And clearly a unified default branch name would be more transparent anyway.

Pulling would be good as long as the repositories to pull from are unchanged. But it gets more involved when we establish new task views or task views get archived. In the old R-Forge-based setup that required a person with CRAN access to change the list of task views there. However, the new script works without CRAN access.

And I'm sure that there are ways to work around that automatically as well. But I don't think that it is worth the effort just to accommodate a non-main default branch name for some maintainers.

eddelbuettel commented 2 years ago

If you git clone && git pull it is orthogonal to what the default branch is called. The file will just be there. Works for me, but preferences differ.

zeileis commented 2 years ago

We have now implemented the latter solution (git clone & git pull).

Nevertheless I'm in favor of a simple/unified/uncontroversial main default branch. I don't think we have seen arguments against that.

jennybc commented 2 years ago

You can use HEAD as the ref and it will correspond to whatever the default branch is for that repo.

It seems to work in the type of URL you're interested in at the top:

https://raw.githubusercontent.com/cran-task-views/TimeSeries/HEAD/TimeSeries.md (main is default branch)

https://raw.githubusercontent.com/cran-task-views/Finance/HEAD/Finance.md (master is default branch)
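Applied to the URL template from the top of the thread, this amounts to replacing the branch segment with HEAD. A minimal sketch, with the hypothetical helper name `task_view_head_url`:

```r
# Same URL template as at the top of the thread, with HEAD as the ref,
# so the name of the default branch no longer matters.
task_view_head_url <- function(view) {
  sprintf("https://raw.githubusercontent.com/cran-task-views/%s/HEAD/%s.md",
          view, view)
}

task_view_head_url(c("TimeSeries", "Finance"))
# readLines(task_view_head_url("Finance")) would fetch the file (needs network).
```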

zeileis commented 2 years ago

Nice! Thanks for the hint, Jenny.

zeileis commented 1 year ago

Closing this for now, given it wasn't active for a year. Will revisit when we put together guidelines for task view maintainers.