alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Process for adding python/R/Julia/Ubuntu/etc. packages #622

Closed jemrobinson closed 2 years ago

jemrobinson commented 4 years ago

We have two ways in which software packages can be made available for users inside an SRE:

  1. Baked into our "batteries included" VM image

    • Ubuntu packages
    • Julia packages
    • python packages
    • R packages
  2. Available from our mirrors of external repositories

    • python packages
    • R packages

We need to come up with a process for deciding how to approve what goes into (1) and what goes into (2). Specifically, under what circumstances would we say no to a request for a package and how do we decide whether this request should go into (1) or (2)? Where does the liability belong if a malicious package is included?

@thobson88 is going to take a look at this (initially in the context of the request for ~100 new R packages on #615).

Stages

thobson88 commented 4 years ago

Here's a proposal for a policy to decide between options 1. & 2. (above) for R & Python packages. I'll consider the other question (of when to accept/reject a package request) in another comment.

These are the criteria I used to judge the ~100 R packages in #615. I don't see any real possibility of automating this process, as it requires a judgement of whether the package is broadly useful to a cross section of researchers.

To be deemed generally useful, and therefore included in the VM image, the package should:

NOTE: based on these criteria, the following packages that are currently in the CRAN VM image list belong instead on the CRAN mirror: ape, bedr, COMBAT, fmsb, quantmod, Seurat, SnowballC, surveillance

thobson88 commented 4 years ago

The other decision, of whether to accept or reject a package, has (at least theoretically) security implications as there's no way to guarantee that a package does not contain malicious code. It's already been considered in #312.

As noted in that ticket, the whitelisting of a particular package does not mean that it's immediately included in either the VM image or the repository mirror. Instead, it's added to the list of packages that will be approved by default in the event that they are requested by a user.

But I think we need to revisit the six criteria originally proposed in #312, because they would exclude some well-established packages and some of them (points 2. & 3.) seem pretty arbitrary and don't add any obvious benefit.

We ought to be able to justify the criteria, so the question is: what makes open source software trustworthy? I'd argue that either of these are reasonable grounds:

Given that most packages on CRAN and PyPI are not digitally signed, we should focus on the first of these. This could be judged based on some combination of the following metadata (available from the package repository):

We could compute a measure of the "time spent in use", measured in days or weeks. For each week since publication, we multiply the number of downloads in that week by the number of weeks that have passed since then, and then sum all the results. We do this separately for the package and for its current version, and if either exceeds some agreed threshold, then we add the package to the "approve-by-default" whitelist. A minimal sketch of that calculation is below (names and numbers are illustrative, and it assumes we already have weekly download counts for a package).
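```python
def days_in_use(weekly_downloads: list[int], weeks_since_publication: int) -> int:
    """Proposed "time spent in use" score, measured here in package-weeks.

    weekly_downloads[i] is the download count in the i-th week after
    publication (oldest first). Each week's downloads are weighted by the
    number of weeks that have elapsed since then, so copies downloaded
    long ago (and presumably exercised since) count for more than copies
    downloaded yesterday.
    """
    total = 0
    for week_index, downloads in enumerate(weekly_downloads):
        weeks_elapsed = weeks_since_publication - week_index
        total += downloads * weeks_elapsed
    return total


# Hypothetical example: a package published 4 weeks ago.
print(days_in_use([10, 25, 40, 5], weeks_since_publication=4))
# 10*4 + 25*3 + 40*2 + 5*1 = 200
```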

@jemrobinson any thoughts?

jemrobinson commented 4 years ago

I really like your criteria for inclusion in the base image. I think we should add these to the docs directory and try to adhere to them when considering whether to add new packages (and possibly remove some old ones if anyone has time to look back over what we currently install).

I also completely agree with:

as a sensible metric for whitelisting.

There's then a question of whether this can easily be turned into a computable metric, which I think is where #312 got stuck. If there is a sensible way to compute this (and therefore possibly automate the decision) then that's great, but otherwise we could keep these as criteria that admins should consider before adding new packages.

We should note that although we have complete control over which packages are included in Turing deployments, that's not true in general and the admins for other deployments might have different ideas about what to whitelist or not.

Do you have any further thoughts on this @martintoreilly?

thobson88 commented 4 years ago

There's then a question of whether this can easily be turned into a computable metric

Here's the "days in use" metric described above, computed for all the packages in the current CRAN whitelist (all versions, during the last year), in a histogram with log scale:

[Attached histogram: hist_log10_days_in_use_1yr]

Assuming we can get this data for the other package repos, could we use it to pick a threshold for "days in use" above which a package is whitelisted-by-default? One illustrative way of doing so is sketched below.
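For example, a rough rule (assuming everything currently on the whitelist is acceptable) would be to take a low percentile of the whitelist's own scores as the cut-off for approving new requests by default. The package names and scores here are made up for illustration:

```python
import statistics

# Hypothetical "days in use" scores for packages on the current CRAN whitelist.
scores = {
    "ggplot2": 1.8e7,
    "dplyr": 2.1e7,
    "jsonlite": 9.5e6,
    "somepkg": 3.4e3,
    "rarepkg": 120.0,
}

# Take the 10th percentile of the existing whitelist's scores as the
# approve-by-default threshold for new requests (the 10% figure is arbitrary).
deciles = statistics.quantiles(scores.values(), n=10)
threshold = deciles[0]
print(f"approve-by-default if days-in-use >= {threshold:,.0f}")
```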

martintoreilly commented 4 years ago

We could compute a measure of the "time spent in use", measured in days or weeks. For each week since publication, we multiply the number of downloads in that week by the number of weeks that have passed since then, and then sum all the results.

I think this is a decaying proxy for days in actual use. An obsolete package that was massively popular 10 years ago will have a high score that will keep increasing, even if no-one has downloaded it for a decade. It feels like we ideally want a consistent level of downloads, stretching up to relatively recently. If we weren't worried about having a machine-computable metric to threshold / weight in an automated decision or quality score for a package, would we really just want to eyeball a downloads-over-time chart for each package?

Are there other ways we can test security more directly (e.g. checking CVEs for package versions, static analysis etc)?
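For illustration, one direct check along these lines would be querying a vulnerability database for each requested package/version. A rough sketch, assuming the OSV query API at https://api.osv.dev (endpoint and payload shape taken from the OSV v1 docs, not something we currently use):

```python
import json
import urllib.request


def known_vulnerabilities(name: str, version: str, ecosystem: str = "PyPI") -> list[dict]:
    """Return known advisories for a specific package version from OSV."""
    query = json.dumps(
        {"version": version, "package": {"name": name, "ecosystem": ecosystem}}
    ).encode()
    request = urllib.request.Request(
        "https://api.osv.dev/v1/query",
        data=query,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    # An empty response means no known vulnerabilities for that version.
    return result.get("vulns", [])


# e.g. an old Jinja2 release with known advisories
print(len(known_vulnerabilities("jinja2", "2.4.1")))
```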

Do we want to consider new versions of packages as inheriting all the quality metrics accumulated over all previous versions? I can see cases where a new version might be much less trusted (massive refactor) or much more trusted (fixes a high-risk security bug).

martintoreilly commented 4 years ago

@thobson88 Where did you get the download data from?

JimMadge commented 4 years ago

Tangentially related: @jemrobinson and I have started building a core/white-listed packages document from scratch here https://hackmd.io/@nHslnPpLRmCxPOmQBcOW-g/SJa8Em4oI, due to the (probably necessarily) large size of the current lists.

jemrobinson commented 4 years ago

@thobson88 Maybe reversing your recency weighting would help (i.e. a download 1 yr ago is worth less than a download yesterday)? Possibly exponentially? Something like ndownloads * exp(-A * ndays) rather than ndownloads * ndays inside your integral? This should deal with @martintoreilly 's point.
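A minimal sketch of that exponentially decayed variant, assuming daily download counts are available (the decay rate A is an arbitrary illustrative value):

```python
import math


def recency_weighted_downloads(daily_downloads: list[int], decay_rate: float = 0.01) -> float:
    """Recency-weighted download score.

    daily_downloads[i] is the download count i days ago (index 0 = today).
    Each day's downloads are weighted by exp(-A * ndays), so recent
    downloads count for more and long-obsolete packages decay towards
    zero. With A = 0.01, a download from ~69 days ago is worth half of
    one from today.
    """
    return sum(
        downloads * math.exp(-decay_rate * days_ago)
        for days_ago, downloads in enumerate(daily_downloads)
    )


# Hypothetical example: steady recent downloads score higher than a burst long ago.
print(recency_weighted_downloads([100, 120, 110, 95]))
```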

thobson88 commented 4 years ago

@thobson88 Where did you get the download data from?

I used the cran_downloads function from the cranlogs package. Currently seeking equivalents for the other repositories.
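For reference, a rough Python equivalent of `cran_downloads`, using the web API that backs the cranlogs package (the endpoint and response shape are assumed from https://cranlogs.r-pkg.org and may need adjusting):

```python
import json
import urllib.request


def cran_daily_downloads(package: str, period: str = "last-month") -> list[dict]:
    """Fetch daily CRAN download counts for one package via the cranlogs web API."""
    url = f"https://cranlogs.r-pkg.org/downloads/daily/{period}/{package}"
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    # The API returns one record per requested package, each with a
    # "downloads" list of {"day": ..., "downloads": ...} entries.
    return data[0].get("downloads", [])


print(cran_daily_downloads("ggplot2")[:3])
```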

martintoreilly commented 4 years ago

For PyPI, download stats can be accessed from the Google BigQuery PyPI stats tables. The PyPI stats API directs to the Google tables for bulk operations.
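For small-scale checks (as opposed to bulk BigQuery queries), something like the pypistats.org JSON API could provide equivalent per-package data. A rough sketch, with the endpoint and response shape assumed from pypistats.org:

```python
import json
import urllib.request


def pypi_daily_downloads(package: str) -> list[dict]:
    """Fetch daily PyPI download counts for one package from pypistats.org."""
    url = f"https://pypistats.org/api/packages/{package}/overall"
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    # "data" is a list of {"category", "date", "downloads"} records;
    # "category" distinguishes counts with and without mirror traffic.
    return payload["data"]


print(pypi_daily_downloads("numpy")[:3])
```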

martintoreilly commented 4 years ago

Vulnerability checking

Vulnerability databases

Python

General

jemrobinson commented 4 years ago

What are we worried about?

What are we not (so) worried about?

What are the risks for the Safe Haven?

At a deeper level the main worries are:

Which types of actors are we worried about?

Which types of actors are we not worried about?

Conclusions: It seems like we want to push the burden of decision making onto the person making the request.

Policy

@martintoreilly 's thoughts:

Questions for request-makers

Summary statistics for decision makers

thobson88 commented 4 years ago

@jemrobinson I've written up a first draft in PR #671.

martintoreilly commented 4 years ago

Existence on other default package lists

I think we should add whether a package has been included in other supported package lists as part of our quality assurance signal. This surely says something about how widely useful something is?

martintoreilly commented 4 years ago

PyPI malware checks

As of Feb / March 2020, the Warehouse repository backing PyPI has had tooling in place to run malware checks on package upload, on an automated schedule, and via manual admin trigger.

It looks like the hooks are there, but it's not clear what, if any, real checks are running in production.

Another part of the same work is looking to incorporate The Update Framework (TUF) for more secure package updates.

martintoreilly commented 4 years ago

I found an R package CVE - https://www.cvedetails.com/cve/CVE-2008-3931/

JimMadge commented 2 years ago

Closing as stale and open-ended. This would be better placed in a discussion until we have a concrete proposal.