alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Process for adding python/R/Julia/Ubuntu/etc. packages #622

Closed jemrobinson closed 2 years ago

jemrobinson commented 4 years ago

We have two ways in which software packages can be made available for users inside an SRE:

  1. Baked into our "batteries included" VM image

    • Ubuntu packages
    • Julia packages
    • python packages
    • R packages
  2. Available from our mirrors of external repositories

    • python packages
    • R packages

We need to come up with a process for deciding how to approve what goes into (1) and what goes into (2). Specifically, under what circumstances would we say no to a request for a package and how do we decide whether this request should go into (1) or (2)? Where does the liability belong if a malicious package is included?

@thobson88 is going to take a look at this (initially in the context of the request for ~100 new R packages on #615).

Stages

thobson88 commented 4 years ago

Here's a proposal for a policy to decide between options 1. & 2. (above) for R & Python packages. I'll consider the other question (of when to accept/reject a package request) in another comment.

These are the criteria I used to judge the ~100 R packages in #615. I don't see any real possibility of automating this process, as it requires a judgement of whether the package is broadly useful to a cross section of researchers.

To be deemed generally useful, and therefore included in the VM image, the package should:

NOTE: based on these criteria, the following packages that are currently in the CRAN VM image list belong instead on the CRAN mirror: ape, bedr, COMBAT, fmsb, quantmod, Seurat, SnowballC, surveillance

thobson88 commented 4 years ago

The other decision, of whether to accept or reject a package, has (at least theoretically) security implications as there's no way to guarantee that a package does not contain malicious code. It's already been considered in #312.

As noted in that ticket, the whitelisting of a particular package does not mean that it's immediately included in either the VM image or the repository mirror. Instead, it's added to the list of packages that will be approved by default in the event that they are requested by a user.

But I think we need to revisit the six criteria originally proposed in #312, because they would exclude some well-established packages and some of them (points 2. & 3.) seem pretty arbitrary and don't add any obvious benefit.

We ought to be able to justify the criteria, so the question is: what makes open source software trustworthy? I'd argue that either of these are reasonable grounds:

Given that most packages on CRAN and PyPI are not digitally signed, we should focus on the first of these. This could be judged based on some combination of the following metadata (available from the package repository):

We could compute a measure of the "time spent in use", measured in days or weeks. For each week since publication, we multiply the number of downloads in that week by the number of weeks that have passed since then, and then sum all the results. We do this separately for the package and for its current version, and if either exceeds some agreed threshold, then we add the package to the "approve-by-default" whitelist. A minimal sketch of that calculation is below (names and numbers are illustrative, and it assumes we already have weekly download counts for a package).
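```python
def days_in_use(weekly_downloads: list[int], weeks_since_publication: int) -> int:
    """Proposed "time spent in use" score, measured here in package-weeks.

    weekly_downloads[i] is the download count in the i-th week after
    publication (oldest first). Each week's downloads are weighted by the
    number of weeks that have elapsed since then, so copies downloaded
    long ago (and presumably exercised since) count for more than copies
    downloaded yesterday.
    """
    total = 0
    for week_index, downloads in enumerate(weekly_downloads):
        weeks_elapsed = weeks_since_publication - week_index
        total += downloads * weeks_elapsed
    return total


# Hypothetical example: a package published 4 weeks ago.
print(days_in_use([10, 25, 40, 5], weeks_since_publication=4))
# 10*4 + 25*3 + 40*2 + 5*1 = 200
```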

@jemrobinson any thoughts?

jemrobinson commented 4 years ago

I really like your criteria for inclusion in the base image. I think we should add these to the docs directory and try to adhere to them when considering whether to add new packages (and possibly remove some old ones if anyone has time to look back over what we currently install).

I also completely agree with:

as a sensible metric for whitelisting.

There's then a question of whether this can easily be turned into a computable metric, which I think is where #312 got stuck. If there is a sensible way to compute this (and therefore possibly automate the decision) then that's great, but otherwise we could keep these as criteria that admins should consider before adding new packages.

We should note that although we have complete control over which packages are included in Turing deployments, that's not true in general and the admins for other deployments might have different ideas about what to whitelist or not.

Do you have any further thoughts on this @martintoreilly?

thobson88 commented 4 years ago

There's then a question of whether this can easily be turned into a computable metric

Here's the "days in use" metric described above, computed for all the packages in the current CRAN whitelist (all versions, during the last year), in a histogram with log scale:

[Attached histogram: hist_log10_days_in_use_1yr]

Assuming we can get this data for the other package repos, could we use it to pick a threshold for "days in use" above which a package is whitelisted-by-default? One illustrative way of doing so is sketched below.
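For example, a rough rule (assuming everything currently on the whitelist is acceptable) would be to take a low percentile of the whitelist's own scores as the cut-off for approving new requests by default. The package names and scores here are made up for illustration:

```python
import statistics

# Hypothetical "days in use" scores for packages on the current CRAN whitelist.
scores = {
    "ggplot2": 1.8e7,
    "dplyr": 2.1e7,
    "jsonlite": 9.5e6,
    "somepkg": 3.4e3,
    "rarepkg": 120.0,
}

# Take the 10th percentile of the existing whitelist's scores as the
# approve-by-default threshold for new requests (the 10% figure is arbitrary).
deciles = statistics.quantiles(scores.values(), n=10)
threshold = deciles[0]
print(f"approve-by-default if days-in-use >= {threshold:,.0f}")
```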

martintoreilly commented 4 years ago

We could compute a measure of the "time spent in use", measured in days or weeks. For each week since publication, we multiply the number of downloads in that week by the number of weeks that have passed since then, and then sum all the results.

I think this is a decaying proxy for days in actual use. An obsolete package that was massively popular 10 years ago will have a high score that will keep increasing, even if no-one has downloaded it for a decade. It feels like we ideally want a consistent level of downloads, stretching up to relatively recently. If we weren't worried about having a machine-computable metric to threshold / weight in an automated decision or quality score for a package, would we really just want to eyeball a downloads-over-time chart for each package?

Are there other ways we can test security more directly (e.g. checking CVEs for package versions, static analysis etc)?
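For illustration, one direct check along these lines would be querying a vulnerability database for each requested package/version. A rough sketch, assuming the OSV query API at https://api.osv.dev (endpoint and payload shape taken from the OSV v1 docs, not something we currently use):

```python
import json
import urllib.request


def known_vulnerabilities(name: str, version: str, ecosystem: str = "PyPI") -> list[dict]:
    """Return known advisories for a specific package version from OSV."""
    query = json.dumps(
        {"version": version, "package": {"name": name, "ecosystem": ecosystem}}
    ).encode()
    request = urllib.request.Request(
        "https://api.osv.dev/v1/query",
        data=query,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    # An empty response means no known vulnerabilities for that version.
    return result.get("vulns", [])


# e.g. an old Jinja2 release with known advisories
print(len(known_vulnerabilities("jinja2", "2.4.1")))
```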

Do we want to consider new versions of packages as inheriting all the quality metrics accumulated over all previous versions? I can see cases where a new version might be much less trusted (massive refactor) or much more trusted (fixes a high-risk security bug).

martintoreilly commented 4 years ago

@thobson88 Where did you get the download data from?

JimMadge commented 4 years ago

Tangentially related: @jemrobinson and I have started building a core/white-listed packages document from scratch here https://hackmd.io/@nHslnPpLRmCxPOmQBcOW-g/SJa8Em4oI, due to the (probably necessarily) large size of the current lists.

jemrobinson commented 4 years ago

@thobson88 Maybe reversing your recency weighting would help (i.e. a download 1 yr ago is worth less than a download yesterday)? Possibly exponentially? Something like ndownloads * exp(-A * ndays) rather than ndownloads * ndays inside your integral? This should deal with @martintoreilly 's point.
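A minimal sketch of that exponentially decayed variant, assuming daily download counts are available (the decay rate A is an arbitrary illustrative value):

```python
import math


def recency_weighted_downloads(daily_downloads: list[int], decay_rate: float = 0.01) -> float:
    """Recency-weighted download score.

    daily_downloads[i] is the download count i days ago (index 0 = today).
    Each day's downloads are weighted by exp(-A * ndays), so recent
    downloads count for more and long-obsolete packages decay towards
    zero. With A = 0.01, a download from ~69 days ago is worth half of
    one from today.
    """
    return sum(
        downloads * math.exp(-decay_rate * days_ago)
        for days_ago, downloads in enumerate(daily_downloads)
    )


# Hypothetical example: steady recent downloads score higher than a burst long ago.
print(recency_weighted_downloads([100, 120, 110, 95]))
```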

thobson88 commented 4 years ago

@thobson88 Where did you get the download data from?

I used the cran_downloads function from the cranlogs package. Currently seeking equivalents for the other repositories.
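For reference, a rough Python equivalent of `cran_downloads`, using the web API that backs the cranlogs package (the endpoint and response shape are assumed from https://cranlogs.r-pkg.org and may need adjusting):

```python
import json
import urllib.request


def cran_daily_downloads(package: str, period: str = "last-month") -> list[dict]:
    """Fetch daily CRAN download counts for one package via the cranlogs web API."""
    url = f"https://cranlogs.r-pkg.org/downloads/daily/{period}/{package}"
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    # The API returns one record per requested package, each with a
    # "downloads" list of {"day": ..., "downloads": ...} entries.
    return data[0].get("downloads", [])


print(cran_daily_downloads("ggplot2")[:3])
```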

martintoreilly commented 4 years ago

For PyPI, download stats can be accessed from the Google BigQuery PyPI stats tables. The PyPI stats API directs to the Google tables for bulk operations.
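For small-scale checks (as opposed to bulk BigQuery queries), something like the pypistats.org JSON API could provide equivalent per-package data. A rough sketch, with the endpoint and response shape assumed from pypistats.org:

```python
import json
import urllib.request


def pypi_daily_downloads(package: str) -> list[dict]:
    """Fetch daily PyPI download counts for one package from pypistats.org."""
    url = f"https://pypistats.org/api/packages/{package}/overall"
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    # "data" is a list of {"category", "date", "downloads"} records;
    # "category" distinguishes counts with and without mirror traffic.
    return payload["data"]


print(pypi_daily_downloads("numpy")[:3])
```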

martintoreilly commented 4 years ago

Vulnerability checking

Vulnerability databases

Python

General

jemrobinson commented 4 years ago

What are we worried about?

What are we not (so) worried about?

What are the risks for the Safe Haven?

At a deeper level the main worries are:

Which types of actors are we worried about?

Which types of actors are we not worried about?

Conclusions: It seems like we want to push the burden of decision making onto the person making the request.

Policy

@martintoreilly 's thoughts:

Questions for request-makers

Summary statistics for decision makers

thobson88 commented 4 years ago

@jemrobinson I've written up a first draft in PR #671.

martintoreilly commented 4 years ago

Existence on other default package lists

I think we should add whether a package has been included in other supported package lists as part of our quality assurance signal. This surely says something about how widely useful something is?

martintoreilly commented 4 years ago

PyPI malware checks

As of Feb / March 2020, the Warehouse repository backing PyPI has had tooling in place to run malware checks on package upload, on an automated schedule, and via manual admin trigger.

It looks like the hooks are there, but it's not clear what, if any, real checks are running in production.

Another part of the same work is looking to incorporate The Update Framework (TUF) for more secure package updates.

martintoreilly commented 4 years ago

I found an R package CVE - https://www.cvedetails.com/cve/CVE-2008-3931/

JimMadge commented 2 years ago

Closing as stale and open-ended. This would be better placed in a discussion until we have a concrete proposal.