Decide how to whitelist R and Python packages for Tier 3 and higher

jamespjh commented 5 years ago

The protocol should be quick - ideally even scriptable.

I propose packages should:

Have version control on public GitHub or Gitlab
- Measure: Yes/No
Have commits within the last year
- Measure: Number of commits per month for the past 36 months
Have commits by at least three different contributors
- Measure: Number of distinct committers in past month, 3 months, 6 months, 12 months, 24 months, 36 months
Be on a recognised package repository (PyPI, Conda, CRAN, Bioconductor.)
- Measure: Name of package repository plus release date and version number of all releases if available
Have LICENSE and README files (always true if on repository?)
- Measure: Yes/No (plus name of licence if available/inferable, length of README in words and possibly content of LICENSE and README files - Note: Look for any files names LICENSE/LICENCE or README, regardless of suffix (or lack of)
The lead contributor should have an email address (always true if on repository?)
- Measure: Yes/No plus capture of actual email address

These requirements need to be satisfied for each package and their full dependency graph.

Could give other organisations delegated / transitive trust - e.g.

Anaconda
Enthought
MRAN?

Let's research this further. Initially let's script the above criteria and see what survives from PyPI and CRAN. The script should capture the measures above (as should any parallel manual review).

For the current Tier-3 projects:

Assume a small core of necessary packages are safe and then run the above process?

TODO:

Script the above checks for Python. Check satisfaction for PyPI and Anaconda packages (see PR #325).
Check if there are any packages we can't avoid installing without making significant changes to the compute VM build.
Collate the list of known incidents of exploiting dependencies to check which (if any) of the above criteria would protect agains them.
Determine what data to collect to apply evolving criteria to and share with Sebastian to allow some manual investigation to occur in parallel

martintoreilly commented 5 years ago

@jamespjh These requirements would definitely only match a subset of PyPI, Conda, CRAN, Bioconductor etc and I agree we should have a confidence providing process that can actually be achieved as opposed to an assurance process that cannot. However, I would also suggest we run (or rely on someone else's) automated security scan to flag a subset of more obvious / easily detectable security issues.

[Edit: removed Tier 3+ question as answered in comment in PR #304 referencing this issue. I've updated the title to reflect this precess is for Tier 3 and above]

jamespjh commented 5 years ago

We should task someone to script up my criteria above, and find out what % of packages in PyPI meet it, and in particular, what % of our packages in our required list meet it. Sounds like a fun task.

I might do it on Thursday afternoon if I get time.

tomdoel commented 5 years ago

@martintoreilly @jamespjh There are really two tasks here; to implement/run the scripts to see what happens; and then to decide how to revise the protocol according to the results. Do both need to be complete for the DSSG?

martintoreilly commented 5 years ago

I think that is somewhat dependent on the results of the automated checks. We will need to take a whitelisting decision on the packages required for the DSSG Tier 3 projects regardless, but the number of packages we need to manually review will depend on the automatic acceptance criteria.

martintoreilly commented 5 years ago

@jamespjh @tomdoel I'm talking to @darenasc about this issue this afternoon. He may have some time to work on it. @jamespjh Let me know if you start working on it so we don't duplicate effort.

martintoreilly commented 5 years ago

@darenasc @tomdoel @jamespjh I've updated the criteria with non-binary measures I think we should capture from our automated crawl of the package repositories to allow us to explore the impact of tweaking the criteria.

martintoreilly commented 5 years ago

@darenasc Will take a first pass at this from tomorrow (Friday) and should have an initial output for Monday. He'll initially target the most popular Python packages and produce:

A list of the "input" packages with: a. A Yes/No for each criteria for each input package b. A list of dependencies for each input package (@darenasc if this is tricky we could do without it for the first pass) c. A Yes/No for whether all dependencies satisfy each criteria for each input package (@darenasc I think this could just be a single Yes/No flag for "all criteria satisfied by all dependencies" for each input package).
A list of all dependencies encountered across all input packages with: a. A Yes/No for each criteria for each dependency

@darenasc As discusses, please open a draft PR for this and check in your code as you go. We can have any required detailed conversation on the implementation there.

jemrobinson commented 5 years ago

Decision:

implement separate internal/external mirrors with a single whitelist for all Tier-3 environments that will (initially) be the set of package + dependencies that we're already installing (see lists at https://github.com/alan-turing-institute/data-safe-haven/tree/master/new_dsg_environment/azure-vms/package_lists).
to go into Safe Haven Managment production subscription
the active Tier 3 whitelist can be easily updated by a scripted redeploy of the mirrors with an updated whitelist.

martintoreilly commented 5 years ago

@jamespjh and I just had a chat about what we do with the output of the whitelist criteria evaluation for packages. Our view is that we should not proactively add all packages on e.g. PyPI that meet the criteria for whitelisting. Instead, we should wait for packages to be requested and use the fact it meets the whitelisting criteria to allow us to default to approve unless we feel a particular package (or pattern of package requests) needs further investigation. This allows us to support a very fast turnaround for uncontentious packages (which should be the vast majority), while being able to reassure data providers that there is a "person in the loop" on all whitelisting decisions. It also lets us evolve our whitelisting criteria as we get a feel for how well they are met by a range of mainstream and more boutique packages.

martintoreilly commented 5 years ago

From @jamespjh via email:

Part of our automated checks for the DSH dependency tree should check if the license is one of those in the OSI approved list — from the Redhat workshop.

martintoreilly commented 4 years ago

@ots22 @nbarlowATI @edaub @myyong @edwardchalstrey1 @jack89roberts Thoughts?

JimMadge commented 2 years ago

Closing as part of a stale issue cleanup.

alan-turing-institute / data-safe-haven

Decide how to whitelist R and Python packages for Tier 3 and higher #312