Implement Tier-3 whitelist

jemrobinson commented 5 years ago

Following on from #312, we need to implement a whitelisted set of package mirrors. Initially this will be

a single set of internal/external mirrors with a given whitelist
part of the Safe Haven Managment production subscription
added to the documentation on how to deploy a DSG group

martintoreilly commented 5 years ago

@jemrobinson As the whitelisting requires a full list of each top-level packages dependencies to be added, can we do the following for the initial DSSG compute environments?

Install the additional Python packages @vollmersj will provide as part of the build for our base compute VM (or in the deployment cloud.init)
Add the Ubuntu packages @DavidBeavan will provide for LwM?

@DavidBeavan @vollmersj Please can you paste the canonical list of additional packages you need when the safe haven environments are first deployed in a comment below.

DavidBeavan commented 5 years ago

Ubuntu packages

For extract_text bash version:

apt-get install xsltproc (deps=)
apt-get install parallel (deps=libsensors4, sysstat)

Canonical list

xsltproc
parallel
libsensors4
sysstat

Python packages

For LwM Mike's extract_text tool:

lxml>=3.7.3
pep8>=1.7.0
pylint>=1.6.4
pycodestyle>=2.5.0
PyYAML>=3.13
regex>=2018

DavidBeavan commented 5 years ago

Ping @kasra-hosseini and @thobson88 for more intel

martintoreilly commented 5 years ago

We'd like any and all packages required please (Ubuntu, Python and otherwise).

jamespjh commented 5 years ago

Do we have a code snippet for how to generate dependency expansions for a python package?

jamespjh commented 5 years ago

Ooh:

pip install pipdeptree

pipdeptree

jamespjh commented 5 years ago

Example of pipdeptree output:

pyOpenSSL==17.5.0

cryptography [required: >=2.1.4, installed: 2.1.4]
- asn1crypto [required: >=0.21.0, installed: 0.23.0]
- cffi [required: >=1.7, installed: 1.11.2]
  - pycparser [required: Any, installed: 2.18]
- idna [required: >=2.1, installed: 2.6]
- six [required: >=1.4.1, installed: 1.11.0]
six [required: >=1.5.2, installed: 1.11.0]

jamespjh commented 5 years ago

Another example:

spacy==2.1.4
  - blis [required: >=0.2.2,<0.3.0, installed: 0.2.4]
    - numpy [required: >=1.15.0, installed: 1.16.4]
  - cymem [required: >=2.0.2,<2.1.0, installed: 2.0.2]
  - jsonschema [required: >=2.6.0,<3.1.0, installed: 2.6.0]
  - murmurhash [required: >=0.28.0,<1.1.0, installed: 1.0.2]
  - numpy [required: >=1.15.0, installed: 1.16.4]
  - plac [required: >=0.9.6,<1.0.0, installed: 0.9.6]
  - preshed [required: >=2.0.1,<2.1.0, installed: 2.0.1]
    - cymem [required: >=2.0.2,<2.1.0, installed: 2.0.2]
  - requests [required: >=2.13.0,<3.0.0, installed: 2.18.4]
    - certifi [required: >=2017.4.17, installed: 2018.1.18]
    - chardet [required: >=3.0.2,<3.1.0, installed: 3.0.4]
    - idna [required: >=2.5,<2.7, installed: 2.6]
    - urllib3 [required: >=1.21.1,<1.23, installed: 1.22]
  - srsly [required: >=0.0.5,<1.1.0, installed: 0.0.6]
  - thinc [required: >=7.0.2,<7.1.0, installed: 7.0.4]
    - blis [required: >=0.2.1,<0.3.0, installed: 0.2.4]
      - numpy [required: >=1.15.0, installed: 1.16.4]
    - cymem [required: >=2.0.2,<2.1.0, installed: 2.0.2]
    - murmurhash [required: >=0.28.0,<1.1.0, installed: 1.0.2]
    - numpy [required: >=1.7.0, installed: 1.16.4]
    - plac [required: >=0.9.6,<1.0.0, installed: 0.9.6]
    - preshed [required: >=1.0.1,<2.1.0, installed: 2.0.1]
      - cymem [required: >=2.0.2,<2.1.0, installed: 2.0.2]
    - srsly [required: >=0.0.5,<1.1.0, installed: 0.0.6]
    - tqdm [required: >=4.10.0,<5.0.0, installed: 4.32.1]
    - wasabi [required: >=0.0.9,<1.1.0, installed: 0.2.2]
  - wasabi [required: >=0.2.0,<1.1.0, installed: 0.2.2]

kasra-hosseini commented 5 years ago

Python packages that are currently used in our LwM notebooks:

spacy
spacy_langdetect
gensim
nltk
mpi4py
psycopg2 (to access our Postgres DB on Azure, I don't think we need this on DSH)
pandas
matplotlib
numpy

jamespjh commented 5 years ago

We will need psycopg2, because we’ll be instantiating another database inside the DSH with a complete full text extract.

(So that code first authored against the DB with a subsample can then be reused against the full dataset with minimal change.)

kasra-hosseini commented 5 years ago

@jamespjh Thanks.

Here are some other python packages that Giovanni used to ingest metadata/text into the DB:

pg8000
bs4, BeautifulSoup
multiprocessing

jemrobinson commented 5 years ago

Let's keep this issue about the Tier-3 python whitelist. Discussion about packages needed by Living with Machines can go on #337 .

jemrobinson commented 5 years ago

NB. @jamespjh : pipdeptree only works for installed packages. I'm not sure we want to install all the packages from our whitelist (so that we can list their dependencies) every time we deploy a compute VM.

martintoreilly commented 5 years ago

@jemrobinson https://libraries.io looks like it has dependency information. @darenasc is using this in PR #325 to get information about whether packages meet our proposed "default to approve" whitelist criteria, though he hasn't done anything we dependency gathering yet.

martintoreilly commented 5 years ago

@jemrobinson @jamespjh Any option that relies on a package being installed means that we are only guaranteed to get the dependencies of the installed version of a package. If dependencies have changed over time we may not catch all dependencies required for all versions of a package.

Considering this, we probably want to evaluate each version of a package against our whitelist criteria. @jemrobinson Can we whitelist by version using bandersnatch?

darenasc commented 5 years ago

I'm planning to push the dependency workflow soon. Taking the list of packages from the dependency column, and run the validation on those packages and on its dependencies iteratively.

jemrobinson commented 5 years ago

OK, we need a decision very soon (today if possible) about which packages to include in the whitelist. It doesn't have to be the final word, but it should be a reasonable first guess, since changing it will involve redeploying the package mirrors. Will this be possible @darenasc ?

darenasc commented 5 years ago

Yes @jemrobinson, working on that now. Will push asap.

jemrobinson commented 5 years ago

Hi @darenasc - any update on this? Just to make sure we're on the same page - for this week, we only want a list of all python packages that we already install plus all of their dependencies - we're not so worried about whether or not they pass the criteria of #312. Are you able to produce this?

darenasc commented 5 years ago

hi @jemrobinson yeah, it is possible.

I'm implementing the dependencies tree for each package, I think that info is what you need.

Do you have a list with the package names? Otherwise, it will start with the list of packages from Anaconda. The output is an excel file with one row per package.

jemrobinson commented 5 years ago

All the files in here: https://github.com/alan-turing-institute/data-safe-haven/tree/master/new_dsg_environment/azure-vms/package_lists

darenasc commented 5 years ago

hi @jemrobinson this script generates the same package lists but adding dependencies https://github.com/alan-turing-institute/data-safe-haven/tree/312-external-packages-validation/new_dsg_environment/azure-vms/package_lists

The files are named *-with-dependencies.list. In the list, the packages with \tab are the dependencies of the previous package without \tab.

For example, in this case numpydoc requires Jinja2 and sphinx, but Jinja2 requires MarkupSafe so all three are included.

numpydoc
    Jinja2
    MarkupSafe
    sphinx

martintoreilly commented 5 years ago

Thanks @darenasc. I believe that some packages have different names on PyPI than on Conda. How are you handling this?

martintoreilly commented 5 years ago

@darenasc Are you resolving dependencies for all versions of each package available on PyPI, just the most recent version or something else? We don't need all versions for the initial whitelist I don't think.

darenasc commented 5 years ago

I'm querying libraries.io/pypi directly with the package name. I extract the data if the page exists. It doesn't have any other filters. So it's checking the information from PYPI only.

An extension could be querying GitHub https://help.github.com/en/articles/listing-the-packages-that-a-repository-depends-on checking for a requirements.txt or querying the dependency graph from the api.

darenasc commented 5 years ago

It's checking the dependencies for the latest version. The versions can be included to the list of packages with dependencies.

jemrobinson commented 5 years ago

@darenasc : that's great. In fact we don't need one *-with-dependencies.list for each of those lists, but only one combined one. Maybe this can cut down on the number of checks you have to do, since you can concatenate all the lists into one master list before running the dependency check?

martintoreilly commented 5 years ago

Would it be easy to use the version for each dependency when checking its dependencies? That way we would at least be sure that the latest versions of the top-level packages in the whitelist would install.

Also, where does the dependency resolution info come from? Does it use data from both requirements.txt and setup.py? Does it matter if it doesn't?

darenasc commented 5 years ago

@jemrobinson sure, I'll query all the dependencies from one concatenated list. I can push that tonight.

darenasc commented 5 years ago

Would it be easy to use the version for each dependency when checking its dependencies? That way we would at least be sure that the latest versions of the top-level packages in the whitelist would install.

@martintoreilly actually to get the dependencies form the libraries.io API I need the version so first I get the latest version of the library and then query the API for dependencies. Yeah I can add the version per package to the output file.

Also, where does the dependency resolution info come from? Does it use data from both requirements.txt and setup.py? Does it matter if it doesn't?

The dependency data is coming from the libraries.io API https://libraries.io/api#project-dependencies I don't know how that information is generated. I'm not querying any content of files in the repository atm to keep it simple.

jemrobinson commented 5 years ago

OK, I've taken @darenasc's lists and manually combined them to give the following:

Full list of packages from the conda/pip lists

``` Jinja2 MarkupSafe NavPy Werkzeug absl-py aero-calc appdirs astor astropy atari-py atomicwrites attrs automat backports backports.functools-lru-cache basemap beautifulsoup4 bitarray bkcharts blaze bleach bokeh bottleneck box2d-py click cloudpickle colorama configparser contextlib2 convertdate cython dask datashape defusedxml distributed dtw entrypoints enum34 ephem fbprophet flask folium funcsigs functools32 gast geopandas gmplot gmpy2 gpy gpyopt graph-tool grpcio gym heapdict holidays html5lib ipykernel ipython ipython-genutils ipywidgets itsdangerous jinja2 jsonschema jupyter jupyter-console jupyter-core jupyter_client jupyterhub jupyterhub-ldapauthenticator jupyterlab keras lunardate lxml markdown matplotlib mistune mock monocle mpi4py mpmath multipledispatch nbconvert nbformat networkx nltk nose notebook numba numpy numpy-base numpydoc pandas pandas-datareader pandas-profiling pandasql pandocfilters paramz pbr pillow pint plotly prompt_toolkit protobuf pyLDAvis pycodestyle pycosat pyglet pygments pygpu pygrib pymc3 pyod pyopengl pyopengl-accelerate pyrsistent pystan pytables pytest python-blosc python-dateutil python-geohash python-louvain pytorch pyyaml pyzmq qtconsole r-irkernel rpy2 scikit-image scikit-learn scipy seaborn setuptools setuptools-git singledispatch six soupsieve spacy sphinx sqlite sympy tensorflow-gpu termcolor testpath torchvision tornado traitlets tsfresh twisted unicodecsv wcwidth webencodings werkzeug wordcloud xlrd xlsxwriter ```

Packages that have a different name on pip

``` automat -> Automat bottleneck -> Bottleneck cython -> Cython datashape -> DataShape flask -> Flask gpy -> GPy gpyopt -> GPyOpt heapdict -> HeapDict ipython-genutils -> ipython_genutils jinja2 -> Jinja2 jupyter-console -> jupyter_console jupyter-core -> jupyter_core keras -> Keras markdown -> Markdown pillow -> Pillow pint -> Pint pygments -> Pygments pyopengl -> PyOpenGL pyopengl-accelerate -> PyOpenGL-accelerate pytables -> tables python-blosc -> blosc pyyaml -> PyYAML sphinx -> Sphinx twisted -> Twisted werkzeug -> Werkzeug xlsxwriter -> XlsxWriter ```

numpy-base is a conda package that does not exist on pip
pygpu is a conda package that does not exist on pip
r-irkernel is an R package (not python)
sqlite is a binary package (not python)

We should re-run this after we've added the packages requested by LwM and DSSG.

martintoreilly commented 5 years ago

From @darenasc: @martintoreilly actually to get the dependencies form the libraries.io API I need the version so first I get the latest version of the library and then query the API for dependencies. Yeah I can add the version per package to the output file.

Great. Can I just confirm that this means we are querying the API for the specific dependency version required by each package (and so on down the dependency tree). i.e. If Numpy requires depA-1.2 and SciPy requires depA-2.3, we end up with the dependencies for both these versions in our exploded depedency list?

martintoreilly commented 5 years ago

@jemrobinson What makes the expandable drop-down arrow in markdown? Is it the <p> html tag?

jemrobinson commented 5 years ago

@martintoreilly: it's the <details> tag. The thing that gets shown is the <summary> tag.

darenasc commented 5 years ago

From @martintoreilly If Numpy requires depA-1.2 and SciPy requires depA-2.3, we end up with the dependencies for both these versions in our exploded depedency list?

For a specific version of a package we are getting the minimal version of its dependency. i.e. Numpy will say depA >=1.2 and SciPy will say depA >=2.3 probably getting the latest version of a dependency, depA-2.3, would be solve most of the problems.

i.e. If we have a package A that depends on package B >=0.1 the script is getting the dependencies for the latest version of package B and so on.

darenasc commented 5 years ago

Packages that have a different name on pip

@jemrobinson I can add those packages to a re run of the script and put the results in one output file.

martintoreilly commented 5 years ago

For a specific version of a package we are getting the minimal version of its dependency. i.e. Numpy will say depA >=1.2 and SciPy will say depA >=2.3 probably getting the latest version of a dependency, depA-2.3, would be solve most of the problems.

i.e. If we have a package A that depends on package B >=0.1 the script is getting the dependencies for the latest version of package B and so on.

Sorry, just to be totally clear. In the above example would we end up with all dependencies that are in either depA v1.2 or depA v2.3?

darenasc commented 5 years ago

Not sure if I get the question but if a package such as depA is required by different packages, they might require different versions >=1.2 (v1.2 or higher) or >=2.3 (v2.3 or higher) in that case, the script will look only for the latest version of depA and check its dependencies, it won't check for dependencies of previous versions such as 1.2 as that is the oldest possible version required and will check for the newest/latest version instead, this should comply the requirement.

There might be some cases where previous versions of depA requiered other dependencies (that are no longer requiered by the latest version), this is not noted by the script, but an exhaustive seach on all the previous versions or subset of them and their dependencies can be implemented and cover more cases if needed. The previous versions are in the API and can be queried separately.

jemrobinson commented 5 years ago

Yes, I think @martintoreilly was asking about the following situation:

packageA (version 1.2) has dependencies

packageB
packageC

packageA (version 1.3) has dependencies

packageB
packageD

In this case we'd want all of packages A, B, C and D on the whitelist.

jemrobinson commented 5 years ago

:warning: This dependency following process did not catch that pyparsing is a dependency of matplotlib. Can you look into how this fell through the gaps, @darenasc ?

darenasc commented 5 years ago

@jemrobinson if packageA doesn't has a version in the list, it will add the dependencies of its latest version, packageB and packageD. Although, is s not a problem to check dependencies for packageA v1.2 if the version is given or appears in the .list file.

@jemrobinson yes, will check the pyparsing case.

darenasc commented 5 years ago

⚠️ This dependency following process did not catch that pyparsing is a dependency of matplotlib. Can you look into how this fell through the gaps, @darenasc ?

I found the reason why it didn't includepyparsing among the dependencies and it's because in libraries.io it is not mentioned as a dependency of matplotlib.

In github https://github.com/matplotlib/matplotlib/network/dependencies pyparsing indeed appears as a dependency.

In github.com the dependency information is parsed from requirements.txt and pipfile.lock and sometimes from setup.py https://help.github.com/en/articles/listing-the-packages-that-a-repository-depends-on

darenasc commented 5 years ago

From @jemrobinson @darenasc : that's great. In fact we don't need one *-with-dependencies.list for each of those lists, but only one combined one. Maybe this can cut down on the number of checks you have to do, since you can concatenate all the lists into one master list before running the dependency check?

In python-packages-plus-dependencies.list there is single list of python packages plus dependencies taken from libraries.io based on the packages in the .list files. It is generated with this script.

jemrobinson commented 5 years ago

@darenasc : this list still doesn't have pyparsing. Are you using libraries.io to generate the list? If so, maybe this needs to be expanded to use the requirements.txt information as well?

darenasc commented 5 years ago

@jemrobinson yes, I'm planning to get dependencies from github as well.

To do that we need to:

Get the github url of the repo from libraries.io, anaconda packages.
Get the last commit and sha key
Get the tree from the sha key of the last commit
Get the requierements.txt, pipfile.lock and setup.py files from the repo if they exist.

And then parsing them. Should be straightforward but requires a bit of time to implement, I'll try to have it asap.

vollmersj commented 5 years ago

Why don't we just mirror the package that satisfy criteria if dependcies don't work out though look

Sebastian Vollmer

Send from phone

From: Diego Arenas notifications@github.com Sent: Thursday, June 13, 2019 5:32:38 PM To: alan-turing-institute/data-safe-haven Cc: Vollmer, Sebastian; Mention Subject: Re: [alan-turing-institute/data-safe-haven] Implement Tier-3 whitelist (#328)

@jemrobinsonhttps://github.com/jemrobinson yes, I'm planning to get dependencies from github as well.

To do that we need to:

Get github url of the repo from libraries.io, anaconda packages.
Get last commit and sha
Get tree of the sha of the last commit
Get requierements.txt, pipfile.lock and setup.py from the repo.

And then parsing them. Should be straightforward but requires a bit of time to implement, I'll try to have it asap.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/alan-turing-institute/data-safe-haven/issues/328?email_source=notifications&email_token=ADAHMB72BAALQ5YJX2Q2EP3P2JZCNA5CNFSM4HTWKTP2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXUIQZA#issuecomment-501778532, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ADAHMBY2BHIHYAXVUC2UZVLP2JZCNANCNFSM4HTWKTPQ.

darenasc commented 5 years ago

@jemrobinson @vollmersj there is a shorter way to get the dependencies https://libraries.io/api#repository-dependencies it shouldn't be a problem to expand the current script.

jemrobinson commented 5 years ago

Remaining issues in the whitelist

duplicated packages

ipython_genutils and ipython-genutils jupyter_console and jupyter-console jupyter_core and jupyter-core

non-existent packages

numpy-base is not a pip-installable package pygpu is not a pip-installable package pytables is not the package name on pip python-blosc is not the package name on pip r-irkernel is not a python package sqlite is not a python package

martintoreilly commented 5 years ago

@vollmersj Just a reminder that this isn't a blocker for the requested DSSG packages as all packages on the current whitelist will be preinstalled on the compute VM.

We therefore have some time to get this to the point where we can reasonably expect most packages added to the whitelist to be successfully installable from the Tier 3 whitelisted mirror (including their dependencies for at least the latest version).

darenasc commented 5 years ago

@jemrobinson this python-final.list list contains the original packages and all its dependencies in pip, including the ones with different names.

This is the script that searches iteratively. From 180 initial packages stops at iteration 7 with 792 packages when there are no new packages among the dependencies of dependencies.

alan-turing-institute / data-safe-haven