Install additional Python packages #146

Closed: cathiest closed this issue 5 years ago

cathiest commented 5 years ago

Let's follow the process in #101 / #87 as to whether this should be installed after everyone leaves at 6pm for the social!

jemrobinson commented 5 years ago

Looks like the packages from the default conda package list aren't installed by default, so we need to specify the full list of required packages more carefully. @martintoreilly do we have a list somewhere?

jemrobinson commented 5 years ago

Current python 2.7 packages

jemrobinson commented 5 years ago

Current python 3.5 packages

vollmersj commented 5 years ago

Below might be useful

Cython bisect linecache secrets IPython bleach locale select PyQt5 builtins logging selectors future bz2 lzma send2trash _ast cProfile macpath setuptools _asyncio calendar macurl2path shelve _bisect certifi mailbox shlex _blake2 cgi mailcap shutil _bootlocale cgitb markupsafe signal _bz2 chardet marshal simplegeneric _codecs chunk math sip _codecs_cn cmath matplotlib sipconfig _codecs_hk cmd mimetypes sipdistutils _codecs_iso2022 code mistune site _codecs_jp codecs mmap sitecustomize _codecs_kr codeop modulefinder six _codecs_tw collections multiprocessing sklearn _collections colorsys nbconvert smtpd _collections_abc compileall nbformat smtpd2 _compat_pickle concurrent netrc smtplib _compression configparser networkx sndhdr _crypt contextlib nis snowballstemmer _csv copy nntplib socket _ctypes copyreg nose socketserver _ctypes_test crypt notebook sphinx _curses csv ntpath sqlite3 _curses_panel ctypes nturl2path sre_compile _datetime curses numbers sre_constants _dbm cycler numpy sre_parse _decimal cython numpydoc ssl _dummy_thread cythonmagic opcode stat _elementtree datetime operator statistics _functools dateutil optparse storemagic _gdbm dbm os string _hashlib decimal packaging stringprep _heapq decorator pandas struct _imp difflib pandocfilters subprocess _io dis parser sunau _json distutils parso symbol _locale doctest past sympyprinting _lsprof docutils pathlib symtable _lzma dummy_threading patsy sys _markupbase easy_install pdb sysconfig _md5 email pexpect syslog _multibytecodec encodings pickle tabnanny _multiprocessing ensurepip pickleshare tarfile _opcode entrypoints pickletools telnetlib _operator enum pip tempfile _osx_support errno pipes terminado _pickle event_rpcgen pkg_resources termios _posixsubprocess faulthandler pkgutil test _pydecimal fcntl platform testpath _pyio filecmp plistlib tests _random fileinput poplib textwrap _scproxy fnmatch posix theano _sha1 formatter posixpath this _sha256 fractions pprint threading _sha3 ftplib profile tick 
_sha512 functools prompt_toolkit time _signal future pstats timeit _sitebuiltins gc pty tkinter _socket genericpath ptyprocess token _sqlite3 getopt pwd tokenize _sre getpass py_compile tornado _ssl gettext pybasicbayes tqdm _stat glob pyclbr trace _string grp pydoc traceback _strptime gzip pydoc_data tracemalloc _struct h5py pyexpat traitlets _symtable hashlib pygments tty _sysconfigdata_m_darwin_darwin heapq pyhawkes turtle _testbuffer hmac pylab turtledemo _testcapi html pymc types _testimportmultiple html5lib pymc3 typing _testmultiphase http pyparsing unicodedata _thread idlelib pytz unittest _threading_local idna pyximport urllib _tkinter imagesize qtconsole urllib3 _tracemalloc imaplib queue uu _warnings imp random venv _weakref importlib re warnings abc inspect readline wave aifc io reprlib wcwidth alabaster ipaddress requests weakref antigravity ipykernel resource webbrowser appnope ipykernel_launcher rlcompleter webencodings argparse ipyparallel rmagic wheel array ipython_genutils rst2html widgetsnbextension ast ipywidgets rst2latex wsgiref asynchat itertools rst2man xdrlib asyncio jedi rst2odt xml asyncore jinja2 rst2odt_prepstyles xmlrpc atexit joblib rst2pseudoxml xxlimited audioop json rst2s5 xxsubtype autograd jsonschema rst2xetex zipapp autoreload jupyter rst2xml zipfile babel jupyter_client rstpep2html zipimport base64 jupyter_core runant zlib bdb keyword runpy zmq bin lib2to3 sched
binascii libfuturize scipy
binhex libpasteurize seaborn

jemrobinson commented 5 years ago

Current python 3.6 packages

jemrobinson commented 5 years ago

Other specific requests: monocle (python) and seurat (R)

@cathiest - can you check this one? It seems more likely that we want this R package than this Python package.

ornithos commented 5 years ago

(python) The current packages don't look too bad. Following the Environment Design document, a very substantial omission (imo 'IMPORTANT') appears to be pandas, less important ('DESIRED') would be seaborn and statsmodels, and there may be others -- I haven't done a full cross-check with the default conda package list, but I know many who use these two (incl. myself). I wonder whether pytorch works without the pytorch module? torch/torchvision are there, which I think are the important ones, but may be worth testing this.

martintoreilly commented 5 years ago

Anaconda package selection strategy

Building a list of standard Python packages

  1. Any small set of explicitly requested packages in this issue that are "must haves" (sorry @vollmersj, but your list is long enough that it may cause the environment solve to fail, and we need a default set of packages working tomorrow, so it is on the "nice to have" list)
  2. The set of packages with "In installer" ticked on the Anaconda package lists (see lists below)
  3. The set of packages installed on the MS Data Science VM (get using pip freeze)
  4. The set of packages on @ornithos's normal Python environment

We will include all sources above and give priority to sources further up the list (i.e. lower numbered).
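As a rough illustration of the merge described above, here is a minimal Python sketch. The function names, the source labels and the package membership of each source are hypothetical and chosen only for the example; the real lists would come from this issue, the Anaconda "In installer" lists, `pip freeze` on the Data Science VM, and @ornithos's environment. It assumes Python 3.7+ so that dict insertion order is preserved.

```python
def names_from_pip_freeze(freeze_output):
    """Extract lower-cased package names from `pip freeze` style text.

    Only plain `name==version` lines are handled; editable installs
    (lines starting with "-") and comments are skipped.
    """
    names = []
    for line in freeze_output.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "-")):
            continue
        names.append(line.split("==")[0].strip().lower())
    return names


def merge_package_sources(sources):
    """Merge (source_name, package_names) pairs given highest priority first.

    Returns a dict mapping each package to the highest-priority source
    that asked for it, ordered by priority.
    """
    merged = {}
    for source_name, packages in sources:
        for package in packages:
            name = package.strip().lower()
            if name and name not in merged:
                merged[name] = source_name
    return merged


# Hypothetical inputs for illustration only.
vm_freeze = "numpy==1.15.4\nplotly==3.4.2\n"
sources = [
    ("explicit requests", ["pandas", "seaborn", "statsmodels"]),
    ("anaconda installer", ["numpy", "pandas", "scipy"]),
    ("data science vm", names_from_pip_freeze(vm_freeze)),
    ("ornithos environment", ["spacy", "plotly"]),
]
merged = merge_package_sources(sources)
```

Because the result is ordered by priority, the tail of the list is the natural place to start trimming if the environment solve fails.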

ornithos commented 5 years ago

Must have packages from my various environments are:

{pytorch, torchvision, torch}

and desirable packages:

jemrobinson commented 5 years ago

Hmm, that's weird @ornithos - pytorch is in the list of stuff we're explicitly asking conda to install. The lists above came from pip, so maybe this is just a conda vs. pip discrepancy? Anyway, if you look at, you can see the current status of the (much larger) list of packages we're installing for future builds. Let us know if you see anything missing there.

jemrobinson commented 5 years ago

FYI @martintoreilly, all of the packages listed by @ornithos are already on the lists except intelpython, plotly and spacy.

darenasc commented 5 years ago

For the NATS challenge the facebook prophet package fbprophet would be a nice one to have.

pip install fbprophet

martintoreilly commented 5 years ago

plotly and spacy are available in conda for all of our Python versions (2.7, 3.5 and 3.6) and have been added to the build.

intelpython and fbprophet are not available in conda for any of our supported Python versions. For now we are not supporting packages that are not in conda as part of the standard build, but we may be able to support them as part of a custom deploy for a particular challenge. I will add fbprophet to the NATS deploy.
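The triage above could be sketched in Python as follows. The availability table simply restates what is reported in this comment. The function name and the rule that a package must be available for every supported Python version to enter the standard build are assumptions made for the example.

```python
SUPPORTED_PYTHONS = ("2.7", "3.5", "3.6")


def split_by_conda_availability(availability, supported=SUPPORTED_PYTHONS):
    """Split packages into (standard build, custom deploy) lists.

    `availability` maps a package name to the set of Python versions for
    which conda provides it. Assumed rule: a package joins the standard
    build only if it is available for every supported version.
    """
    standard, custom = [], []
    for package, versions in sorted(availability.items()):
        if all(version in versions for version in supported):
            standard.append(package)
        else:
            custom.append(package)
    return standard, custom


# Availability as reported in the comment above.
availability = {
    "plotly": {"2.7", "3.5", "3.6"},
    "spacy": {"2.7", "3.5", "3.6"},
    "intelpython": set(),
    "fbprophet": set(),
}
standard, custom = split_by_conda_availability(availability)
```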

martintoreilly commented 5 years ago

Potential fallback (use Python 3.7)

I think that the base environment (which uses Python 3.7) may have the packages ticked as "In installer" on the Anaconda website. I have checked in the test environment and pandas appears to be installed at least.

cathiest commented 5 years ago

Hi, report from Team NATS that they need the following packages for the VM. Not sure if they are in your build already. Post a thumbs up if they are?

jemrobinson commented 5 years ago

Following Zoom discussion just now, can we clarify the minimal list of python packages that we absolutely have to install on top of what we already have? @fkiraly @cathiest @jamespjh @martintoreilly ? The smaller the better from the perspective of testing that everything works as expected.

martintoreilly commented 5 years ago

I see some packages are also being logged in issue alan-turing-institute/DSG-Dec18-issuelog/issues/6

martintoreilly commented 5 years ago

EDIT: Realised #147 is the pull request and consolidation should happen in this issue.

I think we should try to include the Anaconda "In installer" packages, but could easily be persuaded we should start with a minimal list first to at least get something deployed. The risk of this approach is we will then be deploying a third environment for users to migrate to.

I'd suggest a minimal explicit list from all DSG groups first, then try adding the Anaconda "in installer" packages while @fkiraly etc are testing the minimal set? This should give us the option to re-test the explicit list in the larger list VM and deploy that if there are no regressions.
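One lightweight way to re-test the explicit list on the larger VM is to check that every module can still be located. The sketch below is a minimal example using the standard library's `importlib.util.find_spec`; the function name and the module names in the usage line are placeholders. Note that import names can differ from package names (for example, scikit-learn is imported as `sklearn`), and finding a module does not prove it works, so this would complement rather than replace real testing.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised for some broken or partially installed packages.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Placeholder names: one standard-library module and one that should not exist.
missing = find_missing_modules(["json", "hypothetical_missing_module_xyz"])
```

Running the same check in the minimal and the larger environment and comparing the two results would show any regressions.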

martintoreilly commented 5 years ago

Updated the title of this issue to reflect wider scope

jemrobinson commented 5 years ago

I vote for starting with a minimal list and then expanding the scope afterwards. From I have come up with: basemap bokeh fbprophet geopandas gpflow keras matplotlib numpy pandas pandas_profiling scikit-learn seaborn tensorflow tsfresh, some of which are already available in the currently deployed VMs.

cathiest commented 5 years ago

From Alex Bird in

I've collated the above lists and given my best assessment of categories:


numpy pandas sklearn

  • matplotlib IMPORTANT & Easy

geopandas pandas_profiling tsfresh basemap seaborn IMPORTANT & can have compatibility issues / sometimes challenging to set up

gpflow fbprophet pystan #(fbprophet depends anyway) tensorflow keras NICE TO HAVE

bokeh pytorch #(surprised nobody's asked for this, but I guess tf is there)