anaconda / anaconda-project

Tool for encapsulating, running, and reproducing data science projects
https://anaconda-project.readthedocs.io/en/latest/
Other
221 stars 88 forks source link

Templating for commands #244

Closed jlstevens closed 4 years ago

jlstevens commented 4 years ago

The primary goal of this PR is to highlight some pain points in the current design of anaconda-project and to propose one approach that makes anaconda-project more flexible for users and easier to maintain for maintainers (with an example of solving a concrete problem).

The problem

Currently anaconda-project supports three types of command: unix, notebook and bokeh. The unix command is the simplest of these, allowing any command to be run based on the string in the anaconda-project.yml. If any extra arguments are supplied to anaconda-project run <command>, these are simply appended to the command.

The bokeh and notebook commands are aimed to make two particular use cases easier. These commands run servers that need extra information (e.g allowed ports) in order to work correctly across environments. If supports_http_options is True, these commands convert additional arguments (namely -anaconda-project-host', --anaconda-project-address, --anaconda-project-port, --anaconda-project-url-prefix, --anaconda-project-no-browser, --anaconda-project-iframe-hosts and --anaconda-project-use-xheaders) into the appropriate format needed to run the bokeh or notebook server respectively.

Unfortunately, this approach of explicitly supporting different types of tool in anaconda-project itself is not maintainable long term. For instance, the bokeh command only works for fairly old versions of Bokeh (around 0.6.7?) as the command-line interface for specifying these options has since changed. Even if the _BokehArgsTransformer class were updated accordingly to work with recent versions of Bokeh, then older versions of bokeh would no longer be supported as a result. For general support across versions, each version of each command line interface for each supported tool (i.e notebook and bokeh) would need to have its own unique command name for use in anaconda-project.yml.

Right now, there are several ways you can hope to tackle this problem:

  1. Add a command for each command-line interface. This seems quite unwieldy and each time a command-line interface changes (or a new type of command is added), users need to wait for a new anaconda-project release to be available and installed in all contexts where it needs to be used. It is also a burden for the maintainers of anaconda-project to support all the various tools.

  2. Assume that the tools themselves will support the arguments supplied by anaconda-project. This is how support for older versions of Bokeh operated but in general it is unlikely that external projects will decide to support the necessary flags. As a concrete example, running a voila server requires the extra HTTP options, but instead of --anaconda-project-port, the required argument is --port. It is unlikely that voila will ever support --anaconda-project-port, directly.

  3. Use a wrapper script or a command-line trick to munge the arguments into the required format. The appeal of this approach is that users can use this workaround at any time without waiting for any software to be released. Writing a wrapper script is annoying, especially as it means including an extra file in the project to supply the extra level of indirection needed.

    An example of using a special unix command can be seen in this project running voila server. To get this working, I had to develop a rather ugly and fragile awk command:

    commands:
    dashboard:
    unix: awk 'BEGIN{gsub("--anaconda-project-port", "--port ", ARGV[3]); gsub("--anaconda-project-address",
      " --Voila.ip=",ARGV[9]); system("voila " ARGV[3] ARGV[4] ARGV[9] ARGV[10] "
      voila_gpx_viewer.ipynb" )}'
    supports_http_options: true

    This command is fragile because it depends on the order of the --anaconda-project-X arguments and using awk like this is not user friendly. That said, this option is currently the best one as it doesn't require any changes in voila or anaconda-project and works whether the extra arguments are supplied or not (i.e anaconda project run dashboard works fine locally).

Proposed solution

This PR proposes that the unix command support a templating mechanism allowing users to control how their commands are invoked in a flexible way. Initially I considered simple Python format strings for templating but then decided that jinja2 support would make the templating approach a lot more powerful, primarily because jinja2 supports default values. Templating would help avoid problems 1 and 2 above by formalizing a more user friendly way to take the approach in option 3.

jinja2 is an obvious choice as it is probably the most common templating language used in Python (it is already used in parts of the conda ecosystem; e.g you can use jinja2 templating in conda-build). Using an existing templating language that is well tested and documented seems to be a much more maintainable approach than inventing a new DSL.

The simplest equivalent to the awk command above in jinja2 would be:

commands:
  dashboard:
    unix: voila --port {{anaconda_project_port}} --Voila.ip={{anaconda_project_address}}

This is not quite as good as the awk command in one respect: if you don't supply any arguments (e.g anaconda run dashboard), the --port and --Volia.ip flags will be missing their values. This can be addressed using jinja2 defaults. You could specify something a little more sophisticated to handle this problem:

commands:
  dashboard:
    unix: voila --port {{anaconda_project_port|default(8890)}} --Voila.ip={{anaconda_project_address|default('localhost')}}

This is still a lot nicer than the awk command and explicit defaults are quite nice for reproducibility (e.g. this allows you to know which default settings are actually being used locally from the anaconda-project.yml).

Approach

This PR supports the jinja2 examples above but there are a few things that need to be considered carefully:

92d34a6: In this commit, the existing parsing of the HTTP options (HTTP_SPECS) is used so that only these specific arguments can be templated in. Templating currently only occurs for unix commands when supports_http_options is False.

Note that when templating is used, the information is no longer passed to the commands in the form of flags (it is up to the user to use the information made available in the template if and as they wish).

A small amount of logic is needed to turn the flags into valid Python identifiers for jinja2 (removing the -- prefix and turning dashes into underscores). The HTTP option switches are supported with --anaconda-project-no-browser and --anaconda-project-iframe-hosts getting assigned the value of True if present (i.e. treated as a boolean flags that can only be True).

d769ead: In this commit unrecognized extra arguments are also made available for templating. Right now the parsing approach is extremely simplistic: once the HTTP options are removed, the remaining arguments are expected to come in --flag value pairs (i.e. these become key:value pairs for templating). These arbitrary flags then become the corresponding identifiers available for jinja2 templating. Unknown switches (e.g --foo by itself) are currently not supported.

Other things you can do with jinja2 templating

Substitution and default values have already been demonstrated above. Two other use cases I have thought about are:

Switches: You can transform the HTTP switches into other switches with something like {% if anaconda_project_use_xheaders %} --xheaders {% endif %}

Inverted switches: In Bokeh, instead of the --no-browser switch, you have the inverse concept , namely --show. You can automatically add --show if the --no-browser switch is not present with something like {% if anaconda_project_no_browser is not defined %} --show {% endif %}

Future ideas and other considerations

Obviously, if we take this approach, jinja2 will become a dependency of anaconda-project. The PR hasn't yet done anything to reflect this so jinja2 will need to be installed if you wish to test this PR.

Some other thoughts:

Outstanding tasks

mcg1969 commented 4 years ago

I love the idea, as well as Jim's suggestion.

jlstevens commented 4 years ago

We could consider also replacing "anaconda-project-" with "" to simplify the user-written template code.

I am against replacing those options but I just realized there is not much cost involved for supporting both. I can check if something like --port hasn't been supplied at the command line, in which case I can supply port as an alias to the template from --anaconda-project-port. If --port is explicitly supplied, that would take precedence.

As a concrete example, this would mean the template value of {{port}} would come from --port if you ran anaconda project run command --anaconda-project-port 8888 --port 9999 (i.e it would be 9999 and you can still use {{anaconda-project-port}} to get 8888). You could still use {{port}} in your template if you dropped --port 9999 bit, in which case the port value would be 8888, coming from anaconda-project-port. Does that make sense? Maybe there is a downside to this idea that I am not seeing?

I love the idea, as well as Jim's suggestion.

Great! I'll go ahead and fix the tests then (probably down to declaring jinja2 properly as a new dependency).

jbednar commented 4 years ago

That approach seems fine to me, though I'd be fine with the simplest approach of simply eliminating that string when reading the arguments, such that --anaconda-project-port 6066 --port 6067 would be undefined and not recommended (precisely the same as --port 6066 --port 6067 would also be undefined). Anyway, however you want to handle that is fine.

jbednar commented 4 years ago

Regarding your discussion points,

In this commit unrecognized extra arguments are also made available for templating. Right now the parsing approach is extremely simplistic: once the HTTP options are removed, the remaining arguments are expected to come in --flag value pairs (i.e. these become key:value pairs for templating). These arbitrary flags then become the corresponding identifiers available for jinja2 templating. Unknown switches (e.g --foo by itself) are currently not supported.

I think this is fine as a long-term solution; key:value pairs can express anything people need to control externally, and can be used to generate any type of command-line arguments a target command needs.

In future, we might want to supply the template information in alternative formats, such as a JSON dictionary.

Sure; the important thing is that the templates in the .yml file won't need to change, regardless.

The parsing of the unknown argument flags is currently very simple and does not try to guard against various types of user error. Only flags using a -- prefix are supported. Having support only for the HTTP option switches is somewhat annoying.

I really don't mind not supporting switches; key:value is so easy to explain and understand that it seems fully sufficient, to me. I'm happy treating anything that doesn't fit into the accepted format as a clear error.

  • Templating is only used when unix commands are used and supports_http_options is False. I believe that templating could remove the need for the bokeh and notebook commands and remove the need for supports_http_options (if the template doesn't use the HTTP arguments, it doesn't matter if they are always supplied or not).

As you know, I favor porting anaconda-project's functionality into conda itself, and at that point only the templating support would be needed; the other args transformers could be left here in this project at that point.

It might be nice to show an example here of how the current notebook support could be implemented using template; I can see that it handles the Bokeh support, but am not sure what different types of magic are currently done for notebook support.

jlstevens commented 4 years ago

The tests are now passing. For now, I am not supporting extra arguments for templating such as --anaconda-project-foo as I wanted to try to simplify the code (and get the tests passing) before adding new code. I still think having to have the --anaconda-project prefix is rather awkward but it is probably worth supporting more than just the HTTP options to demonstrate the generality of the approach.

jlstevens commented 4 years ago

@AlbertDeFusco Looking at the Appveyor history I see you are trying to debug Appveyor too! Your last commit has the same conda.CondaMultiError: file could not be opened successfully error that I'm seeing here. Any leads/ideas on what is going wrong here?

AlbertDeFusco commented 4 years ago

I don't know what's wrong with Appveyor. It could be the version of Conda

jlstevens commented 4 years ago

I have found a different way to install conda (using standalone conda which should also be a bit faster than using miniconda) that fixes testing on Appveyor.

All the tests should now go green. I'm not quite finished with this PR though! I need to add more unit tests of the templating in general, including tests that check that everything is working properly on Windows (I had to effectively disable the current unit test on Appveyor).

jlstevens commented 4 years ago

As of 43c2124, I've added all environment variables available for templating. This means that the example in this section of the docs can use jinja2 instead of bash/windows syntax. In particular:

commands:
  default:
    unix: "python ${PROJECT_DIR}/analyze.py"
    windows: "python %PROJECT_DIR%\analyze.py"

Can now be:

commands:
  default:
    unix: "python {{PROJECT_DIR}}/analyze.py"
    windows: "python {{PROJECT_DIR}}\analyze.py"

This avoids people from having to know about one way to inject environment variables and another way to inject template arguments from the command line.

jlstevens commented 4 years ago

The --anaconda-project-host option has revealed a general problem, namely the issue with representing flags that can (optionally) be supplied multiple times.

As a concrete example, if you supply multiple --anaconda-project-host values to anaconda-project, these values need to be mapped to multiple --allow-websocket-origin commands when running bokeh server.

There are several ways we can approach this, all of which have pros and cons:

  1. Always represent options that can have multiple values as arrays e.g host would be an array and you can use host[0] in jinja2 to index the first item in the array etc. Pros: Consistency e.g a jinja2 loop can be used expand as many --allow-websocket-origin flags as needed. Cons: If the common case is to have only one host, then having to always specify host[0] is annoying and means users need to learn concepts they may not be relevant to their common needs.

  2. Switch to an array when multiple values are supplied e.g host is just a string if a single value is specified, otherwise it is an array of strings. Pros: Makes the common case easy and means people only need to know that this variable may be an array version if they encounter the case where multiple hosts are specified Cons: Lacks consistency as the type of host can change from a string to an array. A single jinja2 loop won't be able to be general and handle all cases (i.e one or more).

  3. Have a singular and plural form of the variable where the singular form is the last element of the array and the plural form is the array e.g host is always one host and hosts is the list of hosts. Pros: Once you have chosen host or hosts the type stays consistent. Most people only need to know about host and not hosts. Cons: Expands the namespace of variables used for templating.

In addition to all this, there is the question of whether you could use environment variables to achieve this instead. Bokeh already supports an BOKEH_ALLOW_WS_ORIGIN environment variable which can accept a comma separated list of hosts. The question then is whether we could turn environment variables into arrays for jinja2 templating in a general and consistent way. I don't think this is possible as there is no universal separate or convention for specifying environment variables that are arrays that I know of.

For the specific case of anaconda-project-host, I have opted for number 3 (i.e to offer hosts as an array as well as just host). I think this is the best approach for this specific use case but I am not sure it would be a good thing to generalize (i.e to pluralize all variable names to offer arrays, just in case multiple values are specified).

jlstevens commented 4 years ago

I've now updated the documentation to describe templating. @mcg1969 @jbednar I suppose we will only want to trigger readthedocs to update the website when the next version of anaconda-project is released? If it is possible to do a dev build of the docs first, that might also be a good idea.

I would also like to do a pass over the entire docs to fix all the references to the bokeh command which now only works with very old versions of Bokeh and replace them with templated equivalents. Unfortunately, there are a lot of such examples so I think such a pass over the docs should be done in a separate PR.

jlstevens commented 4 years ago

Here is what can replace thebokeh command using templating (using a unix command instead):

bokeh {% for host in hosts %} --allow-websocket-origin {{host}} {% endfor %}\
      {{'--port   %s' % port       if port         is defined}}\
      {{'--prefix %s' % url_prefix if url_prefix   is defined}}\
      {{'--use-xheaders'           if use_xheaders is defined}}\
      {{'--show'                   if no_browser   is not defined}}  <file-or-dir>

This should work with modern versions of Bokeh while the built in bokeh command does not.

While this does look a little uglier, I think that overall this is fairly legible and entirely explicit. More importantly, it is easier for users to update their template in their anaconda-project.yml files if Bokeh changes than having to update anaconda-project itself.

jbednar commented 4 years ago

Looks great to me. Verbose, but very explicit and shows exactly how you'd support any other type of server that you want to use.

jlstevens commented 4 years ago

@jbednar Here is most of what is needed for a template equivalent of a notebook command:

notebook {{'--port   %s' % port                     if port         is defined }}\
         {{'--ip     %s' % address                  if address      is defined}}\
         {{'--NotebookApp.trust_xheaders=True'      if use_xheaders is defined }}\
         {{'--no-browser'                           if no_browser   is defined }}\
         {{'--NotebookApp.base_url=%s' % url_prefix if url_prefix   is defined}}\
         --NotebookApp.default_url=/notebooks/example.ipynb

This doesn't do two things:

Unfortunately, a crazy level of ugly escaping is going to be required for the latter. I believe this should work:

notebook {{'--port   %s' % port                     if port         is defined }}\
         {{'--ip     %s' % address                  if address      is defined}}\
         {{'--NotebookApp.trust_xheaders=True'      if use_xheaders is defined }}\
         {{'--no-browser'                           if no_browser   is defined }}\
         {{'--NotebookApp.base_url=%s' % url_prefix if url_prefix   is defined}}\
         {{'--NotebookApp.tornado_settings={\\\\\\"headers\\\\\\"\':\'{\\\\\\"Content-Security-Policy\\\\\\"\':\'\\\\\\"frame-ancestors\\\\  self\\\\ %s\\\\\\"}}' % iframe_hosts if iframe_hosts is defined }}\
         --NotebookApp.default_url=/notebooks/example.ipynb

Right now it looks like templating can get us where we need to be for processes like flask, Voila and Bokeh but additional work will be needed to get a clean templating approach for notebooks that support iframe hosts. Any suggestions?

jlstevens commented 4 years ago

Some ideas of my own:

jlstevens commented 4 years ago

@mcg1969 I think this PR is ready for a review pass when you get a chance to look at it.

jlstevens commented 4 years ago

As a sanity check, I can confirm that the templating (including use of environment variables) works on Windows. However, jinja2 needed manual installation which is now listed as an explicit dependency in #251.