JuliaLang / IJulia.jl

Julia kernel for Jupyter
MIT License

Per notebook MANIFESTS #673

Closed Keno closed 5 years ago

Keno commented 6 years ago

Now that 0.7 is getting closer, it may make sense to start thinking about how notebooks interact with the new package manager. I had discussed with @StefanKarpinski and @KristofferC that it would be great if notebooks could embed a MANIFEST, so that if you send somebody a notebook they could automatically load everything with the correct versions. Doing something like this would require figuring out where to store the information and how to hook it up to Pkg3, and would probably require some UI work as well.

stevengj commented 6 years ago

maybe with the contents API

StefanKarpinski commented 6 years ago

I believe that what's needed is an "environment protocol": i.e. instead of needing to actually have a project file and/or manifest file present, or a package directory, or load path array, one just needs to implement the environment protocol. Then the IJulia package can implement the protocol for notebooks that have environment information stored in them and voila, each notebook has its own environment. However, I think that work is a 1.x kind of thing: we now generally understand what the protocol needs to look like; the next step is to factor out the protocol part in such a way that the three kinds of environments that we already support are implementations of this protocol; after that we allow a notebook to implement the environment protocol as well.

The main thing to consider at this point is how to allow for extension in the future. Where is the hook? Do we have a Base.PACKAGE_ENVIRONMENT variable, which, if set, overrides the LOAD_PATH lookup? Or do we have some special name which can be put into the LOAD_PATH that causes loading to talk to the notebook instead?

The contents API seems like it may be a good way to stash the manifest information, but we don't really need something that emulates a file system—using a JSON store would actually be easier.

jlperla commented 6 years ago

A long-run solution is great to automatically embed the manifest/etc. But is it possible to have a short-term patch requiring a manual call to load something in the notebook itself? That is, something along the lines of

Pkg.setmanifest("Manifest.toml") #i.e., local to the notebook
using MyLib #i.e., the kernel is using the `Manifest.toml` now

Or maybe this is already possible with some of the Pkg3 commands in Jupyter?

StefanKarpinski commented 6 years ago

🤷‍♂️ maybe?

KristofferC commented 6 years ago

Can't you just activate the dir of the notebook? Then that notebook will use a separate environment that will be stored next to the notebook.

jlperla commented 6 years ago

To make sure I understand this, you think I may just be able to put a Manifest.toml in the notebook directory, then I should just need to run:

Pkg.activate(".")
using MyLib

If that is correct, I can try to have someone test it when IJulia is sufficiently stable with 0.7

KristofferC commented 6 years ago

You need a Project file as well. But yes if you do

Pkg.activate(".")

and then go wild with adding packages, those will be recorded in Project.toml and Manifest.toml along the notebook, and if you send these files to someone else, they can do

Pkg.activate(".")
Pkg.instantiate()

to install all the packages at the version you used them.
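The workflow above can be sketched as a small helper that refuses to activate a directory lacking a Project.toml. This is only an illustration: `environment_files` and `activate_shared_environment` are hypothetical names, not functions from Pkg or IJulia, and the sketch assumes the files are named exactly `Project.toml` and `Manifest.toml` (Julia also accepts `JuliaProject.toml` and `JuliaManifest.toml`, which are ignored here).

```julia
using Pkg

# Hypothetical helper: report which environment files exist in `dir`.
function environment_files(dir::AbstractString)
    return (project = isfile(joinpath(dir, "Project.toml")),
            manifest = isfile(joinpath(dir, "Manifest.toml")))
end

# Hypothetical helper: activate `dir` only if it carries a Project.toml,
# then optionally install the recorded package versions.
function activate_shared_environment(dir::AbstractString; instantiate::Bool = true)
    files = environment_files(dir)
    files.project || error("no Project.toml found in $dir")
    if !files.manifest
        @warn "no Manifest.toml in $dir; exact versions are not pinned"
    end
    Pkg.activate(dir)
    instantiate && Pkg.instantiate()  # may require network access
    return Base.active_project()
end
```

A recipient of the two files would call something like `activate_shared_environment(".")` from the directory that holds them.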

tkf commented 6 years ago

If opening a notebook can instantiate arbitrary Manifest.toml (which may contain arbitrary repo-url), isn't it a security hole? Isn't it also incompatible with the security model of Jupyter notebook (= trust if you execute it)?

How about adding a simple function that uploads Project.toml and Manifest.toml to gist and then call IJulia.load_string to inject something like

Pkg.activate("https://gist.github.com/.../...")

to a notebook cell? Of course, Pkg.activate then has to support downloading *.toml when a URL is specified. Pkg.activate can also check if those packages are from the known registries and prompt user if not.

Alternatively, I guess you can use cell attachments to bundle *.toml into the notebook file but it would require the kernel and the server to be in the same machine. For example, it won't work if you launch a Julia kernel on a HPC cluster via a Jupyter notebook server running on your laptop.
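Pkg.activate does not accept URLs, so the gist idea would need a wrapper. A minimal sketch, assuming Julia 1.6 or later for the `Downloads` standard library: `activate_from_urls` is a hypothetical name, the URLs are whatever raw-file links the notebook author supplies, and the registry check and user prompt mentioned above are not implemented.

```julia
using Pkg
using Downloads  # standard library since Julia 1.6

# Hypothetical helper, not an existing Pkg or IJulia function.
# `fetch(source, destination)` defaults to an HTTP download; it is a keyword
# argument so the transfer mechanism can be swapped out.
function activate_from_urls(project_url::AbstractString, manifest_url::AbstractString;
                            fetch = Downloads.download, instantiate::Bool = true)
    dir = mktempdir()  # removed automatically when the Julia process exits
    fetch(project_url, joinpath(dir, "Project.toml"))
    fetch(manifest_url, joinpath(dir, "Manifest.toml"))
    Pkg.activate(dir)
    instantiate && Pkg.instantiate()  # installs the versions pinned in the manifest
    return dir
end
```

Because the files land in a temporary directory on the machine where the kernel runs, this sketch does not depend on the notebook file being visible to the kernel.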

jlperla commented 6 years ago

I don't know jupyter all that well, but isn't the security controlled by how it is contained? You can load local files, run shell stuff, etc if it lets you?

Certainly being able to instantiate a local manifest is not the long-run solution, and will not work for all scenarios, but I don't think it is a security hole.

vchuravy commented 6 years ago

I now have this in my notebooks

using Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()
pkg"precompile"

Pkg.activate(".") doesn't work well since you can start your jupyter notebook from any working directory.

KristofferC commented 6 years ago

Storing the Manifest + Project inside the notebook and having a button that does that would go a long way. There shouldn't be any security problems with that, it is just a convenience layer?

tkf commented 6 years ago

Their security model is:

  • Untrusted HTML is always sanitized
  • Untrusted Javascript is never executed
  • HTML and Javascript in Markdown cells are never trusted
  • Outputs generated by the user are trusted
  • Any other HTML or Javascript (in Markdown cells, output generated by others) is never trusted
  • The central question of trust is “Did the current user do this?”

--- https://jupyter-notebook.readthedocs.io/en/stable/security.html#our-security-model

So I don't think you can register any UI elements like a button to instantiate a project from the notebooks. Though I guess that's possible via a front-end extension.

I just thought using IJulia.load_string is a very simple and generic solution since it does not require writing any front-end extension. It is also useful outside Jupyter/IJulia.

Keno commented 6 years ago

So I don't think you can register any UI elements like a button to instantiate a project from the notebooks. Though I guess that's possible via a front-end extension.

This is a pretty fundamental feature, so integrating it nicely into the frontend for every julia notebook seems like the right way to do it.

tkf commented 6 years ago

If you are willing to write a front-end extension I think that's great! I have no intention of stopping it.

simonbyrne commented 6 years ago

The notebook does include a certain amount of notebook-wide metadata, detailing the language and kernel. e.g.

 "metadata": {
  "kernelspec": {
   "display_name": "Julia 1.0.0",
   "language": "julia",
   "name": "julia-1.0"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.0.0"
  }

It may be possible to insert and read the manifest information from there.

As far as a security model goes, one solution could be a confirmation dialog before installing any new package versions via activate.
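To illustrate the metadata idea, here is a sketch of how a Project.toml could be stored under, and recovered from, a notebook-level metadata dictionary. It assumes Julia 1.6 or later for the `TOML` standard library; the key `julia_project` and both function names are hypothetical, and nothing in Jupyter or IJulia defines them. It only manipulates an in-memory dictionary and does not show how a kernel would reach the notebook file.

```julia
using TOML  # standard library since Julia 1.6

# Hypothetical metadata key; not defined by Jupyter or IJulia.
const PROJECT_KEY = "julia_project"

# Return a copy of the notebook metadata with the parsed Project.toml embedded.
function embed_project(metadata::Dict{String,Any}, project_toml::AbstractString)
    result = copy(metadata)
    result[PROJECT_KEY] = TOML.parse(project_toml)
    return result
end

# Recover Project.toml text from metadata produced by `embed_project`.
function extract_project(metadata::Dict{String,Any})
    return sprint(TOML.print, metadata[PROJECT_KEY])
end

metadata = Dict{String,Any}(
    "kernelspec" => Dict{String,Any}("language" => "julia", "name" => "julia-1.0"),
)
project_toml = """
[deps]
Example = "7876af07-990d-54b4-ab0e-23690620f79a"
"""
embedded = embed_project(metadata, project_toml)
```

The same approach would apply to Manifest.toml, which is much larger because it records the full dependency tree.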

simonbyrne commented 6 years ago

Well, I asked on the Jupyter gitter: it seems like this is not possible via the current protocol, so if we wanted something along those lines we would need to do it via a jupyter extension.

simonbyrne commented 6 years ago

What if we added a function to IJulia which did something along the lines of what @vchuravy suggested, e.g.

using Pkg
function useproject(path=pwd())
    Pkg.activate(path)
    Pkg.instantiate()
    pkg"precompile"
end

Then, at the top of the notebook you could just do

IJulia.useproject()

tkf commented 6 years ago

It does not work when IJulia kernel and Jupyter server run in different machines. https://bitbucket.org/tdaff/remote_ikernel/src/default/ https://github.com/ipython/ipython/wiki/Cookbook:-Connecting-to-a-remote-kernel-via-ssh

StefanKarpinski commented 6 years ago

At JupyterCon I spoke with a few Jupyter folks and their take was that trying to put this kind of metadata into notebooks was not the right direction to go—they've tried this with images and other things in the past and have come to feel that the "unit of distribution" should be a git repo, not a single notebook file. So it seems like the way to go here might be to have IJulia automatically activate the project in the git repo that it's in. After all, you are running the code in the notebook, so presumably you trust it. (As compared to just starting a Julia process in a directory, which may or may not mean that you trust the content of the directory enough to execute it.)

stevengj commented 6 years ago

IJulia doesn't know what notebook file (if any) it is executing — that information is not provided to the kernel.

Keno commented 6 years ago

At JupyterCon I spoke with a few Jupyter folks and their take was that trying to put this kind of metadata into notebooks was not the right direction to go—they've tried this with images and other things in the past and have come to feel that the "unit of distribution" should be a git repo, not a single notebook file. So it seems like the way to go here might be to have IJulia automatically activate the project in the git repo that it's in. After all, you are running the code in the notebook, so presumably you trust it. (As compared to just starting a Julia process in a directory, which may or may not mean that you trust the content of the directory enough to execute it.)

If we go this way, I'd still like a way to package everything into a single file that you can email to somebody or share on JuliaBox (also have separate environments for every notebook on JuliaBox). If I just want to share some code with somebody, I don't think we can expect the workflow to be "Go clone this git repo".

jlperla commented 6 years ago

I don't think we can expect the workflow to be "Go clone this git repo".

I agree. Jupyter notebooks need to be usable in a self-contained way in some sense. Even the Jupyter UI is often built around the "Upload" notebook interface.

What about the ability to activate from a URL? You could give it the project file and/or manifest, and it would enable copying jupyter around. And if someone wanted to run the notebook in whatever global project they had in their current jupyter, they wouldn't need to use those cells?

StefanKarpinski commented 6 years ago

I'm just reporting what the Jupyter people (@Carreau if I recall correctly) told me, which is that they are moving away from trying to make notebooks self-contained because it has not worked out as hoped. The simplest solution would seem to be serving a zip or tar file containing a set of notebooks, resources used by the notebooks and, in our case, project and manifest files.

Carreau commented 6 years ago

Yes, we tend to think of (1 unit == 1 repository). The notebook as a unit, especially since you can now connect many notebooks to the same kernel, does not make much sense.

We haven't really figured out how to make all of this work completely, but generally trying to shove more into a notebook does not work.

As said before a repository does not always work, but I don't think we can get a "one size fits all". There is always this tension between being able to manipulate things on the filesystem, and having everything being opaque and managed by Jupyter.

You could of course have an extension for Jupyter that shows "bundles" as an actual tree of files, but then you can't cd into it.

Maybe something along the lines of a FUSE driver that exposes a single file at one path and the repo structure at another?

Carreau commented 6 years ago

@fperez would be interested in this discussion BTW, and I think we had pictures of a whiteboard with all the different axes of what people want from notebook files.

simonbyrne commented 6 years ago

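My experience is that embedding data in notebooks is a lost cause. e.g. the attachments feature is basically useless, since:

  • you can't access the attachments from the kernel (#625), so it's no use for embedding data.
  • attachments are lost by nbconvert (jupyter/nbconvert#699), so you can't even use it for embedding images if you want to, say, convert the notebook to a presentation.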

jlperla commented 6 years ago

Is there a reason not to enable URL-based Project and Manifest files? In a GitHub-based implementation, you could point it to the raw file, or a local URL. And notebooks copied around would then work.

Does that break the Jupyter security model? Since the user would actively choose to run the script and trust the notebook, it doesn't seem like it?

vchuravy commented 6 years ago

I often enough ship notebooks to students who work in isolated (supercomputing) environments. So needing internet access to replicate a notebook would be an annoyance.

If we say that the unit is the git repository that is fine, but I really would like to enable a workflow where somebody can just grab a notebook itself.

To me Project/Manifest are very unlike pictures and other attachments. A notebook doesn't break just because a picture is missing, but it won't work without the correct manifest. So embedding that in the Metadata would be preferable.

I recently gave a workshop and it boiled down to having an environment per notebook in a different subfolder.

tkoolen commented 6 years ago

Why is git relevant to this discussion? Isn't the actual 'unit' just a directory?

StefanKarpinski commented 6 years ago

Yes, I think you're right @tkoolen, but most of the time that directory is the root of a git repo and git repos do end up being the unit of reproducibility (although git trees actually work too).

simonbyrne commented 6 years ago

@tkf what does "it" refer to here?

It does not work when IJulia kernel and Jupyter server run in different machines.

Carreau commented 6 years ago

Yes, I think you're right @tkoolen, but most of the time that directory is the root of a git repo and git repos do end up being the unit of reproducibility (although git trees actually work too).

Yes, and usually it provides "collaboration" capability like synchronisation or something else. It's one way of abusing language, saying that the unit of reproducibility/sharing is bigger than a notebook (and may not contain a notebook).

tkf commented 6 years ago

@simonbyrne The IJulia.useproject() you suggested assumes that the kernel is running on the machine where the notebook file is. This is not true in general since you can connect to a remote kernel via ssh (say). But maybe you can argue that it is too exotic a use case to be worth supporting.

BTW, my suggestion https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-414033742 does not have these shortcomings you pointed out for the attachment approach:

  • you can't access the attachments from the kernel (#625), so it's no use for embedding data.
  • attachments are lost by nbconvert (jupyter/nbconvert#699), so you can't even use it for embedding images if you want to, say, convert the notebook to a presentation.

simonbyrne commented 6 years ago

Ah, thanks. That makes sense.

To be wholly self contained (i.e. avoid things like downloading manifests from gists), I think the only real solution is to do what @StefanKarpinski suggested and have an "environment API", so that you could specify something equivalent to the Manifest.toml in the first cell (perhaps in a less verbose format), along with a "snapshot" function that would generate the necessary input.

tkf commented 6 years ago

Manifest.toml has to hold rather big metadata (dependencies of dependencies) so my naive guess is that it's hard to squash it into a small notebook code cell.

self contained (i.e. avoid things like downloading manifests from gists)

Why do you want to avoid network access when you need it to install packages anyway? Also, you can include the gist SHA in the URL. Since a git revision identified by its SHA is immutable, the notebook then becomes fully self-contained in the sense that the dependency tree is fully determined by a single notebook file.

KristofferC commented 6 years ago

The manifest is not only needed for installing packages but also to determine what you can load. Without it, you are blind. So putting that in a url seems like a bad idea.

tkoolen commented 6 years ago

Alright, so do people think it's actually desirable to have the .toml files embedded in the notebook at this point? I'd actually argue that no, it's not desirable even if it is technically possible, because it'd be a very unconventional/magic way of working with Pkg.

Some thoughts:

tkf commented 6 years ago

The manifest is not only needed for installing packages but also to determine what you can load. Without it, you are blind.

@KristofferC That's why I suggested https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-414033742:

Pkg.activate can also check if those packages are from the known registries and prompt user if not.

Also, the notebook already has using/import statements in it. It's very transparent what is going to be loaded (given that it uses registries you trust).

tkf commented 6 years ago

But another way is to run jupyter notebook from the command line, and jupyter is of course agnostic as to the kernel, so how do you make those two consistent?

@tkoolen I think the default cwd of a jupyter kernel launched by jupyter notebook/lab is the directory of the notebook file.

StefanKarpinski commented 6 years ago

What's wrong with sending people a zip file of a directory?

tkf commented 6 years ago

@StefanKarpinski It does not work with some exotic use cases: https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-424907156. But I guess not supporting them makes sense since then a simple Pkg.activate("."); Pkg.instantiate() just works.

jlperla commented 6 years ago

Sorry if this is an asinine and already rejected suggestion, but what about skipping Jupyter based modifications and rely entirely on the package manager?

That is, users have the option to

  1. Follow https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-424933939 , etc.
  2. Create a lightweight package without any code, and probably just a Manifest and Project files. The notebook using the package is then completely self-contained.
    • Then at the top of the notebook they could say ] add MyNotebookProject; activate MyNotebookProject; instantiate or whatever. They would not do using MyNotebookProject because there is no code or reexporting. Or if not registered, then ] add https://github.com/myproject/MyNotebookProject.jl.
    • This would work with private repos, urls, local paths, private registries, etc....and the caching of the package manager would mean that it wouldn't necessarily require an internet connection and wouldn't require downloading if already cached - even if unregistered and added by URL.
    • Since the package would be installed wherever the kernel is running, it gives a lot of flexibility
    • The notebooks are self-contained, and in all likelihood the lightweight packages would be shared by a large number of notebooks on a site (since a super-set of the dependencies can be put into the file).

Is this an abuse of the package manager? Can the registries (and uncurated registries) handle a proliferation of lots of small convenience packages, or would it break things? Of course, this wouldn't really be one-project per notebook. I realize this is inconvenient with the current METADATA based package registration, but I imagine that infrastructure could change.

tkoolen commented 6 years ago

@StefanKarpinski, re:

What's wrong with sending people a zip file of a directory?

I think that's the way to go.

@tkf, re:

I think the default cwd of a jupyter kernel launched by jupyter notebook/lab is the directory of the notebook file.

That's true. But regardless of what pwd() is, an important question is still what Base.active_project() should be if you:

  1. run jupyter notebook from /some/path and open a Julia notebook (possibly in a subdirectory or a remote location).
  2. run using IJulia; notebook(dir="/some/path") from Julia started in an arbitrary directory and with an arbitrary Base.active_project(), and open the same notebook.

I think that calling Base.active_project() from the first cell of the notebook should at least return the same directory in both cases, but if we really believe in 'unit == directory', maybe it should be equal to joinpath(@__DIR__, "Project.toml") for the running notebook instead of the current ~/.julia/environments/v1.0/Project.toml. Either that or every notebook needs an IJulia.useproject() as the first cell as in https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-424199315, which has the advantage that it's clear what's going on, but would still be unfortunate boilerplate to have in (almost) every notebook.

Using joinpath(@__DIR__, "Project.toml") as the default Base.active_project() for a notebook also addresses https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-424216750 I think. But if this proposed default value is not what you want, you can always use Pkg.activate in a notebook cell to change it as desired, for exotic use cases.
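The question of which project is active can be checked from the first cell of a notebook. The following sketch compares `Base.active_project()` against the Project.toml path that the 'unit == directory' convention would predict; `environment_report` is a hypothetical helper, and the notebook directory is assumed to be known to the caller (by default the kernel's working directory).

```julia
# Hypothetical diagnostic: compare the active project with the Project.toml
# expected next to the notebook.
function environment_report(notebook_dir::AbstractString = pwd())
    expected = joinpath(notebook_dir, "Project.toml")
    active = Base.active_project()  # may be `nothing` if no project is active
    return (cwd = pwd(),
            active_project = active,
            expected_project = expected,
            matches = active == expected)
end
```

Running `environment_report()` in both launch scenarios listed above would show whether they agree; `matches` is false whenever the kernel is still using the default environment.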

tkf commented 6 years ago

Create a lightweight package without any code, and probably just a Manifest and Project files. The notebook using the package is then completely self-contained.

@jlperla Yeah that's essentially equivalent to my suggestion https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-414033742. Though I don't think you need to turn it into a package. With the current infrastructure, you can already do:

run(`git clone $URL workspace`)
cd("workspace")  # has Project.toml [and Manifest.toml] in it
using Pkg
Pkg.activate(".")
Pkg.instantiate()

tkf commented 6 years ago

@tkoolen I don't think automatically activating an environment works unless it is automatically instantiated. However, automatic instantiation deviates from Jupyter's security model (= you trust the notebook if you ever run it) https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-414096386. You could put using Pkg; Pkg.instantiate() in the first cell, but then a single IJulia.useproject() would do the same and be more explicit.

Using joinpath(@__DIR__, "Project.toml") as the default Base.active_project() for a notebook also addresses https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-424216750

No, I don't think so, because in the scenario I described, the notebook file and the kernel are on different machines (e.g., jupyter lab on your laptop and the IJulia kernel on some cloud compute node). pwd() of the remote IJulia kernel won't reflect the location of the local notebook path (whose directory may not exist on the remote machine).

tkf commented 6 years ago

Other than the "exotic" remote_ikernel usage, I wonder how the approach with *.ipynb and *.toml in a directory plays with the realtime collaboration in Jupyter, something like (now deprecated) jupyterlab-google-drive. It looks like the notebooks do not exist as a local JSON file anymore in this case and you can't have *.toml files besides them https://github.com/jupyterlab/jupyterlab-google-drive/issues/39.

StefanKarpinski commented 6 years ago

It does not work with some exotic usecase: #673 (comment).

I don't mean loading from a zip file, I mean just sending someone a zip file and then they unzip it. The only real requirement here seems to be being able to send people a single file. I don't see why having a single file on the local filesystem where it's running is required.

tkf commented 6 years ago

just sending someone a zip file and then they unzip it

@StefanKarpinski I'm just saying that this is too simplistic an approach to cover other usages in Jupyter notebook/lab. The kernel may be running on a different machine (e.g., remote_ikernel https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-424907156) and the file system may be virtualized (e.g., jupyterlab-google-drive https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-424979926).

StefanKarpinski commented 6 years ago

It does not seem like Jupyter currently has the features to support this kind of thing. I don't think it should really fall on us to try to work around such limitations. The appropriate path forward seems like it would be conveying what we would need to do what we want to do.

tkf commented 6 years ago

It does not seem like Jupyter currently has the features to support this kind of thing.

Jupyter has set_next_input protocol (invokable via IJulia.load_string) to support implementing what I suggested in https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-414033742
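As a rough illustration of the load_string approach, the text of the injected cell can be generated by a plain function and then handed to IJulia. `environment_cell` is a hypothetical helper; only `IJulia.load_string` is an existing function, and it is not called here because it only works inside a running IJulia session.

```julia
# Hypothetical helper: build the source of a cell that activates and
# instantiates the environment at `path`.
function environment_cell(path::AbstractString)
    return string("using Pkg\n",
                  "Pkg.activate(", repr(path), ")\n",  # repr adds quotes and escapes
                  "Pkg.instantiate()\n")
end
```

Inside a notebook, `IJulia.load_string(environment_cell("."))` would then place that code in a new input cell, where the user can read it before choosing to run it.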