Closed: Keno closed this issue 5 years ago.
In a discussion on Slack I came up with the following:
```julia
import LibGit2
using Pkg

env = joinpath(@__DIR__, "DemoEnvironment")
isdir(env) || LibGit2.clone("https://gist.github.com/2e4ebf0df689f4409d4341d366c89f15.git", env)
repo = LibGit2.GitRepo(env)
LibGit2.checkout!(repo, "708b17c88a89a88f08f4f1070e04b2a32974b1b7", force = true)

Pkg.activate(env)
pkg"instantiate"
pkg"precompile"
```
While quite verbose, it encapsulates what I want from Manifest integration in Jupyter.
Kristoffer had the idea that we could maybe have something like `Pkg.activate("https://gist...", "sha")`, which basically activates an anonymous environment and uses the Merkle hash of the gist repo + sha to cache and identify the environment.
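A minimal sketch of what that could look like, assuming a hypothetical `activate_remote` helper (this is not actual Pkg API; `env_cache_key` and the cache location are my own inventions):

```julia
# Sketch of the idea: activate an anonymous environment identified by
# (gist URL, commit SHA). All names here are hypothetical, not Pkg API.
import LibGit2
using Pkg, SHA

# Content-address the environment: the cache key is a hash of URL + SHA,
# so the same (url, sha) pair always maps to the same local directory.
env_cache_key(url::AbstractString, sha::AbstractString) =
    bytes2hex(sha256(string(url, '#', sha)))

function activate_remote(url::AbstractString, sha::AbstractString;
                         cache = joinpath(DEPOT_PATH[1], "environments", "remote"))
    env = joinpath(cache, env_cache_key(url, sha))
    if !isdir(env)                      # first use: fetch and pin to the SHA
        LibGit2.clone(url, env)
        LibGit2.checkout!(LibGit2.GitRepo(env), sha, force = true)
    end
    Pkg.activate(env)
    Pkg.instantiate()
    return env
end
```

Because the cache key is derived from both the URL and the SHA, repeated activations of the same pinned environment are cheap, and two different SHAs of the same gist never collide.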
I have wanted an environment-publish command before that takes my current environment and uploads it, so that I can share it with others for debugging purposes.
I'm glad this led to something that seems generally useful and also not Jupyter-specific.
Let me shamelessly point out that activating a remote repository is what I suggested in the very first post https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-414033742
FYI, we've tagged a release of QuantEcon/InstantiateFromURL.jl, which implements the idea from @vchuravy above.
To be clear, this first implementation is for a light repo with Project and Manifest files, which provides a solution for tightly controlled lecture notes, etc. The gist approach, which would be better for less formal setups, could be added as well if anyone is interested.
I think https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-425306944 is missing a `Pkg.build()` step in order for things to be guaranteed to work starting from a clean slate. It would be nice not to have to do that every time you run the notebook, though.
Instantiate builds the packages that got downloaded, so I don't think that is required.
I seem to recall cases when that didn't happen, but maybe that was just because the build had failed during an earlier `instantiate` call.
@tkoolen FYI, the way we avoid rebuilding every time is to either (a) precompile the resources, for git refs that point to moving targets like `master`, or (b) version the resources using git tags, so something like `activate_github("arnavs/InstantiationTest", tag = "v0.1.0")` will never be updated.
Perhaps there's a very simple solution to this problem: treat the desired embedded environment metadata as code in the first executable cell. The question then becomes how to make it unobtrusive in the standard Jupyter UI. It appears the UI doesn't do line wrapping, so there might be a simple answer to that as well: base64-encode the TOML files into a single line each.
The nice thing about this is that it's a solution for scripts which need to "come with their environment" just as much as Jupyter notebooks. Then we'd just need a package `ProjectEnvironments` (or something) with a very simple and forward/backward-compatible API which people could add manually, and which acts as the springboard into the well-defined environment for the notebook.
Would this work or have I missed something?
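The base64 idea above needs only stdlib functionality; here is a minimal sketch, where `encode_env`/`decode_env` are hypothetical names for what a package like `ProjectEnvironments` might expose:

```julia
# Sketch of the "one base64 line per TOML file" idea.
using Base64

# Collapse a TOML file's text into a single line that the Jupyter UI
# won't need to wrap, and recover it losslessly later.
encode_env(toml_text::AbstractString) = base64encode(toml_text)
decode_env(line::AbstractString) = String(base64decode(line))

project = """
name = "NotebookEnv"

[deps]
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
"""

line = encode_env(project)        # single line, safe to embed in a cell
@assert decode_env(line) == project
@assert !occursin('\n', line)
```

The round trip is exact, so the embedded line is just an opaque blob to the notebook UI while remaining a full Project/Manifest on decode.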
I tried implementing this; there are a few gotchas, but it looks like it will work. Gotchas include:
- `CodeEnvironments` (my working name for the package) needs to be installed from some default environment. Its API would need to be very forward and backward compatible.
- Generally there seems to be some impedance mismatch with `Pkg`, which is probably not a surprise given that I don't know a lot about `Pkg` ;-) It does, however, offer a way to have per-notebook embedded manifests and project files.
Another datapoint: I'd like to be able to send people links to colab notebooks with in-built environments, but the unit they use is a file :)
I think my proposed solution/workaround would be ok for that. Would you be interested in it becoming a registered package? I'd need to think a bit more about the workflow and API, and probably involve Pkg people to know whether it's going to work out, or is fundamentally broken in some way. But I'm not sure whether to do that extra work yet.
@Keno @c42f If you have been using a solution like this, there is another use case to consider: enabling notebook users to update the Project and Manifest when necessary. This has proven to be very important for our set of lecture notes; otherwise people effectively start copying notebooks around and editing those copies for assignments.
After doing this for the last 8 months, my gut says that metadata in a notebook could become hellish to maintain and lead to all sorts of user issues wondering why they have the wrong versions of packages. I used to be of the opinion that hidden metadata was the right way to go, but have reversed my stand completely. On the other hand, I will never reverse my stand that notebooks have to execute self-contained from a single file, and that copying TOML files around is a terrible idea.
For what it is worth, the approach we implemented from people's (i.e. @vchuravy's) suggestion in https://github.com/QuantEcon/InstantiateFromURL.jl/ has been very successful. Basically, it checks a `.project` file to see if the version of that package has been downloaded. If not, it downloads, activates, and instantiates. Otherwise it just activates. The instantiation has been a very helpful step for ensuring people are using the right versions of the packages, and it makes installation a joke. Take a look at https://github.com/QuantEcon/quantecon-notebooks-jl/blob/master/kalman.ipynb as an example, but basically all that is needed is

```julia
using InstantiateFromURL
activate_github("QuantEcon/QuantEconLecturePackages", tag = "v0.9.6");
```

at the top of the page. The Project and Manifest are versioned in https://github.com/QuantEcon/QuantEconLecturePackages
Now, for those who don't need a mini repository, @vchuravy had the initial idea that this sort of package could have a simple utility to set up a gist instead: https://github.com/QuantEcon/InstantiateFromURL.jl/issues/18. We didn't need it ourselves and couldn't put in the development time, but I think it is exactly the sort of thing that is needed for more lightweight package management.
... all of that is to say: before starting on any new solution, please see if the workflow in this package is solid and feel free to submit PRs for new features. If enough people vet this solution, a variation on it might make sense in Pkg.jl or at least a more formally maintained package.
@jlperla That sounds like a great workflow for your use case. My reservation is that it's not self contained and requires supporting infrastructure which can't easily be updated by the end users. This is probably a good feature in your case where you're running a class with homogeneous package requirements.
On the other hand, I'm helping a group of somewhat nontechnical PhD students with heterogeneous data management and analysis tasks. My thought is that I should be able to give them Jupyter notebooks (and normal scripts!) which have embedded self-contained environments. I'd also like them to be able to update package requirements as their needs change. But at the same time, I want those environments well defined and embedded within the notebooks, so that package requirements are somewhat resistant to user error (e.g. emailing a script and forgetting to add the Project and Manifest files).
> (e.g. emailing a script and forgetting to add the Project and Manifest files).
Emailing Project and Manifest files around simply does not work; I am completely with you. And people may put the wrong ones with the wrong files.
> My thought is that I should be able to give them jupyter notebooks (and normal scripts!) which have embedded self-contained environments. I'd also like them to be able to update package requirements as their needs change.
I understand the goal of a "self-contained environment", but I would decouple that from a self-contained file. Here are some usage scenarios:
> But at the same time, have them well defined and embedded within the notebooks so that package requirements are somewhat resistant to user error
These are the tip of the iceberg.... As I said, I used to think that this stuff belonged in the notebook but changed my tune completely after seeing usage scenarios.
> I'd also like them to be able to update package requirements as their needs change.
Having these things centrally managed is extremely helpful. But I understand that having a full repo for the Project/Manifest TOML files is a little heavy for most uses.
This is exactly why @vchuravy had originally suggested using a gist with some tools (which I will try to summarize below). For us, having a consistent set of versions to bump was very nice but things don't need to have a full and controlled repository.
Basically, I think he had in mind https://github.com/QuantEcon/InstantiateFromURL.jl/issues/18 as a formalization of https://github.com/JuliaLang/IJulia.jl/issues/673#issuecomment-425306944
The idea would be to create a gist on the user's GitHub account for a given project:

```julia
using InstantiateFromURL
hash = publish_gist(".") # by default, gets the local `Project.toml` and `Manifest.toml` from the local directory
# Could optionally pass in the github username, or use the github config to see it.
# e.g. hash = 2e4ebf0df689f4409d4341d366c89f15
```

Then, to activate it elsewhere:

```julia
using InstantiateFromURL
activate_gist("2e4ebf0df689f4409d4341d366c89f15") # optionally have a tag?
```

and `publish_gist(".", hash)` to commit and push changes... or something along those lines.
I've been using notebooks + TOML in gists for a while, and while it works, there are some hassles:
1) Setting it up is a bit of a pain: you have to create the gist, then clone it back to the directory. Could be addressed by a script (though you would require a GitHub API key), but would be nicer if it could be done via Jupyter itself. Once set up though, pushing updates is easy.
2) all my gists end up being called "simonbyrne/Manifest.toml" (I assume because this is the file that appears first when sorted by ASCII?). GitHub doesn't seem to provide a mechanism to rename them (you can change the comment that appears below, but not the name).
Not sure if this helps, but the InstantiateFromURL package grabs repo tarballs (which don’t require an API key), and we store them (names salted with SHA hash) in a hidden directory from where the script is run.
Could be different on the gist side, though.
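The "names salted with SHA hash" scheme described above can be sketched as follows; `resource_dir` and the `.projects` root are my own hypothetical names, not InstantiateFromURL's actual API:

```julia
# Sketch of the storage scheme described above: tarball contents go into a
# hidden directory whose name is salted with a SHA hash of repo + tag.
using SHA

function resource_dir(repo::AbstractString, tag::AbstractString;
                      root = joinpath(pwd(), ".projects"))
    # Salting with the hash means different (repo, tag) pairs never collide,
    # while re-activating the same pair finds the already-downloaded copy.
    salt = bytes2hex(sha256(string(repo, '@', tag)))[1:12]
    joinpath(root, string(replace(repo, '/' => '-'), '-', tag, '-', salt))
end
```

The directory name stays human-readable (repo and tag are visible) while the salt disambiguates anything the readable part can't.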
> Setting it up is a bit of a pain: you have to create the gist, then clone it back to the directory. Could be addressed by a script (though you would require a GitHub API key), but would be nicer if it could be done via Jupyter itself. Once set up though, pushing updates is easy.
I agree, and those sorts of scripts built into a package seem to be what Valentin was getting at. I think it is a perfect case for a light package (which could ultimately become a feature of Pkg3 itself). I am hesitant to say that we should have it in "jupyter" or IJulia since this is a more general problem than just jupyter notebooks.
If anyone wants to work on gist features, @arnavs and I would be happy to merge them into InstantiateFromURL.jl as a testbed.
> These are the tip of the iceberg.... As I said, I used to think that this stuff belonged in the notebook but changed my tune completely after seeing usage scenarios.
These are all good points but come with strong assumptions that:
Consider instead that you are helping a group of nontechnical colleagues (students and lab staff) with their individual projects, each of which has different package requirements. This situation is a very different use case, and I don't see how `InstantiateFromURL` can help with it.
> This situation is a very different use case and I don't see how InstantiateFromURL can help with it.
Hence the suggestion from some people to have gist-based workflows with simple publishing tools. We didn't build it because of lack of time, and because we didn't know the requirements since we didn't need it ourselves.
My points are primarily about the difficulty of having relatively non-technical people manage Project and Manifest files within the Jupyter files themselves, and all the things that can go wrong.
The other thing to consider is that the students can use a base set of packages and then install additional ones via build commands at the top of their own notebooks.
But I could be wrong... Maybe there is some sort of technology that could make managing embedded package information within a notebook seamless and manageable. But it is hard to imagine without deep integration of both IJulia and Pkg3 (which there seems to be little appetite for).
@c42f FYI, `IJulia.load_string` seems to be a better option than `clipboard` when you are using it in Jupyter notebooks.
> Mutable environments are somewhat problematic; you want the jupyter user to be able to add easily to the environment, but this conflicts with a desire to make them immutable and content addressed for the purposes of activating them from jupyter code.
I thought about how to address it. Here is an idea: put the following code, with a hypothetical function `use_packages` in a hypothetical package `IJuliaPkg`, at the top of the notebook:
```julia
using IJuliaPkg
use_packages(
    [
        "Plots",
        "DifferentialEquations",
    ],
)
```
which adds the packages in a plain environment, encodes `Project.toml` and `Manifest.toml` in base64 or uploads them to a gist (hereafter I call the Julia object for it `$ENCODED_PROJECT`), and then replaces the current cell with
```julia
using IJuliaPkg
use_packages(
    [
        "Plots",
        "DifferentialEquations",
    ],
    project = $ENCODED_PROJECT,
)
```
using `IJulia.load_string(..., true)`. It should be easy to make `use_packages` idempotent; i.e., do nothing other than `instantiate` + `activate` when the set of packages to be installed is identical to the one recorded in the `Project.toml` in `$ENCODED_PROJECT`. I think this lets you change the requirements of the notebook as you go. That is to say, if you want to import `PyPlot`, go to the top of the notebook and edit it to
```julia
using IJuliaPkg
use_packages(
    [
        "Plots",
        "DifferentialEquations",
        "PyPlot",
    ],
    project = $ENCODED_PROJECT,
)
```
and then hit shift+enter, which updates `$ENCODED_PROJECT`.
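The idempotency check at the heart of this could look roughly like the following; `needs_update` is a hypothetical helper, and only the stdlib `Base64` and `TOML` modules are used:

```julia
# Sketch of the idempotency check for the hypothetical `use_packages`:
# only re-resolve when the requested package set differs from the one
# recorded in the embedded (base64-encoded) Project.toml.
using Base64, TOML

function needs_update(pkgs::Vector{String}, encoded_project::AbstractString)
    project = TOML.parse(String(base64decode(encoded_project)))
    recorded = sort(collect(keys(get(project, "deps", Dict{String,Any}()))))
    return sort(pkgs) != recorded
end

encoded = base64encode("""
[deps]
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
DifferentialEquations = "0c46a032-eb83-5123-abaf-570d42b7fbaa"
""")

@assert !needs_update(["Plots", "DifferentialEquations"], encoded)
@assert needs_update(["Plots", "DifferentialEquations", "PyPlot"], encoded)
```

When `needs_update` returns `false`, `use_packages` would just `activate` + `instantiate`; otherwise it would re-resolve, re-encode, and rewrite the cell.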
@tkf thanks, that's an excellent point. I had just assumed overwriting a code cell from the kernel was impossible! With this in mind I think it's possible to have a self contained solution.
Closed by #820.
Now that 0.7 is getting closer, it may make sense to start thinking about how notebooks interact with the new package manager. I had discussed with @StefanKarpinski and @KristofferC that it would be great if notebooks could embed a Manifest, so that if you send somebody a notebook they could automatically load everything with the correct versions. Doing something like this would require figuring out where to store the information, how to hook it up to Pkg3, and probably some UI work as well.