Reproducible code - extra files, folder structure, external source archives?

pteehan commented 7 years ago

Auto-reviewers: @NiharikaRay @matthewwardrop @earthmancash @danfrankj

This is a bit of a discussion question - I'd like to understand how you see the knowledge_repo being used. My understanding is that one of the goals is to make knowledge reproducible which I strongly agree with. But let's imagine I have the following project structure:

notebooks/my_notebook.ipynb
src/extra_scripts_1.py
src/extra_scripts_2.py
src/extra_scripts_3.py
requirements.txt

The main file is my_notebook.ipynb, but all of these files are needed to make the code runnable; really the 'knowledge' is distributed across all of the files. Also, there are probably hardcoded paths in the notebook (e.g. run '../src/extra_scripts_1.py'). To support this, the following could be useful:

ability to add a folder e.g. knowledge_repo add (...) --src src/
preserve folder structure among added files
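
A minimal sketch of what "preserve folder structure" could mean in practice, assuming the extra files are simply copied next to the notebook. The function name `copy_sources`, its parameters, and the default glob patterns are hypothetical and are not part of the knowledge_repo API:

```python
import shutil
from pathlib import Path


def copy_sources(project_root, dest_dir, patterns=("src/**/*.py", "requirements.txt")):
    """Copy files matching `patterns` into `dest_dir`, keeping their relative paths.

    Hypothetical helper for illustration only; not a knowledge_repo function.
    """
    project_root = Path(project_root)
    dest_dir = Path(dest_dir)
    copied = []
    for pattern in patterns:
        # Sort the matches so the result is deterministic across platforms.
        for source in sorted(project_root.glob(pattern)):
            if not source.is_file():
                continue
            relative = source.relative_to(project_root)
            target = dest_dir / relative
            # Recreate the original folder layout under the destination.
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target)
            copied.append(relative.as_posix())
    return copied
```

Because the relative paths are kept, a hardcoded reference such as '../src/extra_scripts_1.py' would keep resolving as long as the notebook stays at the same depth relative to the copied files.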

And maybe, though this is a big question:

think about some kind of extra tooling for virtualenv or packrat to make it easier to manage dependencies

I saw in the discussion in #95 that there is a worry that the repo may get bloated. I agree that's a concern, but I don't see a way around it; either the knowledge repo includes everything, or the code is not runnable. The only other thing I can think of would be linking to source tarballs on S3 or something like that. If so, perhaps the knowledge repo could support some kind of a packaging and unpackaging process.

Looking forward to your insights. Thanks.

mdbenito commented 7 years ago

I agree that it should be possible to add several files.

I've been thinking about your last point too and I guess the only way one can ensure reproducibility is with some sort of container / virtual machine system. Docker seems to fit perfectly:

When creating a KP from the web interface, one selects in which environment it will run: python version, packages that will be used, etc. (maybe simplify this with some presets)
knowledge_repo create includes facilities to pull the relevant Docker image from Docker Hub or elsewhere and perform any additional configuration steps, such as installing extra packages.
Editing of the notebooks happens inside the Docker container.
Reviewing of pull requests happens from the UI as well: docker pull and running of the containerized notebook is automated.
Results can be automatically cached for browsing in the main interface, but running the notebooks themselves is just one click away (docker pull / build, run notebook)
etc.

Note that Docker images are built incrementally in layers, and each layer is content-hashed, so unchanged layers are reused rather than rebuilt. Also: overheads are minimal, you are ready to deploy to cloud computing services, etc.
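
As a rough illustration of the environment-selection idea, the chosen Python version and package list could be turned into a Dockerfile. This is only a sketch under assumptions: `render_dockerfile` is a hypothetical helper, and the `python:<version>-slim` base image, the `/kp` working directory, and the notebook filename are illustrative choices, not anything knowledge_repo provides:

```python
def render_dockerfile(python_version="3.11",
                      requirements="requirements.txt",
                      notebook="knowledge.ipynb"):
    """Return Dockerfile text for running one notebook (hypothetical helper)."""
    lines = [
        # Assumed base image: the official Python image for the chosen version.
        f"FROM python:{python_version}-slim",
        "WORKDIR /kp",
        # Copy the requirements first so the install layer is cached
        # until the requirements file itself changes.
        f"COPY {requirements} ./",
        f"RUN pip install --no-cache-dir -r {requirements} jupyter",
        "COPY . .",
        # Execute the notebook with nbconvert when the container starts.
        f'CMD ["jupyter", "nbconvert", "--to", "notebook", "--execute", "{notebook}"]',
    ]
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file named Dockerfile next to the notebook and running `docker build` there would then produce the image; presets in a UI could simply map to different argument values.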

matthewwardrop commented 7 years ago

Hi guys,

All of the ideas presented here sound really cool, and we'd love to get to a place where it makes sense to do these things. Having a way to guarantee that the code uploaded to the knowledge repository can run reproducibly would be a big boon. As a corollary of this, it may even be possible for the knowledge repo server to securely run code in the notebook itself.

One way to keep the repository from growing too large might be to store a prescription for how one would build the virtual environment required to run the notebook, rather than the actual virtual environment itself.

For example, we might be able to extend the knowledge post structure to something like:

knowledge_post.kp/
- build/
  - env.sh
  - requirements.txt
- src/
  - lib.py
  - ...
  - knowledge.ipynb
- knowledge.md
- FORMAT
- REVISION
- UUID

where build/env.sh would be the script required to recreate the virtual environment; and so on. This could also be used to recreate R environments, or even other languages.
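
A minimal sketch of what such a prescription could look like in Python, assuming the environment is a plain virtualenv built from build/requirements.txt in the layout above. The names `env_commands` and `recreate_env` are hypothetical; only the standard-library `venv` module and `pip install -r` are relied on:

```python
import subprocess
import sys
from pathlib import Path


def env_commands(kp_dir, env_dir):
    """Return the commands that would rebuild the environment for a post.

    Hypothetical helper: assumes <kp_dir>/build/requirements.txt exists.
    """
    requirements = Path(kp_dir) / "build" / "requirements.txt"
    # Virtualenvs keep their interpreter in Scripts/ on Windows, bin/ elsewhere.
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    env_python = Path(env_dir) / bin_dir / "python"
    return [
        [sys.executable, "-m", "venv", str(env_dir)],
        [str(env_python), "-m", "pip", "install", "-r", str(requirements)],
    ]


def recreate_env(kp_dir, env_dir):
    """Run the commands in order, stopping at the first failure."""
    for command in env_commands(kp_dir, env_dir):
        subprocess.run(command, check=True)
```

Only the small requirements file lives in the repository; the environment itself is rebuilt on demand, for example with `recreate_env("knowledge_post.kp", ".venv")`.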

Anyway... food for thought. We'll get there eventually!

nathancomiskey commented 7 years ago

+1 being able to add files to a knowledge post so they can be accessed from within the post would be really useful.

romanovzky commented 1 year ago

Hi all, this would be a great feature, as right now I need to add extra files (like src files) manually via git commands after I produce a KP.

JJJ000 commented 1 year ago

@romanovzky we are actively working on kp v2. If it is OK, could you provide more context on how you use knowledge repo?

romanovzky commented 1 year ago

Hi @JJJ000 (and replying also from https://github.com/airbnb/knowledge-repo/issues/271#issuecomment-1352607028 in a conversation with @csharplus )

I only work part time at my current role and I can't justify a full briefing/zoom meeting with either of you, but thanks for the suggestion @csharplus .

I can tell you about our use-case:

The decision to use knowledge-repo came from our need to share and persist data science and machine learning reports, prototypes, etc. across a heterogeneous group of stakeholders (from non-tech to engineers). The requirements were:

Ease of access for a wider audience: this excluded github as a notebook holder
Capacity to display the work in a self-contained manner as notebooks: this excluded Confluence and other wiki-style knowledge repositories
Possibility to host tutorials and on-boarding materials which are not notebooks (like Markdown files, etc.): this excluded heavier solutions like Binder
Versioning of work: this excluded turning the notebooks into Markdown and placing them on GitBook, etc.
Discoverability: this really highlighted the tags feature of knowledge-repo
Reproducibility: again, having the notebook as src allows for any other data scientist or engineer to pick up a report, adapt, update, and submit a new version

We have encountered a few problems with knowledge-repo:

Documentation is lacking on login service details, so login features are not implemented on our side
Documentation is lacking on how to add other files to be in src, which this post explains well. This is important for reproducibility, as the notebook associated with the knowledge post should bring all the required extra code
We were running the server on an instance, but the server seems to die every time we log out of the instance, so we are only running the server locally (this invalidates the wider audience aspect, but it's still early days so not a lot of harm has been done yet)

Ideally, I'd like to see a better workflow where the src and even build features are clear. The build part might seem irrelevant given that knowledge-repo is designed around static posts, but some engineers are suggesting that some of the data science reports could/should be re-run sporadically, and with both src and build in place we could possibly automate that process. Again, this is very much a "nice to have" at this stage for us.

Another "nice to have" would be a workflow within knowledge-repo that is compatible with, or easily adaptable to, workflows like that of Kedro (https://kedro.org/ , https://kedro.readthedocs.io/en/stable/). I imagine a Kedro->knowledge-repo pipeline would be a very powerful way for organisations to systematise and operationalise their data science reporting with reproducible reports.

Please let me know if you have further questions, I'll drop by whenever I find the time.

JJJ000 commented 1 year ago

@romanovzky thanks a lot for your feedback. Those are amazing ideas. For some of your concerns, such as the login service: kp currently supports a list of authentication mechanisms such as 'debug', 'oauth2', 'bitbucket', 'github', 'google', 'ldap'. Several fixes should be included in the next release. Regarding including extra files, we will prioritize this feature. Stay tuned on this. BTW, since we are working on V2, all feedback is welcome. Let us know if you would like to have a zoom call.

romanovzky commented 1 year ago

Thanks for listening! I will keep an eye on this repo and we'll talk if time permits! Cheers

airbnb / knowledge-repo

Reproducible code - extra files, folder structure, external source archives? #141