airbnb / knowledge-repo

A next-generation curated knowledge sharing platform for data scientists and other technical professions.
Apache License 2.0

Reproducible code - extra files, folder structure, external source archives? #141

Open pteehan opened 7 years ago

pteehan commented 7 years ago

Auto-reviewers: @NiharikaRay @matthewwardrop @earthmancash @danfrankj

This is a bit of a discussion question - I'd like to understand how you see the knowledge_repo being used. My understanding is that one of the goals is to make knowledge reproducible, which I strongly agree with. But let's imagine I have the following project structure:

notebooks/my_notebook.ipynb
src/extra_scripts_1.py
src/extra_scripts_2.py
src/extra_scripts_3.py
requirements.txt

The main file is my_notebook.ipynb, but all of these files are needed to make the code runnable; really the 'knowledge' is distributed across all of the files. Also, there are probably hardcoded paths in the notebook (e.g. run '../src/extra_scripts_1.py'). To support this, the following could be useful:

And maybe, though this is a big question:

I saw in the discussion in #95 that there is a worry that the repo may get bloated. I agree that's a concern, but I don't see a way around it; either the knowledge repo includes everything, or the code is not runnable. The only other thing I can think of would be linking to source tarballs on S3 or something like that. If so, perhaps the knowledge repo could support some kind of a packaging and unpackaging process.
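To make the packaging idea concrete, here is a minimal sketch of what such a packaging and unpackaging step might look like, using only the Python standard library. The function names `pack_project` and `unpack_project` and the default file patterns are hypothetical and not part of knowledge_repo; uploading the resulting tarball to S3 or elsewhere is left out.

```python
import tarfile
from pathlib import Path


def pack_project(project_dir, archive_path,
                 patterns=("notebooks/*.ipynb", "src/*.py", "requirements.txt")):
    """Bundle the files matching `patterns` into a gzipped tarball.

    Hypothetical helper: the patterns mirror the example layout above.
    """
    project_dir = Path(project_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for pattern in patterns:
            for path in sorted(project_dir.glob(pattern)):
                # Store paths relative to the project root so that relative
                # references such as '../src/extra_scripts_1.py' still resolve
                # after unpacking.
                tar.add(path, arcname=path.relative_to(project_dir).as_posix())
    return Path(archive_path)


def unpack_project(archive_path, target_dir):
    """Extract the tarball and return the relative paths of the restored files.

    Only unpack archives from a source you trust.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)
    return sorted(
        p.relative_to(target_dir).as_posix()
        for p in target_dir.rglob("*")
        if p.is_file()
    )
```

With something like this, the knowledge post would only need to store a link to the archive, at the cost of depending on the external storage staying available.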

Looking forward to your insights. Thanks.

mdbenito commented 7 years ago

I agree that it should be possible to add several files.

I've been thinking about your last point too and I guess the only way one can ensure reproducibility is with some sort of container / virtual machine system. Docker seems to fit perfectly:

Note that Docker images are built incrementally, and everything is hashed at a very fine granularity. Also, overheads are minimal, you are ready to deploy to cloud computing services, etc.

matthewwardrop commented 7 years ago

Hi guys,

All of the ideas presented here sound really cool, and we'd love to get to a place where it makes sense to do these things. Having a way to guarantee that the code uploaded to the knowledge repository can run reproducibly would be a big boon. As a corollary of this, it may even be possible for the knowledge repo server to securely run code in the notebook itself.

One way to keep the repository from growing too large might be to store a prescription for how one would build the virtual environment required to run the notebook, rather than the actual virtual environment itself.

For example, we might be able to extend the knowledge post structure to something like:

knowledge_post.kp/
- build/
  - env.sh
  - requirements.txt
- src/
  - lib.py
  - ...
  - knowledge.ipynb
- knowledge.md
- FORMAT
- REVISION
- UUID

where build/env.sh would be the script required to recreate the virtual environment; and so on. This could also be used to recreate R environments, or even other languages.
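As a rough illustration of the "prescription" idea, the sketch below derives the commands that would recreate a Python virtual environment from a post's `build/requirements.txt`. It is written in Python rather than as an `env.sh` script, and everything in it is an assumption: the function name `env_commands`, the `build/.venv` location, and the use of the standard `venv` module and `pip` are not part of the knowledge post format.

```python
import sys
from pathlib import Path


def env_commands(post_dir, python=sys.executable):
    """Return the commands that would rebuild the environment for a post.

    Hypothetical helper: assumes the proposed layout with a build/ folder,
    and places the environment in build/.venv (an arbitrary choice).
    """
    build_dir = Path(post_dir) / "build"
    venv_dir = build_dir / ".venv"

    # Step 1: create an empty virtual environment with the stdlib venv module.
    commands = [[python, "-m", "venv", str(venv_dir)]]

    # Step 2: install pinned dependencies, if the post declares any.
    requirements = build_dir / "requirements.txt"
    if requirements.is_file():
        bin_dir = "Scripts" if sys.platform == "win32" else "bin"
        venv_python = venv_dir / bin_dir / "python"
        commands.append(
            [str(venv_python), "-m", "pip", "install", "-r", str(requirements)]
        )
    return commands
```

Each returned command could then be executed in order with `subprocess.run(cmd, check=True)`. Returning the commands instead of running them keeps the prescription inspectable before anything is installed.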

Anyway... food for thought. We'll get there eventually!

nathancomiskey commented 7 years ago

+1 being able to add files to a knowledge post so they can be accessed from within the post would be really useful.

romanovzky commented 1 year ago

Hi all, this would be a great feature, as right now I need to add extra files (like src files) manually via git commands after I produce a KP.

JJJ000 commented 1 year ago

@romanovzky we are actively working on kp v2. If it is ok, could you provide more context on how you use knowledge repo?

romanovzky commented 1 year ago

Hi @JJJ000 (and replying also from https://github.com/airbnb/knowledge-repo/issues/271#issuecomment-1352607028 in a conversation with @csharplus )

I only work part time at my current role and I can't justify a full briefing/zoom meeting with either of you, but thanks for the suggestion @csharplus .

I can tell you about our use-case:

The decision to use knowledge-repo came from our need to share and persist data science and machine learning reports, prototypes, etc. across a heterogeneous group of stakeholders (from non-tech to engineers). The requirements were:

We have encountered a few problems with knowledge-repo:

Ideally, I'd like to see a better workflow where the src and even build features are clear. The build part might seem irrelevant given that knowledge-repo is designed around static posts, but some engineers are suggesting that some of the data science reports could/should be re-run sporadically, and with a src and build we could possibly automate that process. Again, this is very much a "nice to have" at this stage for us.

Another "nice to have" would be a workflow within knowledge-repo that is compatible with, or easily adaptable to, workflows like Kedro's (https://kedro.org/ , https://kedro.readthedocs.io/en/stable/). I imagine a Kedro->knowledge-repo pipeline would be a very powerful way for organisations to systematise and operationalise their data science reporting with reproducible reports.

Please let me know if you have further questions, I'll drop by whenever I find the time.

JJJ000 commented 1 year ago

@romanovzky thanks a lot for your feedback. Those are amazing ideas. For some of your concerns, such as the login service, kp right now supports a list of authentication mechanisms such as 'debug', 'oauth2', 'bitbucket', 'github', 'google', 'ldap'. Several fixes should be included in the next release. Regarding including extra files, we will prioritize this feature. Stay tuned on this. BTW, since we are working on V2, all feedback is welcome. Let us know if you would like to have a zoom call.

romanovzky commented 1 year ago

Thanks for listening! I will keep an eye on this repo and we'll talk if time permits! Cheers