pteehan opened this issue 7 years ago
I agree that it should be possible to add several files.
I've been thinking about your last point too, and I guess the only way one can ensure reproducibility is with some sort of container / virtual machine system. Docker seems to fit perfectly: `knowledge_base create` could include facilities to pull the relevant docker image from dockerhub (or wherever) and perform any additional configuration steps, like installing extra packages. Note that docker images are built incrementally, with every layer content-hashed at a fine grain. Also: overheads are minimal, you are ready to deploy to cloud computing services, etc.
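To make the suggestion concrete, here is a minimal sketch of what such a Docker hook might do. Everything here is hypothetical: `knowledge_base create` has no such facility today, and the function name, image name, and package list are illustrative only. The function composes the commands rather than executing them, so they can be inspected or logged first.

```python
import subprocess  # would be used to actually execute the commands


def docker_env_commands(image, extra_packages=()):
    """Compose the docker commands a hypothetical `knowledge_base create`
    hook could run: pull a base image, then install extra packages.
    (A real setup would bake the packages into a derived image via a
    Dockerfile so they persist; this sketch just shows the shape.)"""
    cmds = [["docker", "pull", image]]
    if extra_packages:
        cmds.append(["docker", "run", image, "pip", "install", *extra_packages])
    return cmds


# Example: a base notebook image plus two extra packages
for cmd in docker_env_commands("jupyter/scipy-notebook", ["seaborn", "plotly"]):
    print(" ".join(cmd))
```

Returning argv lists keeps the sketch testable and leaves the decision of when (and whether) to shell out to the caller.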
Hi guys,
All of the ideas presented here sound really cool, and we'd love to get to a place where it makes sense to do these things. Having a way to guarantee that the code uploaded to the knowledge repository can run reproducibly would be a big boon. As a corollary, it may even become possible for the knowledge repo server to securely run the code in the notebook itself.
One way to keep the repository from growing too large might be to store a prescription for how to build the virtual environment required to run the notebook, rather than the actual virtual environment itself.
For example, we might be able to extend the knowledge post structure to something like:
knowledge_post.kp/
- build/
  - env.sh
  - requirements.txt
- src/
  - lib.py
  - ...
- knowledge.ipynb
- knowledge.md
- FORMAT
- REVISION
- UUID
where `build/env.sh` would be the script required to recreate the virtual environment, and so on. The same mechanism could recreate R environments, or environments for other languages.
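As a rough illustration of the prescription idea, the rebuild step could be little more than creating a venv and installing from the post's `build/requirements.txt`. This is a sketch under assumptions (the `rebuild_env` name, the `.kp_env` directory, and a POSIX `bin/pip` layout are all hypothetical); it returns the commands instead of running them so the caller decides when to execute.

```python
import sys
from pathlib import Path


def rebuild_env(post_dir, env_dir=".kp_env"):
    """Sketch: recreate the virtual environment described by a knowledge
    post's build/ prescription, instead of shipping the environment itself.
    Assumes a POSIX venv layout (env_dir/bin/pip)."""
    reqs = Path(post_dir) / "build" / "requirements.txt"
    return [
        [sys.executable, "-m", "venv", env_dir],
        [str(Path(env_dir) / "bin" / "pip"), "install", "-r", str(reqs)],
    ]


for cmd in rebuild_env("knowledge_post.kp"):
    print(" ".join(cmd))
```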
Anyway... food for thought. We'll get there eventually!
+1. Being able to add files to a knowledge post so they can be accessed from within the post would be really useful.
Hi all, this would be a great feature, since currently I need to add extra files (like `src` files) manually via git commands after I produce a KP.
@romanovzky we are actively working on KP v2; if it's ok, could you provide more context on how you use knowledge repo?
Hi @JJJ000 (and replying also from https://github.com/airbnb/knowledge-repo/issues/271#issuecomment-1352607028 in a conversation with @csharplus )
I only work part time at my current role and I can't justify a full briefing/zoom meeting with either of you, but thanks for the suggestion @csharplus .
I can tell you about our use-case:
The decision to use knowledge-repo came about from our need to share and persist data science and machine learning reports, prototypes, etc. across a heterogeneous group of stakeholders (from non-tech to engineers). Among the requirements:
- it allows any other data scientist or engineer to pick up a report, adapt, update, and submit a new version

We have encountered a few problems with knowledge-repo, one being the lack of a supported way to attach `src` files, which this post explains well. This is important for reproducibility, as the notebook associated with the knowledge post should bring along all the required extra code. Ideally, I'd like to see a better workflow where the `src` and even `build` features are clear. The `build` part might seem irrelevant given knowledge-repo's static design, but some engineers are suggesting that some of the data science reports could/should be re-run sporadically; with a `src` and `build` in place we could possibly automate that process, but again this is very much a "nice to have" at this stage for us.
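The sporadic re-run idea could be quite small once a `build` step exists: apply the build prescription, then re-execute the post's notebook. A sketch, assuming the `knowledge_post.kp` layout proposed earlier in the thread and using `jupyter nbconvert` (the `rerun_command` helper is hypothetical):

```python
from pathlib import Path


def rerun_command(post_dir):
    """Sketch of the re-run step: execute the post's notebook in place
    with nbconvert, after the build/ prescription has been applied."""
    nb = Path(post_dir) / "knowledge.ipynb"
    return ["jupyter", "nbconvert", "--to", "notebook",
            "--execute", "--inplace", str(nb)]


print(" ".join(rerun_command("knowledge_post.kp")))
```

A scheduler (cron, CI, etc.) could run this per post and commit the refreshed notebook as a new revision.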
Another "nice to have" would be a workflow within knowledge-repo that is compatible with, or easily adaptable to, tools like Kedro (https://kedro.org/ , https://kedro.readthedocs.io/en/stable/). I imagine a Kedro->knowledge-repo pipeline would be a very powerful thing for organisations wanting to systematise and operationalise their data science reporting with reproducible reports.
Please let me know if you have further questions, I'll drop by whenever I find the time.
@romanovzky thanks a lot for your feedback. Those are amazing ideas. For some of your concerns, such as the login service, KP right now supports a list of authentication mechanisms: 'debug', 'oauth2', 'bitbucket', 'github', 'google', 'ldap'. Several fixes should be included in the next release. Regarding including extra files, we will prioritize this feature; stay tuned. BTW, since we are working on V2, all feedback is welcome. Let us know if you would like to have a zoom call.
Thanks for listening! I will keep an eye on this repo, and we'll talk if time permits! Cheers
Auto-reviewers: @NiharikaRay @matthewwardrop @earthmancash @danfrankj
This is a bit of a discussion question - I'd like to understand how you see the knowledge_repo being used. My understanding is that one of the goals is to make knowledge reproducible, which I strongly agree with. But let's imagine I have a project structure along these lines:

my_project/
- notebook/
  - my_notebook.ipynb
- src/
  - extra_scripts_1.py
  - ...
The main file is my_notebook.ipynb, but all of these files are needed to make the code runnable; really, the 'knowledge' is distributed across all of the files. Also, there are probably hardcoded paths in the notebook (e.g. run '../src/extra_scripts_1.py'). To support this, the following could be useful:
And maybe, though this is a big question:
I saw in the discussion in #95 that there is a worry that the repo may get bloated. I agree that's a concern, but I don't see a way around it; either the knowledge repo includes everything, or the code is not runnable. The only other thing I can think of would be linking to source tarballs on S3 or something like that. If so, perhaps the knowledge repo could support some kind of a packaging and unpackaging process.
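The packaging/unpackaging idea above can be sketched with the standard library alone. Assume (hypothetically) that a post links to a tarball of its sources stored out-of-repo, e.g. on S3; the function names here are illustrative, not an existing knowledge-repo API:

```python
import tarfile
from pathlib import Path


def pack_sources(src_dir, out_path):
    """Packaging step: bundle a post's source tree into a gzipped tarball
    that could be stored out-of-repo (e.g. on S3) and linked from the
    post, instead of committing every file to the knowledge repo."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src_dir, arcname=Path(src_dir).name)


def unpack_sources(archive_path, dest_dir):
    """Matching unpackaging step: restore the source tree next to the
    post so its hardcoded relative paths resolve again."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
```

Uploading/downloading the tarball would be a separate, storage-specific step; the repo itself would only hold the link and a checksum.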
Looking forward to your insights. Thanks.