datalad / datalad.org

Website sources of datalad.org
https://www.datalad.org

Example involving existing source code? #55

Closed jjlee closed 5 years ago

jjlee commented 5 years ago

As a programmer doing some machine learning and interested in datalad for reproducible work and some annex features, what would encourage me to get my feet wet with datalad is an example showing how to import some existing source code (presumably using datalad install?) without my code repo getting too entangled with datalad / git annex.

I want datalad to do its thing by complaining if I'm running steps I want to be reproducible but haven't committed my code -- I guess datalad's answer to this is to make the source code repo a subdataset of my datalad repo(s)? However, if I can avoid it I don't want my source code github repos ending up with a lot of git annex / datalad branches & submodules & metadata that I don't yet understand, or accidentally add code files annexed when they should be unannexed, or get in the way of working with the source code as a separate repo rather than as a datalad "subdataset" submodule.

Does datalad support that? If so, a short example along these lines would go a long way towards "yes this is the right tool for me, let's try it" :-)

mih commented 5 years ago

I want datalad to do its thing by complaining if I'm running steps I want to be reproducible but haven't committed my code -- I guess datalad's answer to this is to make the source code repo a subdataset of my datalad repo(s)?

Yes, that is the way.

However, if I can avoid it I don't want my source code github repos ending up with a lot of git annex / datalad branches

Datalad will not automatically add an annex to a plain Git repository (which your code repo will be), and it does not add branches or other magic either.

submodules

Your code repo will be a submodule of the dataset you are working on, but not vice versa, hence your code repo is not touched per se.
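To illustrate the one-way relationship, here is a minimal sketch using plain Git only, on the assumption that a subdataset is registered in the same way as an ordinary Git submodule. The repository names `code-repo` and `analysis`, the file `train.py`, and the helper `g` are all hypothetical; with datalad the equivalent steps would go through its own commands rather than `git submodule add`.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Hypothetical helper: git with a throwaway identity, and local-path
# clones allowed for the submodule step (needed on recent Git versions).
g() { git -c user.name=demo -c user.email=demo@example.com -c protocol.file.allow=always "$@"; }

# Stand-in for an existing source code repository.
g init -q code-repo
echo 'print("hello")' > code-repo/train.py
g -C code-repo add train.py
g -C code-repo commit -q -m "Add training script"
before=$(g -C code-repo rev-parse HEAD)

# Stand-in for the dataset that will reference the code.
g init -q analysis
g -C analysis commit -q --allow-empty -m "Initial commit"
g -C analysis submodule --quiet add "$work/code-repo" code
g -C analysis commit -q -m "Register code repo as a submodule"

# The original code repository has no new commits and no new branches.
after=$(g -C code-repo rev-parse HEAD)
branches=$(g -C code-repo for-each-ref --format='%(refname:short)' refs/heads | wc -l | tr -d ' ')
echo "HEAD before: $before"
echo "HEAD after:  $after"
echo "branches in code repo: $branches"
```

The submodule record (`.gitmodules` plus a pinned commit) lives only in `analysis`; `code-repo` itself is merely cloned from.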

metadata

Metadata handling is completely optional. Unless you actively run aggregate-metadata, no metadata will appear anywhere.

accidentally add code files annexed when they should be unannexed

As mentioned above, your code repo will stay a plain Git repo, hence no accidental annexing is possible. In your working repo, you can configure the desired behavior via .gitattributes, or (if you do not need any annexing at all) you can also just use plain Git repos everywhere (see create --no-annex).
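As a rough sketch of the .gitattributes route: git-annex reads an `annex.largefiles` attribute, and the value `nothing` matches no files, so matching files stay in plain Git. The file patterns and the file name `train.py` below are illustrative assumptions, not a recommended configuration. The check uses only `git check-attr`, so it shows which attribute would apply without needing git-annex installed.

```shell
set -e
repo=$(mktemp -d)
git init -q "$repo"

# Illustrative rules: never annex Python or shell sources.
cat > "$repo/.gitattributes" <<'EOF'
*.py annex.largefiles=nothing
*.sh annex.largefiles=nothing
EOF

# Ask Git which annex.largefiles value applies to a (hypothetical) file.
attr=$(git -C "$repo" check-attr annex.largefiles -- train.py)
echo "$attr"
```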

get in the way of working with the source code as a separate repo rather than as a datalad "subdataset" submodule

Unless you manually run add -r or rev-save -r, datalad run will simply always complain, and will not "get in the way" otherwise ;-)
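The "complain when there are uncommitted changes" behavior can be pictured with a small stand-in written in plain Git. This is not how datalad run is implemented; `run_if_clean`, the helper `g`, and `train.sh` are hypothetical names used only to show the idea of refusing to execute a step while the working tree is dirty.

```shell
set -e
repo=$(mktemp -d)

# Hypothetical helper: git inside the demo repo with a throwaway identity.
g() { git -C "$repo" -c user.name=demo -c user.email=demo@example.com "$@"; }

g init -q
echo 'echo training' > "$repo/train.sh"
g add train.sh
g commit -q -m "Add training script"

# Hypothetical guard: run a command only if nothing is uncommitted.
run_if_clean() {
  if [ -n "$(g status --porcelain)" ]; then
    echo "refusing to run: uncommitted changes" >&2
    return 1
  fi
  (cd "$repo" && "$@")
}

# Clean tree: the step runs.
clean_result=$(run_if_clean sh train.sh)

# Modify the script without committing: the step is refused.
echo '# tweak' >> "$repo/train.sh"
if run_if_clean sh train.sh 2>/dev/null; then
  dirty_status=ran
else
  dirty_status=refused
fi

echo "clean run output: $clean_result"
echo "dirty run: $dirty_status"
```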

I am closing this issue here. This is the repo of the datalad website. Please reopen at https://github.com/datalad/datalad for further discussion.