Edit scripts to correspond to the data sharing plan and test scripts

surchs commented 3 years ago

We now have a data sharing plan. Now the code can be edited so all the hardcoded paths point to the file names and locations defined in the data sharing plan.

Let's say the data will live under the root of this repository in a folder called data (i.e. /data). Then a script in the /scripts_folder pointing to the fractional anisotropy input matrices should use the relative path:

../data/{leftstri or rightstri}_fa_input_matrix.mat

and so on. So to complete this issue:

[x] edit all the scripts to use relative paths (relative to the /scripts_folder) and the new file names defined in the data sharing plan
[x] make a copy of the intermediate files listed in the data sharing plan and rename them according to the plan (this + some documentation will then become the paper data release)
[ ] make sure you have an up-to-date local copy of the scripts with the newly changed paths. Sidenote: there is a chance this is trickier than it sounds if you have made changes both on your local copy of the repo and here directly on GitHub. A safe way to check is to go to your local copy of the repo, run git fetch to get the updated git history from GitHub (without yet changing any of your local files) and then git status to learn what, if anything, has changed remotely or locally. If there aren't any conflicts you can then run git pull to bring the new or updated files into your local repo copy.
[x] move your folder with the correctly renamed intermediate files into the git repo (and make sure the folder is called data). Be careful not to add the folder or its contents to git though, because we don't want to track the data. A good way to tell git to just ignore the entire ./data/ directory is to use a .gitignore file. This is just a file called .gitignore (no file ending). In it you can write data/ to ignore the data directory. You can take a look at some common templates too.
[ ] with everything in place, make sure that your scripts load and save to the correct paths. If a script takes long to run then maybe just run the part of the script that loads the data. We just want to make sure that the paths you typed in the scripts and the actual paths really match. And you can always find some surprising little things by trying.
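The relative-path and path-checking steps above could be sketched in Python roughly as follows. This is only a minimal sketch under assumptions: the helper names fa_input_matrix and missing_files are made up for illustration, and the layout (a data folder next to the scripts folder, with files named {leftstri or rightstri}_fa_input_matrix.mat) is the one described in this issue.

```python
from pathlib import Path

# Folder that contains this script (the /scripts_folder from the issue).
# Falls back to the current directory when run interactively.
try:
    SCRIPTS_DIR = Path(__file__).resolve().parent
except NameError:
    SCRIPTS_DIR = Path.cwd()

# Equivalent of "../data" relative to the scripts folder.
DATA_DIR = SCRIPTS_DIR.parent / "data"


def fa_input_matrix(side, data_dir=DATA_DIR):
    """Return the path of the FA input matrix for 'leftstri' or 'rightstri'.

    Hypothetical helper: the file name pattern follows the data sharing plan
    as quoted in this issue.
    """
    if side not in ("leftstri", "rightstri"):
        raise ValueError(f"unknown side: {side!r}")
    return Path(data_dir) / f"{side}_fa_input_matrix.mat"


def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [Path(p) for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    expected = [fa_input_matrix(side) for side in ("leftstri", "rightstri")]
    for path in missing_files(expected):
        print(f"missing: {path}")
```

Running a check like this before the long-running part of a script is a cheap way to confirm that the paths typed in the script and the actual paths really match.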

This might take some time to do. Let me know if you encounter any questions or difficulties.

corinnerobert commented 3 years ago

move your folder with the correctly renamed intermediate files into the git repo (and make sure the folder is called data). Be careful not to add the folder or its contents to git though

How do I add the data folder to the git repo if I told git to ignore it with a .gitignore file? I'm confused

raihaan commented 3 years ago

There is a section in https://www.atlassian.com/git/tutorials/saving-changes/gitignore on Committing an ignored file. Would this help? I haven't done this myself before so I will defer to Sebastian

surchs commented 3 years ago

Hey everyone. @corinnerobert: great question and important conceptual thing about git! I don't have a great tutorial to link here (this is complete but hard to follow, I find) and this is a bit tricky - I'll find something or make something.

The short answer is that when you have a local git repository that lives on your hard drive under a folder, there are three things that you need to distinguish:

Your local working directory. That is just the folder as you can see it in a file explorer or look at in a command line with e.g. ls. All the happy files in there are on your hard drive. But, and that's the key part to understand, they aren't necessarily tracked by git just because they are in this folder.
Let's say you have a new file in the local folder and this file isn't tracked by git. You think: this is a good file, I'd like to put this under version control. Then you run git add that_file_name.txt. This puts the file in the git staging area. This is kinda like your packaging area. You put all the stuff you want git to track (e.g. new files or files that have changed) here.
Now when you have everything you want in your staging area, you tie your package and put it inside git. This is called a commit. You use git commit to do that, and then enter a little message to explain what's in the package (aka commit). This is the final step and makes git keep track forever of the stuff in this commit.

So coming back to your question:

How do I add the data folder to the git repo if I told git to ignore it with a .gitignore file? I'm confused

Perfectly reasonable. The answer is: you put the folder in your local working directory, but because you tell git to ignore it (with the .gitignore file), git will never ask you whether you want to move that folder to the staging area (unless you do some of the black magic @raihaan pointed out) or even whether you want to commit it.

I like to think of it as an office:

| My office | Git |
| --- | --- |
| My messy office floor | My local working directory |
| My organized desktop where I prepare a package to send to a friend who runs a beautiful archive of things | My git staging area where I collect things I want to commit to my repo |
| My beautifully organized and tied package to my friend who now files it into a beautiful archive | My beautifully commented git commit that is now tracked by git and inside my repo |

So basically you copying the folder to the local working directory is equal to you keeping a grey box of stuff you sometimes need on the floor in your office. But you don't want to send this box to your archivist friend because it's heavy and messy and you don't want it archived. So you make a little note on your desktop that says: "ignore grey boxes of stuff when tying packages". Hope that clarifies it.

edit: maybe this tutorial is a bit more clear on this than the previous link

corinnerobert commented 3 years ago

This is super clear and it answers my question thank you!

corinnerobert commented 3 years ago

Hi, I realized I have some scripts that need some individual maps (for instance script 5) and I'm not sure how to deal with that

surchs commented 3 years ago

@corinnerobert: can you describe this problem in some more detail e.g.

which scripts need inputs that aren't part of the paper-related data release (aka intermediate data)
what inputs does each script need
are these inputs part of the current data release plan (e.g. are they part of the planned "general data release")
if they aren't, where are these data currently, how are they created.

I believe this will clarify the problem for both of us. Based on this we'll just update our plan for the "general" and "paper related" data releases and then see if that creates any new problems. It may be useful to go back to the "data-flow chart" we created in the beginning to map out the inputs and outputs of each step.

corinnerobert commented 3 years ago

It is the script 5_sample_nmf_to_nii.py. As inputs it uses:

the subjects' t1t2 maps
average striatum labels (the label needs to be in each subject's individual folder in order for the script to work)

These files are part of the "general data release plan", only they are not in the data/ folder

corinnerobert commented 3 years ago

Also, as we are using some python or matlab packages, more specifically:

TractRec package in script 5
Brainlets package in scripts 2
PLS package in scripts 6

Can we just say in the documentation to download those packages and tell the user to adjust the paths to those packages in the scripts?

surchs commented 3 years ago

ah ok. if it's in the general data release then I think we're good. I would suggest something like this:

have the "paper data" / intermediate data AND the "general data" / preprocessed data as two subfolders under /data. Maybe you can come up with some good names for each of them.
Add a note to the documentation that in order to be able to run script 5, the reader will have to get the full general data release. If they don't want to do that (which would be very understandable given the size), we have included the output of script 5 in our paper-data-release.
for your own testing and trialing purposes, just put the general data release as a subdir of the data/ folder and make sure the paths resolve correctly.
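One way the two-subfolder layout could be handled inside script 5 is sketched below. The subfolder names paper_data and general_data are placeholders (the actual names are still to be chosen), and require_general_data is a hypothetical helper, not something that exists in the repo.

```python
from pathlib import Path

# Placeholder subfolder names; the thread leaves the actual names open.
PAPER_SUBDIR = "paper_data"      # intermediate data shipped with the paper
GENERAL_SUBDIR = "general_data"  # preprocessed data from the general release


def require_general_data(data_dir):
    """Return the general-data folder, or explain what to do if it is absent."""
    general = Path(data_dir) / GENERAL_SUBDIR
    if not general.is_dir():
        raise FileNotFoundError(
            f"{general} not found. Script 5 needs the full general data "
            f"release; alternatively, use the script 5 output provided in "
            f"{Path(data_dir) / PAPER_SUBDIR}."
        )
    return general
```

A reader who only downloaded the paper data would then get a message pointing to the included script 5 output instead of a bare "file not found" deep inside the script.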

surchs commented 3 years ago

Also, as we are using some python or matlab packages, more specifically:

* [TractRec package](https://github.com/CoBrALab/TractREC/tree/master/TractREC) in script 5
* [Brainlets package](https://github.com/asotiras/brainparts) in scripts 2
* [PLS package](http://pls.rotman-baycrest.on.ca/UserGuide.htm) in scripts 6

Can we just say in the documentation to download those packages and tell the user to adjust the paths to those packages in the scripts?

Yeah, that's totally fine. For the python case, you can just add the packages to your requirements.txt if they are published on pypi.org. Your docs should have some kind of "how to set up the compute / processing environment" section (sometimes just called "Installation") where you can list the software requirements as links to the github repos. You are not required to provide a working environment for the reader but it's nice to make sure that they can follow along in your footsteps and with some reasonable work on their own get your code to run.

surchs commented 3 years ago

Also, as we are using some python or matlab packages, more specifically:

* [TractRec package](https://github.com/CoBrALab/TractREC/tree/master/TractREC) in script 5
* [Brainlets package](https://github.com/asotiras/brainparts) in scripts 2
* [PLS package](http://pls.rotman-baycrest.on.ca/UserGuide.htm) in scripts 6

Can we just say in the documentation to download those packages and tell the user to adjust the paths to those packages in the scripts?

Ah, OK. I see what the issue is here. None of these are "installable" in the sense that you can just run a command to have them in your path (edit: what I mean is, you cannot resolve these dependencies with dependency management like pip or Pipenv). For the matlab packages, that's clear and expected. You'll only need to point to their installable files / git repos (and maybe remind readers of the addpath(genpath(..)) stuff to add them). For the TractRec python scripts, you could do one of two things:

Also just tell people to clone the scripts locally and then manually add them to the python path as you have done
You could add the TractRec repo as a git submodule to your lib folder.

The second option has two possible advantages:

A reader can just clone your repo and also get the TractRec scripts directly through the submodule, provided they know to run git clone with the recursive flag (git clone --recursive). That's not guaranteed, so you'll probably have to document this very well: git submodules aren't very accessible for git beginners. Importantly, this may also be true for yourself, so it's perfectly reasonable for you to decide that this isn't worth your time right now.
The other advantage would be that you know where the scripts will live relative to your path so you can just hard-code them into your scripts and the reader won't have to change anything (again, provided they know how to get the submodules, because by default they aren't cloned).
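For either option, the "add them to the python path" step could look roughly like the sketch below. It assumes the TractRec scripts end up in a folder such as lib/TractREC at the repo root. That location is only an assumption for illustration, and add_to_python_path is a hypothetical helper.

```python
import sys
from pathlib import Path


def add_to_python_path(package_dir):
    """Put package_dir at the front of sys.path if it is not already there."""
    package_dir = Path(package_dir).resolve()
    if not package_dir.is_dir():
        raise FileNotFoundError(
            f"{package_dir} not found. If it is a git submodule, it may not "
            f"have been cloned yet."
        )
    entry = str(package_dir)
    if entry not in sys.path:
        sys.path.insert(0, entry)
    return package_dir


# Example call from a script in the scripts folder (assumed layout:
# <repo>/lib/TractREC next to <repo>/scripts_folder):
# add_to_python_path(Path(__file__).resolve().parent.parent / "lib" / "TractREC")
```

With the submodule option the folder location is known in advance, so a call like the commented example can be hard-coded; with the manual-clone option the reader would have to edit that one path.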

corinnerobert / striatum_micro_nmf

Edit scripts to correspond to the data sharing plan and test scripts #6