datalad / datalad-ukbiobank

Resources for working with UKBiobank as a DataLad dataset
MIT License

download of 1000 subject subset with condor #8

Closed: loj closed this issue 4 years ago

loj commented 4 years ago

I tested out the ukb_create_participant_ds and ukb_update_participant_ds scripts created by @mih, using condor to download a 1000-subject subset.

To start, I created a CSV file listing the subjects and modalities that I wanted:

```
0001234,20227_2_0,20249_2_0,20252_2_0
0001235,20227_2_0,20249_2_0,20252_2_0
0001236,20227_2_0,20249_2_0,20252_2_0
0001237,20227_2_0,20249_2_0,20252_2_0
0001238,20227_2_0,20249_2_0,20252_2_0
0001239,20227_2_0,20249_2_0,20252_2_0
...
```
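(A minimal sketch of how such a file could be generated from a plain list of subject IDs; the `subject_ids.txt` input and the fixed modality set are assumptions for illustration, not part of the original workflow.)

```
#!/bin/sh
# Hypothetical helper: append the same modality columns to every subject ID.
# subject_ids.txt (one ID per line) and the modality list are assumptions.
modalities="20227_2_0,20249_2_0,20252_2_0"
while read -r subject_id; do
  printf '%s,%s\n' "${subject_id}" "${modalities}"
done < subject_ids.txt > subset_rfrmi_tfrmi_t1.csv
```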

Then, I used the following to call the scripts and submit jobs to condor:

To create the single-participant datasets: ./ukb_create_submit_gen.sh | condor_submit

ukb_create_submit_gen.sh

```
#!/bin/sh

logs_dir=~/logs/ukb/create

# create the logs dir if it doesn't exist
[ ! -d "$logs_dir" ] && mkdir -p "$logs_dir"

# print the .submit header
printf "# The environment
universe = vanilla
getenv = True
request_cpus = 1
request_memory = 1G

# Execution
initial_dir = /data/project/rehab_biobank/1000_subset/
executable = /data/project/rehab_biobank/1000_subset/ukb_create_participant_ds
\n"

# create a job for each subject
for line in $(cat subset_rfrmi_tfrmi_t1.csv); do
  subject_id=${line%%,*} && line=${line#${subject_id},}
  modalities=$(echo ${line} | sed 's/,/ /g')
  printf "arguments = ${subject_id} ${subject_id} ${modalities}\n"
  printf "log = ${logs_dir}/sub-${subject_id}_\$(Cluster).\$(Process).log\n"
  printf "output = ${logs_dir}/sub-${subject_id}_\$(Cluster).\$(Process).out\n"
  printf "error = ${logs_dir}/sub-${subject_id}_\$(Cluster).\$(Process).err\n"
  printf "Queue\n\n"
done
```
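For reference (not in the original comment), the script prints one shared header followed by a per-subject stanza, so the description piped into condor_submit looks roughly like this for the first subject in the example CSV; `/home/<user>` stands in for the expanded home directory:

```
# The environment
universe = vanilla
getenv = True
request_cpus = 1
request_memory = 1G

# Execution
initial_dir = /data/project/rehab_biobank/1000_subset/
executable = /data/project/rehab_biobank/1000_subset/ukb_create_participant_ds

arguments = 0001234 0001234 20227_2_0 20249_2_0 20252_2_0
log = /home/<user>/logs/ukb/create/sub-0001234_$(Cluster).$(Process).log
output = /home/<user>/logs/ukb/create/sub-0001234_$(Cluster).$(Process).out
error = /home/<user>/logs/ukb/create/sub-0001234_$(Cluster).$(Process).err
Queue
```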

To download the data: ./ukb_update_submit_gen.sh | condor_submit

ukb_update_submit_gen.sh

```
#!/bin/sh

logs_dir=~/logs/ukb/update

# create the logs dir if it doesn't exist
[ ! -d "$logs_dir" ] && mkdir -p "$logs_dir"

# print the .submit header
printf "# The environment
universe = vanilla
getenv = True
request_cpus = 1
request_memory = 1G

# Execution
initial_dir = /data/project/rehab_biobank/1000_subset/
executable = /data/project/rehab_biobank/1000_subset/ukb_update_participant_ds
\n"

# create a job for each subject
for line in $(cat subset_rfrmi_tfrmi_t1.csv); do
  subject_id=${line%%,*} && line=${line#${subject_id},}
  printf "arguments = ${subject_id} ../.ukbkey\n"
  printf "log = ${logs_dir}/sub-${subject_id}_\$(Cluster).\$(Process).log\n"
  printf "output = ${logs_dir}/sub-${subject_id}_\$(Cluster).\$(Process).out\n"
  printf "error = ${logs_dir}/sub-${subject_id}_\$(Cluster).\$(Process).err\n"
  printf "Queue\n\n"
done
```
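(Not from the original thread, but for anyone replaying this: once the description is piped to condor_submit, the standard HTCondor tools can be used to follow the jobs, and the per-subject logs written by the script above can be inspected afterwards.)

```
# submit the generated description
./ukb_update_submit_gen.sh | condor_submit

# watch the queue, then check a subject's error log once its job has run
condor_q
tail ~/logs/ukb/update/sub-0001234_*.err
```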

mih commented 4 years ago

Thanks for sharing your code, @loj!

Let's ask @AlexandreHutton if he would share his code for reorganizing these downloads into a BIDS-compliant form. That would make an initial set of helper tools complete. Thanks in advance!

AlexandreHutton commented 4 years ago

Working on it. The scripts were written for an intermediate format, which I hadn't realized at the time; I'm fixing them to work directly on the downloaded format. They should be up by Monday.

mih commented 4 years ago

@AlexandreHutton Wonderful, thanks much!

AlexandreHutton commented 4 years ago

PR #9 created. It adds the scripts as a submodule; I had some difficulties adding the files directly, and this seemed like the easiest solution. Alternatives are fine with me. Note that some of the contents come from another package under Apache 2.0, so it may be worth keeping those files separate regardless.
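(For readers unfamiliar with the submodule approach: a minimal sketch of registering an external scripts repository this way; the URL and destination path below are placeholders, the actual values are in PR #9.)

```
# Register an external repository as a git submodule
# (placeholder URL and path; see PR #9 for the actual ones).
git submodule add https://github.com/<user>/<scripts-repo>.git scripts/bids
git commit -m "Add BIDS conversion scripts as a submodule"
```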

mih commented 4 years ago

I think we can close this now. Thx much!