TomMaullin / Confounds

Python implementation of the FMRIB UK Biobank confounds code.


This repository contains a Python implementation of the UK Biobank deconfounding code. For best performance, this code should be run on a high-performance computing (HPC) cluster.


The pyconfounds code requires Python version 3.9.2 or higher. To install it, clone the repository and run pip from inside it:

git clone
cd Confounds
pip install .

You must also set up your dask-jobqueue configuration file, which is likely located at ~/.config/dask/jobqueue.yaml. This requires some details about your HPC system; see the dask-jobqueue documentation for further detail. For instance, if you are using rescomp, your jobqueue.yaml file may look something like this:

    name: dask-worker

    # Dask worker options
    cores: 1                 # Total number of cores per job
    memory: "200GB"                # Total amount of memory per job
    processes: 1                # Number of Python processes per job

    interface: ib0             # Network interface to use like eth0 or ib0
    death-timeout: 60           # Number of seconds to wait if a worker can not find a scheduler
    local-directory: "/path/of/your/choosing/"       # Location of fast local storage like /scratch or $TMPDIR
    log-directory: "/path/of/your/choosing/"
    silence_logs: True

    # SLURM resource manager options
    shebang: "#!/usr/bin/bash"
    queue: short
    project: null
    walltime: '01:00:00'
    job-cpu: null
    job-mem: null

    # Scheduler options
    scheduler-options: {'dashboard_address': ':46405'}


To run pyconfounds, first specify your settings in config.yml, then run it following the guidelines below. The possible inputs to this file are listed below.

Mandatory fields

The following fields are mandatory:

Optional fields

The following fields are optional:


Below is an example of the config.yml file.

cluster_type: sge
num_nodes: 100
datadir: /path/to/data/directory/
outdir: /path/to/output/directory/
logfile: /path/to/log.html
MAXMEM: 2**32
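
A note on the MAXMEM entry: a YAML loader will typically read `2**32` as a string rather than a number, so the expression has to be evaluated somewhere. How pyconfounds itself handles this is not documented here; the sketch below is only one safe way to turn such an expression into an integer without calling `eval()`. The function name `parse_memory_expression` is hypothetical, and treating the result as a number of bytes is an assumption.

```python
import ast
import operator

# Arithmetic operators permitted in an expression such as "2**32".
_ALLOWED_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Pow: operator.pow,
}


def parse_memory_expression(text):
    """Evaluate a simple integer expression (hypothetical helper, not part of pyconfounds)."""

    def _eval(node):
        # Plain non-negative integer literals are accepted as-is.
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, int)
                and not isinstance(node.value, bool)):
            return node.value
        # Binary operations are accepted only for the whitelisted operators.
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_OPS:
            return _ALLOWED_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        # Anything else (names, calls, attribute access, ...) is rejected.
        raise ValueError(f"Unsupported memory expression: {text!r}")

    return _eval(ast.parse(str(text).strip(), mode="eval").body)


maxmem = parse_memory_expression("2**32")
# Assuming the value is a byte count, this is 4 GiB.
print(maxmem, "bytes =", maxmem / 2**30, "GiB")
```

Under the byte-count assumption, the example value of `2**32` corresponds to 4 GiB.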

Running the Analysis

PyConfounds can be run from the terminal as follows:

pyconfounds <name_of_your_yaml_file>.yml

You can watch your analysis progress by checking the logfile (see above). To view it in a browser, first serve it over HTTP by running the following commands on your HPC system:

cd /path/to/log/html/
python -m http.server <remote port>

where <remote port> is the port you want to host the file on (e.g. 8701). In a separate terminal, you must then tunnel into your HPC as follows:

ssh -L <local port>:localhost:<remote port> username@hpc_address

where the local port is the port you want to view on your local machine and the remote port is the port hosting the HTML log file. You should now be able to access the HTML log file in a browser by opening http://localhost:<local port>/<your log file>.html. When parallelized computation is being performed using dask, the dask dashboard address is displayed in the log.html file and can be accessed by port forwarding in the same manner.
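
As a worked instance of the steps above, the sketch below assembles the two commands and the resulting URL for one concrete choice of ports. The helper names are hypothetical and not part of pyconfounds; `username`, `hpc_address`, and the paths are the same placeholders used above and must be replaced with your own values.

```python
import shlex


def log_server_command(log_dir, remote_port):
    """Command to run on the HPC system to serve the log directory."""
    return f"cd {shlex.quote(log_dir)} && python -m http.server {int(remote_port)}"


def tunnel_command(local_port, remote_port, user, host):
    """Command to run on your local machine to forward the remote port."""
    return f"ssh -L {int(local_port)}:localhost:{int(remote_port)} {user}@{host}"


def log_url(local_port, log_name):
    """Address to open in a local browser once the tunnel is up."""
    return f"http://localhost:{int(local_port)}/{log_name}"


# Example: serve on remote port 8701 and view it on local port 8080.
print(log_server_command("/path/to/log/html/", 8701))
print(tunnel_command(8080, 8701, "username", "hpc_address"))
print(log_url(8080, "log.html"))
```

The local and remote ports do not have to match; the only requirement is that the remote port in the ssh command is the one passed to `http.server`.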


Output from pyconfounds can be given in one of two ways: as csv files, or as a collection of MemoryMappedDF npz files. The output files are as follows:

| Filename | Description |
|---|---|
| IDPs.csv(.npz) | The initial IDPs. |
| nonIDPs.csv(.npz) | Variables that are used but are neither IDPs nor confounds. |
| misc.csv(.npz) | Miscellaneous variables kept for posterity. |
| confounds.csv(.npz) | The initial confounds. |
| nonconfounds.csv(.npz) | The nonlinear confounds. |
| IDPs_deconf.csv(.npz) | The IDPs after deconfounding with initial confounds. |
| p.csv(.npz) | P-values for nonlinear confounds variance explained. |
| ve.csv(.npz) | Variance explained for nonlinear confounds. |
| nonlinear_confounds_reduced.csv(.npz) | Reduced nonlinear confounds. |
| IDPs_deconf_ct.csv(.npz) | The IDPs after deconfounding with nonlinear confounds. |
| confounds_with_ct.csv(.npz) | The confounds with nonlinear and crossed terms. |
| IDPs_deconf_smooth.csv(.npz) | The IDPs after deconfounding with nonlinear and crossed confounds. |
| confounds_with_smooth.csv(.npz) | The confounds with nonlinear, crossed and smoothed terms. |
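
After a run, it can be handy to check which of these files were actually written. The sketch below is a minimal example using only the standard library; `missing_outputs` is a hypothetical helper, and it assumes the files sit directly inside the output directory under exactly the names listed in the table.

```python
from pathlib import Path

# Base names taken from the table of output files.
OUTPUT_NAMES = [
    "IDPs", "nonIDPs", "misc", "confounds", "nonconfounds", "IDPs_deconf",
    "p", "ve", "nonlinear_confounds_reduced", "IDPs_deconf_ct",
    "confounds_with_ct", "IDPs_deconf_smooth", "confounds_with_smooth",
]


def missing_outputs(outdir, extension=".csv"):
    """Return the expected output filenames that are not present in outdir.

    Assumption: outputs are written directly into outdir (not subfolders).
    Pass extension=".npz" when the MemoryMappedDF format was used.
    """
    outdir = Path(outdir)
    return [
        name + extension
        for name in OUTPUT_NAMES
        if not (outdir / (name + extension)).is_file()
    ]


# Placeholder path; replace with the outdir from your config.yml.
for filename in missing_outputs("/path/to/output/directory/"):
    print("missing:", filename)
```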

If data are output using the MemoryMappedDF format, they may be read into Python as follows:

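# Import the MemoryMappedDF class
from pyconfounds.memmap.MemoryMappedDF import MemoryMappedDF

# read_memmap_df must be imported as well. The module path below is an
# assumption that mirrors the import above; check it against the package.
from pyconfounds.memmap.read_memmap_df import read_memmap_df

# Read in MemoryMappedDF
memory_mapped_df = read_memmap_df(<filename for dataframe>)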

The memory mapped dataframe object can be indexed and manipulated in a Jupyter notebook in a number of ways. Here is some example usage:


# Imports needed for the example
import numpy as np
import pandas as pd

# Create a dataframe
df = pd.DataFrame({
    'A': range(1, 101),
    'B': np.random.rand(100),
    'C': np.random.randint(1, 100, size=100)
})

# Memory mapped version
memory_mapped_df = MemoryMappedDF(df)

# Access all elements
memory_mapped_df[:, :]

# Access data using row index and column names
memory_mapped_df[1:20, ['A', 'B']]

# Access data using natural slicing syntax
memory_mapped_df[1:20, 0:1]
memory_mapped_df[1:20, 0]

# Accessing a single entry
memory_mapped_df[10, 'A']
memory_mapped_df[3, 0]

The MemoryMappedDF has the advantage that it can store metadata such as groupings of variables. You can list variable groupings in the MemoryMappedDF object as follows:


And retrieve groups of variables using:

memory_mapped_df.get_group(<group name>)

You can also search the columns using regular expressions:


If you wish to convert the memory_mapped_df to a csv you can do so using the following command:


Please note: At present, a MemoryMappedDF is saved across several files, so if you move or rename the files stored alongside the main npz file, you may find the MemoryMappedDF can no longer be opened. In addition, the file locations are currently hard coded inside the MemoryMappedDF object, so moving any of these files is not recommended.

Structure of the repository

This repository has the following structure.