This is the modeling platform used in the Alber lab (University of California, Los Angeles). For any inquiry, please send an email to Lorenzo (bonimba@g.ucla.edu); feedback and advice are much appreciated! The source code is in the igm folder.
In a nutshell: IGM generates a population of single-cell, full-genome (diploid/haploid) structures, which fully recapitulate a variety of experimental genomic and/or imaging data. As of now, it does NOT preprocess raw experimental data [details about preprocessing are provided in the main reference, Boninsegna et al, Nature Methods (2022), see below].
NEW: how to go from 'mcool' to 'hcs' HiC input, see below in the README.md
Jan 2022
If you use genome structures generated using this platform OR you use the platform to generate your own structures, please use the following reference for citing purposes:
Boninsegna, L., Yildirim, A., Polles, G. et al. Integrative genome modeling platform reveals essentiality of rare contact events in 3D genome organizations. Nat Methods 19, 938–949 (2022). https://doi.org/10.1038/s41592-022-01527-x.
igm: full IGM code
bin: IGM run, server and GUI scripts. In particular, refer to igm-run.sh (actual submission script) and igm-report.sh (post-processing automated script). GUI scripts have been discontinued.
demo: example inputs (.hcs, .json files) for the demo run
HPC_scripts: scripts to create the ipyparallel environment and submit an actual IGM run on an SGE scheduler-based HPC cluster
igm-run_scheme.pdf: a schematic which breaks down the different computing levels of IGM and tries to explain how the different parts of the code are related to one another
IGM_documentation.pdf: documentation (in progress)
igm-config_all.json: most comprehensive configuration file, which shows parameters for all data sets that can be accommodated
IGM no longer supports Python 2, so you'll need a Python 3 environment. The package depends on a number of other libraries, most of them publicly available on pip. In addition, some other packages are required:
Many of the alabtools and IGM dependencies can be installed with a few commands if you are using conda (https://www.anaconda.com/distribution/)
Please note, we are running conda versions dating back to 2019. More recent versions might cause compatibility issues.
# optional - create a new environment for igm
conda create -n igm python=3.6
source activate igm
# install dependencies
conda install pandas swig cython cgal==4.14 hdf5 h5py numpy scipy matplotlib \
tornado ipyparallel cloudpickle
The cgal version needs to be 4.14; there are some compatibility issues with the latest 5.0 version.
If you really do not want to use conda, most of the packages can be installed with pip, but it is up to you to download and build cgal and hdf5, and possibly set the correct include/library paths during installation.
Install alabtools (github.com/alberlab/alabtools)
pip install git+https://github.com/alberlab/alabtools.git
Note: on windows, conda CGAL generates the library, but the name depends on the build, e.g CGAL-vc140-mt-4.12.lib. Go to
Install IGM
pip install git+https://github.com/alberlab/igm.git
Download and build a serial binary of the modified LAMMPS version
git clone https://github.com/alberlab/lammpgen.git
cd lammpgen/src
make yes-user-genome
make yes-molecule
make serial
# create a user defaults file with the path of the executable
mkdir -p ${HOME}/.igm
echo "[DEFAULT]" > ${HOME}/.igm/user_defaults.cfg
# the shell is still inside lammpgen/src, which is where "make serial" leaves lmp_serial
echo "optimization/kernel_opts/lammps/lammps_executable = $(pwd)/lmp_serial" >> ${HOME}/.igm/user_defaults.cfg
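As a quick sanity check of the defaults file written above, the following sketch reads it back and tests whether the stored path is an executable file. It only treats the file as a plain INI file with a [DEFAULT] section (which is what the echo commands produce); how IGM itself parses the file is not shown here, and the helper names read_lammps_executable and is_usable_executable are made up for illustration.

```python
import configparser
import os
from pathlib import Path

# Key name copied from the echo command above
KEY = "optimization/kernel_opts/lammps/lammps_executable"


def read_lammps_executable(cfg_path):
    """Return the path stored under [DEFAULT], or None if it is absent."""
    parser = configparser.ConfigParser()
    parser.read(cfg_path)  # files that do not exist are silently skipped
    return parser.defaults().get(KEY)


def is_usable_executable(path):
    """True if path is an existing file with execute permission."""
    return path is not None and os.path.isfile(path) and os.access(path, os.X_OK)


if __name__ == "__main__":
    cfg = Path.home() / ".igm" / "user_defaults.cfg"
    exe = read_lammps_executable(cfg)
    print("lammps_executable:", exe)
    print("usable:", is_usable_executable(exe))
```

If "usable" comes out False, the path in the file most likely does not match the location of the lmp_serial binary that was just built.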
If all the dependencies have been installed correctly, code installation should only take a few minutes.
If the igm installation is successful, typing igm from the command line followed by TAB should show the different options (igm-run, igm-report, etc.).
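The same check can be done without relying on TAB completion. This is a minimal sketch that looks up the two commands named above on the PATH; the function name find_commands is hypothetical, and other igm-* commands may exist that are not listed here.

```python
import shutil


def find_commands(names=("igm-run", "igm-report")):
    """Map each command name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_commands().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

A "NOT FOUND" entry usually means the environment in which igm was installed is not the one currently active.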
In order to generate a population of structures, the code has to be run in parallel mode, and High Performance Computing is necessary. The scripts to do that on SGE scheduler-based HPC resources are provided in the HPC_scripts folder. Just to get an estimate, using 250 independent cores allows generating a population of 1000 structures at 200 kb resolution in 10-15 hours of computing time, which can vary depending on the number of different data sources that are used and on the number of iterations one decides to run.
Populations of 5 or 10 structures at 200 kb resolution (which is currently the highest resolution we have simulated) could in principle be generated serially on a "normal" desktop computer, but they have little statistical relevance. For example, 10 structures would only allow deconvoluting Hi-C contacts with probability larger than 10%, which is not sufficient for getting realistic populations. Serial executions are appropriate only at much lower resolution, as the computing burden is also much lower (an example is provided in the demo folder, see also Software demo).
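The relation between population size and the lowest contact probability that can be resolved is simply 1/N: a contact present in a single structure out of N has frequency 1/N. A small sketch of that arithmetic (function names are illustrative only):

```python
import math


def lowest_contact_probability(n_structures):
    """Smallest contact probability a population of this size can represent."""
    return 1.0 / n_structures


def structures_needed(min_probability):
    """Smallest population size able to represent the given probability."""
    # small tolerance guards against floating point noise in 1 / p
    return math.ceil(1.0 / min_probability - 1e-9)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "structures ->", lowest_contact_probability(n))
```

For instance, 10 structures give 0.1 (the 10% quoted above), while targeting 1% requires at least 100 structures.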
Due to the necessity of HPC resources, we strongly recommend that the software be installed and run in a Linux environment. ALL the populations we have generated and analyzed were generated using a Linux environment. We cannot guarantee full functionality on macOS or Windows.
In order to run IGM to generate a population which uses a given combination of data sources, the igm-config.json file needs to be edited accordingly, by specifying the input files and adding/removing the parameters for each data source when applicable (a detailed description of the different entries that are available is given under igm/core/defaults). Then, the software can be run using igm-run igm-config.json. Specifically:
Go into the igm-config.json file (or your config file) and edit optimization/kernel_opts/lammps/lammps_executable so that it points to the actual LAMMPS executable that was installed (see Installation on Linux).
Serial run: go into the igm-config.json (or your config) file and set parallel/controller to "serial". Then execute IGM (from the command line or by submitting a serial job to the HPC cluster) by typing igm-run config_file.json >> output.txt.
Parallel run: go into the igm-config.json file and set parallel/controller to "ipyparallel", then follow the steps detailed in the HPC_scripts/steps_to_submit_IGM.txt file and in the documentation, which rely on scripts also in the HPC_scripts folder. Specifically: create a running ipcluster environment (bash create_ipcluster_environment.sh followed by qsub submit_engines.sh) and only then submit the actual IGM calculation (qsub submit_igm.sh), which executes the igm-run igm-config.json command, i.e.
bash create_ipcluster_environment.sh
qsub submit_engines.sh
qsub submit_igm.sh
[Commands and syntax will need to be adapted if a scheduler other than SGE is available]
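The configuration edits described above can also be scripted rather than done by hand. The sketch below assumes that a slash-separated name such as parallel/controller refers to nested objects in the JSON file (i.e. a "controller" entry inside "parallel"); please check that against your own config file. The helper names set_by_path and configure are hypothetical and not part of IGM.

```python
import json


def set_by_path(config, path, value):
    """Set a nested value addressed by a slash-separated path."""
    keys = path.split("/")
    node = config
    for key in keys[:-1]:
        # create intermediate objects if they are missing
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config


def configure(in_path, out_path, controller, lammps_executable=None):
    """Write a copy of the config with the controller (and executable) set."""
    with open(in_path) as fh:
        config = json.load(fh)
    set_by_path(config, "parallel/controller", controller)
    if lammps_executable is not None:
        set_by_path(
            config,
            "optimization/kernel_opts/lammps/lammps_executable",
            lammps_executable,
        )
    with open(out_path, "w") as fh:
        json.dump(config, fh, indent=4)
    return config
```

For example, configure("igm-config.json", "igm-config_serial.json", "serial") writes a serial variant under a different name and leaves the original file untouched, which is in line with the advice of using distinct names for distinct setups.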
A successful run generates the igm.log and stepdb.sqlite files, a number of temporary files from the Assignment Steps and finally a sequence of intermediate .hss genome populations, each resulting from a different A/M iteration (see IGM_documentation.pdf). The file igm-model.hss will contain the optimized population at the end of the pipeline. hss files can be read conveniently using the alabtools package mentioned above.
An unsuccessful run leaves an err_igm file with details about why the run crashed. If a run accidentally crashes (for example, a node goes down), resubmitting the calculation using qsub submit_igm.sh (assuming the ipcluster environment is still up and running) will pick up exactly where the previous run left off. However, if a fresh calculation has to start from the top, please make sure all the temporary files (including the database stepdb.sqlite) and the tmp folder are removed before submitting.
In order to get familiar with the configuration file and the code execution, we provide a config_file.json demo configuration file for running a 2Mb resolution WTC11 population using Hi-C data only; it is found in the demo folder.
A comprehensive configuration file, igm-config_all.json, for running an HFF population with all data types (Hi-C, lamina DamID, SPRITE and 3D HIPMap FISH) is also provided here as a reference/template. Clearly, each user must specify their own input files.
Sample files are provided to simulate a Hi-C only population of WTC11 (spherical nucleus) at 2Mb resolution, to get familiar with the basics of the code.
Enter the demo folder: data and scripts for a 2Mb IGM calculation with Hi-C restraints are provided.
The .hcs file is a 2Mb resolution Hi-C contact map.
config_file.json is the .json configuration file with all the parameters needed for the calculation. In particular, we generate 100 structures, which means the lowest contact probability we can target is 0.01 (1%). For different setups, we recommend using different names for the configuration file to avoid confusion. Whatever name is chosen, it will have to be updated when running the scripts.
Run IGM (igm-run config_file.json), either serially or in parallel; the serial calculation (on a normal computer) all the way down to 1% probability should be completed in a few hours.
The alabtools package is required to easily generate a suitable '.hcs' HiC input file that can be fed to IGM. It is a three step procedure:
Starting from a .mcool file, extract the matrix of contact frequencies.
Preprocess the matrix as needed.
Save the preprocessed matrix to a .hcs file, the IGM-compatible format.
Our own preprocessing pipeline is specific to the lab and detailed in the Supporting Information to the Nat Methods paper, and we will share that soon (still curating the scripts). However, any preprocessing steps that generate a balanced/filtered/appropriate matrix of contact probabilities at the desired genome model resolution (e.g., 200 kbp, 1 Mbp, etc.) can be used, according to experience/need.
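import numpy
import scipy.sparse   # imported explicitly so scipy.sparse.csr_matrix is available later
import alabtools
# Read in the .mcool file at a given resolution
m = alabtools.Contactmatrix(ZZZ, resolution = XXX, genome = YYY)
# Convert the sparse contact matrix into a dense NumPy array
A = m.matrix.toarray()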
where ZZZ = name of the .mcool file (string), XXX = model resolution in kb (integer), YYY = genome segmentation (string); currently alabtools allows for 'hg19', 'hg38' (human) and 'mm9' (mouse) genome types.
m is an alabtools matrix object. m.matrix is a sparse (SSS) matrix, which can be converted to a regular NumPy array with the command m.matrix.toarray().
Now, you can perform any preprocessing on A = m.matrix.toarray() you see fit (filtering, matrix balancing, etc.) to prepare the input (see preamble). After you are satisfied with your preprocessing, it is easy to store the updated/preprocessed (contact probability) NumPy array A by:
# convert the NumPy array back to a sparse sss_matrix
spm = scipy.sparse.csr_matrix(A)
# the package was imported as alabtools, so refer to it by that name
m.matrix = alabtools.matrix.sss_matrix((spm.data, spm.indices, spm.indptr))
# save matrix to 'hcs' file
m.save('preprocessed_hic.hcs')
The 'preprocessed_hic.hcs' file now has the correct format to be fed to the IGM modeling pipeline.
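To make the "preprocess A, then convert back to sparse" step more concrete, here is a toy sketch that uses NumPy and SciPy only. It is NOT our lab pipeline and not a recommended normalization: it merely zeroes NaNs, symmetrizes, and rescales into [0, 1], as a stand-in for whatever balancing/filtering you choose. The function names toy_preprocess and to_csr_parts are invented for this example; the (data, indices, indptr) triplet is what gets passed to sss_matrix in the snippet above.

```python
import numpy as np
import scipy.sparse


def toy_preprocess(A):
    """Illustrative cleanup only: NaN -> 0, symmetrize, scale into [0, 1]."""
    A = np.nan_to_num(np.asarray(A, dtype=float), nan=0.0)
    A = 0.5 * (A + A.T)          # enforce symmetry
    peak = A.max()
    if peak > 0:
        A = A / peak             # largest entry becomes 1
    return np.clip(A, 0.0, 1.0)  # drop any negative leftovers


def to_csr_parts(A):
    """Return the CSR triplet (data, indices, indptr) of a dense array."""
    spm = scipy.sparse.csr_matrix(A)
    return spm.data, spm.indices, spm.indptr


if __name__ == "__main__":
    raw = np.array([[0.0, 2.0, np.nan],
                    [4.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0]])
    P = toy_preprocess(raw)
    print(P)
    print(to_csr_parts(P))
```

Whatever preprocessing is used in practice, the output should be a square matrix of contact probabilities at the model resolution, as stated above.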
Installation on macOS poses additional challenges, especially on 10.14 Mojave (updated Sept 2019). The standard GNU gcc compiler may not be pre-installed; instead, the more efficient clang might be (this can be checked with gcc --version):
$ which gcc
/usr/bin/gcc
$ gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.3.0 (clang-703.0.29)
Target: x86_64-apple-darwin15.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
If you are getting this printout, then there is NO actual gcc installed. In order to circumvent that, the following procedure worked for me:
First install gcc using Homebrew:
brew install gcc@9
A gcc compiler will be installed, but we still need to make sure it supersedes the default clang any time the C compiler is called. Assuming version 9 was installed, the default installation path reads /usr/local/Cellar/gcc/9.0.2/
Make sure the default gcc compiler points to that folder. This is the tricky part, since editing the PATH variable does not always seem to work. Renaming the executables seems to work but, again, there is no guarantee (see for instance https://stackoverflow.com/questions/28970935/osx-replace-gcc-version-4-2-1-with-4-9-installed-via-homebrew/28982564)
Then, alabtools can be installed in the regular way (and igm also).
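To double check which compiler the gcc command actually resolves to after these steps, the version banner can be inspected programmatically. This is a rough heuristic sketch: it simply searches the banner for the strings "clang" or "Apple LLVM", as in the printout shown earlier, and the function names are made up for illustration.

```python
import subprocess


def looks_like_clang(version_text):
    """Heuristic: True if a '--version' banner mentions clang / Apple LLVM."""
    text = version_text.lower()
    return "clang" in text or "apple llvm" in text


def compiler_banner(command="gcc"):
    """Return the combined stdout/stderr of '<command> --version', or ''."""
    try:
        result = subprocess.run(
            [command, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # works on Python 3.6 as well
        )
    except FileNotFoundError:
        return ""
    return result.stdout + result.stderr


if __name__ == "__main__":
    banner = compiler_banner("gcc")
    if not banner:
        print("gcc not found on PATH")
    elif looks_like_clang(banner):
        print("gcc is an alias for clang")
    else:
        print("gcc looks like a GNU compiler")
```

If the first or second message is printed, the Homebrew compiler is not yet the one being picked up.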