alberlab / igm

Integrated Genome Modeling
GNU General Public License v3.0

IGM: An Integrated Genome Modeling Platform

This is the modeling platform used in the Alber lab (University of California, Los Angeles). For any inquiry, please send an email to Lorenzo (bonimba@g.ucla.edu) (feedback and advice are much appreciated!). The source code is in the igm folder.

In a nutshell: IGM generates a population of single-cell, full-genome (diploid or haploid) structures that fully recapitulate a variety of experimental genomic and/or imaging data. As of now, it does NOT preprocess raw experimental data [details about preprocessing are provided in the main reference, Boninsegna et al., Nature Methods (2022), see below].

NEW: for how to convert 'mcool' Hi-C input into the 'hcs' format, see below in this README.

Jan 2022

Cite

If you use genome structures generated using this platform OR you use the platform to generate your own structures, please use the following reference for citing purposes:

Boninsegna, L., Yildirim, A., Polles, G. et al. Integrative genome modeling platform reveals essentiality of rare contact events in 3D genome organizations. Nat Methods 19, 938–949 (2022). https://doi.org/10.1038/s41592-022-01527-x.

Repository Organization

Dependencies

IGM no longer supports Python 2, so you will need a Python 3 environment. The package depends on a number of other libraries, most of which are publicly available on pip. In addition, some other packages are required:

Installation on linux

Important notes

Using IGM

In order to generate a population of structures, the code has to be run in parallel, and High Performance Computing (HPC) resources are necessary. Scripts for running on SGE scheduler-based HPC resources are provided in the HCP_scripts folder. As an estimate, 250 independent cores can generate a population of 1,000 structures at 200 kb resolution in 10-15 hours of computing time; the actual time varies with the number of data sources used and the number of iterations one decides to run.

Populations of 5 or 10 structures at 200 kb resolution (the highest resolution we have simulated so far) could in principle be generated serially on a "normal" desktop computer, but they have little statistical relevance. For example, 10 structures would only allow deconvolution of Hi-C contacts with probability larger than 10%, which is not sufficient for obtaining realistic populations. Serial execution is appropriate only at much lower resolution, where the computing burden is also much lower (an example is provided in the demo folder; see also Software demo).
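The rule of thumb behind the 10% figure above is that a population of N structures can only represent contact frequencies down to one contact in a single structure, i.e. 1/N. A minimal sketch (illustrative only, not part of IGM):

```python
# With N structures, the smallest non-zero contact frequency a population
# can represent is one contact in one structure, i.e. 1/N.
def min_resolvable_probability(n_structures: int) -> float:
    """Smallest contact probability representable by a population of n structures."""
    return 1.0 / n_structures

print(min_resolvable_probability(10))    # 0.1   -> only contacts with p >= 10%
print(min_resolvable_probability(1000))  # 0.001 -> rare contacts down to 0.1%
```

This is why a population of 1,000 structures (as in the Nature Methods paper) can capture much rarer contacts than a toy population of 10.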

Due to the necessity of HPC resources, we strongly recommend that the software be installed and run in a Linux environment. ALL the populations we have generated and analyzed were produced in a Linux environment; we cannot guarantee full functionality on MacOS or Windows.

In order to run IGM to generate a population using a given combination of data sources, the igm-config.json file needs to be edited accordingly: specify the input files and add/remove the parameters for each data source as applicable (a detailed description of the available entries is given under igm/core/defaults). The software can then be run with igm-run igm-config.json. Specifically:

In order to get familiar with the configuration file and the code execution, we provide a demo configuration file, config_file.json, for running a 2 Mb resolution WTC11 population using Hi-C data only; it is found in the demo folder.

A comprehensive configuration file, igm-config_all.json, for running an HFF population with all data types (Hi-C, lamina DamID, SPRITE and 3D HIPMap FISH) is also provided here as a reference/template. Clearly, each user must specify their own input files.

Software demo

Sample files are provided to simulate a Hi-C-only population of WTC11 (spherical nucleus) at 2 Mb resolution, to get familiar with the basics of the code.

The alabtools package is required to easily generate a suitable '.hcs' Hi-C input file that can be fed to IGM. It is a three-step procedure:

Our own preprocessing pipeline is specific to the lab and detailed in the Supporting Information of the Nature Methods paper; we will share it soon (we are still curating the scripts). However, any preprocessing that generates a balanced/filtered/appropriate matrix of contact probabilities at the desired genome model resolution (e.g., 200 kb, 1 Mb, etc.) can be used, according to experience/need.
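As a purely illustrative sketch (this is NOT the Alber lab pipeline, and the function name and threshold are hypothetical), a minimal preprocessing step on a plain NumPy contact matrix could drop low-coverage bins and rescale the entries into [0, 1]:

```python
import numpy as np

# Hypothetical, generic preprocessing sketch: zero out low-coverage bins,
# then naively rescale so the largest entry becomes probability 1.
def simple_preprocess(raw: np.ndarray, min_coverage: float = 1.0) -> np.ndarray:
    m = raw.astype(float).copy()
    coverage = m.sum(axis=1)
    bad = coverage < min_coverage   # bins with too few total contacts
    m[bad, :] = 0.0
    m[:, bad] = 0.0
    if m.max() > 0:
        m /= m.max()                # crude rescaling into [0, 1]
    return m

raw = np.array([[4.0, 2.0, 0.0],
                [2.0, 8.0, 0.0],
                [0.0, 0.0, 0.5]])
prob = simple_preprocess(raw, min_coverage=1.0)  # third bin is masked out
```

Real pipelines would use proper matrix balancing (e.g., ICE/KR) instead of this naive rescaling; the point is only that the output must be a contact probability matrix at the model resolution.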

import alabtools
import numpy
import scipy.sparse

# Read in the .mcool file at a given resolution
m = alabtools.Contactmatrix(ZZZ, resolution = XXX, genome = YYY)

A = m.matrix.toarray()

where ZZZ is the name of the .mcool file (string), XXX is the model resolution in kb (integer), and YYY is the genome segmentation (string); currently alabtools supports the 'hg19' and 'hg38' (human) and 'mm9' (mouse) genome types.

m is an alabtools matrix object. m.matrix is a sparse (SSS format) matrix, which can be converted to a regular NumPy array with m.matrix.toarray().
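The dense-array/sparse-matrix round-trip used in this procedure (dense array → scipy CSR → the (data, indices, indptr) triplet) can be illustrated with plain scipy, independent of alabtools:

```python
import numpy as np
import scipy.sparse

# A small symmetric "contact matrix" for illustration
A = np.array([[0.0, 0.3, 0.0],
              [0.3, 0.0, 0.1],
              [0.0, 0.1, 0.0]])

spm = scipy.sparse.csr_matrix(A)
# CSR stores only the non-zero entries plus row bookkeeping:
#   spm.data    -> non-zero values
#   spm.indices -> their column indices
#   spm.indptr  -> row start offsets into data/indices
# and the triplet reconstructs the matrix exactly:
back = scipy.sparse.csr_matrix((spm.data, spm.indices, spm.indptr)).toarray()
```

This same triplet is what gets handed to the alabtools sparse-matrix constructor in the save step below.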

Now you can apply any preprocessing you see fit to A = m.matrix.toarray() (filtering, matrix balancing, etc.) to prepare the input (see preamble). Once you are satisfied with your preprocessing, the updated (contact probability) NumPy array A can be stored as follows:

# convert NumPy array back to sparse sss_matrix

spm = scipy.sparse.csr_matrix(A)
m.matrix = alabtools.matrix.sss_matrix((spm.data, spm.indices, spm.indptr))

# save matrix to 'hcs' file
m.save('preprocessed_hic.hcs')

The 'preprocessed_hic.hcs' file now has the correct format to be fed to the IGM modeling pipeline.

Installation on MacOS (updated, Sep 2019)

If you are getting this printout, then there is NO actual gcc installed. To circumvent that, the following procedure worked for me: