This is the open source code for YOLOOP.
Note: The current version of our code is a demo to facilitate the review process. All source materials used to develop our model will be released upon publication, including all code, data and models for pre-training, evaluation and adaptive fine-tuning.
First, download the code from GitHub and move to the working directory.
git clone https://github.com/WangJiuming/YOLOOP.git
cd YOLOOP
Then, install the prerequisite packages from environment.yml, which specifies the packages required to use YOLOOP. We recommend using Anaconda to install the dependencies into a virtual environment.
conda env create --name yoloop --file environment.yml
By default, the environment installs the latest PyTorch with CUDA version 12.1. If your local configuration is different, please install PyTorch following the PyTorch Installation Guide.
After installing all the packages successfully, you may proceed to activate the environment as usual. The name of the virtual environment is "yoloop".
conda activate yoloop
Now, you are ready to use YOLOOP!
YOLOOP performs extremely efficient loop detection across contact maps obtained with various sequencing protocols and from various cell lines. All the datasets (i.e., contact maps and loop annotations) are in the public domain. Their sources and accession numbers are listed in the Supplementary Information of our paper. You are also welcome to use your own datasets!
YOLOOP supports cooler, one of the most commonly used file formats for contact maps. You can find out more in its official documentation. For better performance, we highly recommend using it, since it incurs less memory I/O overhead. If only the hic format is available, you may also check out this very convenient tool to convert from hic to cool.
In the following tutorial, we will use the GM12878 dataset downloaded from the 4DN Portal with accession number 4DNFIXP4QG5B.
mkdir -p data
wget -P ./data https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/d6abea45-b0bb-4154-9854-1d3075b98097/4DNFIXP4QG5B.mcool
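A multi-resolution .mcool file stores one matrix per bin size, and the cooler library addresses a single matrix with a URI of the form `<path>::/resolutions/<bin size>`. The sketch below builds such a URI and, if the cooler package is installed, prints a few basic facts about the map. The helper names `cooler_uri` and `describe` are hypothetical and are not part of YOLOOP, and whether 10000 is among the resolutions stored in the downloaded file is an assumption based on the tutorial command.

```python
import os


def cooler_uri(path, resolution=None):
    """Build the URI that the cooler library uses to address one matrix.

    A single-resolution .cool file is addressed by its path alone.
    A multi-resolution .mcool file needs a '::/resolutions/<bin size>' suffix.
    """
    extension = os.path.splitext(path)[1]
    if extension == ".mcool":
        if resolution is None:
            raise ValueError("an .mcool file needs an explicit resolution")
        return f"{path}::/resolutions/{int(resolution)}"
    if extension == ".cool":
        return path
    raise ValueError(f"unsupported extension: {extension!r}")


def describe(uri):
    """Print basic facts about a contact map (requires the cooler package)."""
    import cooler  # imported lazily so cooler_uri works without it

    clr = cooler.Cooler(uri)
    print("bin size:", clr.binsize)
    print("first chromosomes:", clr.chromnames[:5])
    print("matrix shape:", clr.shape)


if __name__ == "__main__":
    print(cooler_uri("./data/4DNFIXP4QG5B.mcool", 10000))
```

Calling `describe(cooler_uri("./data/4DNFIXP4QG5B.mcool", 10000))` before running detection is a quick way to confirm that the file opens and that the requested resolution exists.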
We have included one model checkpoint at ./models/gm12878_hic_10kb.pt. This model was pre-trained on the GM12878 Hi-C contact map with CTCF ChIA-PET interactions at 10 kb resolution.
After obtaining the model checkpoint, you are ready to perform chromatin loop detection with YOLOOP by calling the detection procedure. A standard call looks like this.
python detect.py --cm ./data/4DNFIXP4QG5B.mcool -r 10000 --model ./models/gm12878_hic_10kb.pt --out ./results
The program detects CUDA devices automatically, and we strongly suggest using CUDA for much better performance. Besides the paths, there are several hyperparameters that you can tune. The complete set of options is as follows.
usage: detect.py [-h] [--cm CM] [-r RESOLUTION] [-b] [-m MODEL] [--out OUT]
                 [-t THRESH] [--device DEVICE]
YOLOOP for efficient chromatin loop detection
optional arguments:
  -h, --help                              show this help message and exit
  --cm CM                                 path to the input contact matrix with .mcool/.cool extension
  -r RESOLUTION, --resolution RESOLUTION  resolution of the contact matrix
  -b, --balance                           whether to use a balanced matrix or not
  -m MODEL, --model MODEL                 YOLOOP model checkpoint to be loaded
  --out OUT                               output directory for saving the prediction results
  -t THRESH, --thresh THRESH              threshold for the confidence score
  --device DEVICE                         device to be used, e.g., cuda, cuda:0, cpu
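To drive the detector from your own scripts, it can help to see the options above expressed as an argument parser. The following is a reconstruction from the help text only, not the actual contents of detect.py. The argument types (`int` for the resolution, `float` for the threshold) and the absence of default values are assumptions.

```python
import argparse


def build_parser():
    """Rebuild the command-line interface described by the help text."""
    parser = argparse.ArgumentParser(
        prog="detect.py",
        description="YOLOOP for efficient chromatin loop detection",
    )
    parser.add_argument(
        "--cm",
        help="path to the input contact matrix with .mcool/.cool extension",
    )
    # Assumption: the resolution is an integer bin size in base pairs.
    parser.add_argument("-r", "--resolution", type=int,
                        help="resolution of the contact matrix")
    parser.add_argument("-b", "--balance", action="store_true",
                        help="whether to use a balanced matrix or not")
    parser.add_argument("-m", "--model",
                        help="YOLOOP model checkpoint to be loaded")
    parser.add_argument("--out",
                        help="output directory for saving the prediction results")
    # Assumption: the confidence threshold is a float.
    parser.add_argument("-t", "--thresh", type=float,
                        help="threshold for the confidence score")
    parser.add_argument("--device",
                        help="device to be used, e.g., cuda, cuda:0, cpu")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args([
        "--cm", "./data/4DNFIXP4QG5B.mcool",
        "-r", "10000",
        "-m", "./models/gm12878_hic_10kb.pt",
        "--out", "./results",
    ])
    print(args)
```

Options that are omitted come back as `None` in this sketch (or `False` for `--balance`); the real script may define its own defaults.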
After the detection is complete, the results will be saved in a .bedpe file in the specified directory. An example of the prediction results is as follows.
chr1 610000 620000 chr1 37880000 37890000 0.760546875
The above example consists of seven columns. The first three columns give the x-coordinate of the loop and the following three give the y-coordinate. The last column shows the confidence score of the prediction.
Reproducible runs for custom scripts can be found under reproducibility/.
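For downstream analysis, the seven-column layout can be read with a few lines of Python. This is a minimal sketch that assumes whitespace-separated columns in the order shown in the example and no header row. The names `Loop` and `read_loops` are hypothetical and are not provided by YOLOOP.

```python
from typing import NamedTuple


class Loop(NamedTuple):
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int
    score: float


def read_loops(lines, min_score=0.0):
    """Parse seven-column lines and keep loops with score >= min_score."""
    loops = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        fields = line.split()  # accepts tabs or spaces
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        c1, s1, e1, c2, s2, e2, score = fields
        loop = Loop(c1, int(s1), int(e1), c2, int(s2), int(e2), float(score))
        if loop.score >= min_score:
            loops.append(loop)
    return loops


if __name__ == "__main__":
    example = ["chr1\t610000\t620000\tchr1\t37880000\t37890000\t0.760546875"]
    for loop in read_loops(example, min_score=0.5):
        # Genomic distance between the two anchors of the loop.
        distance = loop.start2 - loop.start1
        print(loop.chrom1, loop.start1, loop.start2, distance, loop.score)
```

To read a results file, pass an open file handle, for example `read_loops(open(path), min_score=0.5)`, since the function only iterates over lines.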