Closed nictru closed 7 months ago
First step uses the nf-core/RNA-seq and nf-core/ChIP-seq pipelines and stays the same as in the base version of TFPrioritizer (TFPrio) / INSPECT. However, their output is then no longer piped directly into TRAP but first used by our enhancer prediction tools.
Overview of the enhancer prediction pipeline: The first step of the enhancer prediction module is the execution of eHMM (implemented by Leon Schwartz and already in the pipeline). This was executed as the last step in the previous pipeline version, but is now being run in the beginning. In the longer term, a parallel execution together with ChromHMM and a subsequent consensus vote regarding the predicted enhancer regions is planned, but there are still problems with the execution here. This means we skip the consensus vote and continue with only the predicted enhancers of eHMM.
The predicted enhancer regions are output by eHMM as a BED file and serve directly as input for ROSE. We use ROSE to stitch the enhancers contained in the BED and also to predict super-enhancers. The enhancer regions that are output and stitched by ROSE are then used as input for the further TFPrio/INSPECT pipeline. So far, the further pipeline has been run on all open chromatin regions. By limiting this to the enhancers we become much more specific.
Input genome (MM10 in our case, shipped with ROSE or here: /nfs/data/COM2POSE/inspect_leon/rose/annotation
)
The input genome (annotation directory) has to be in the current working directory where ROSE is executed.
GFF file with enhancer regions (output from eHMM, converted with bed2gff script) The eHMM BED output file has to be converted into a GFF file.
./bed2gff.sh <path input BED> <path output BED>
The GFF is then further converted into the ROSE GFF format, which uses different columns:
./reformat_gff.py --input <path input GFF> --output <path output GFF>
Ranking BAM file from ChIP-Seq MED1 data (Example: /nfs/data/Hennighausen/02_chipseq/L1_broadPeak/bwa/mergedLibrary/WT_L1_MED1_R1.mLb.clN.sorted.bam
)
As ROSE needs the 'chrX' notation instead of just 'X' for chromosomes in its input BAM files, we first have to reformat them.
./reformat_bam.sh <path input BAM> <path output BAM>
Control BAM file from ChIP-Seq WCE/Control data (Example: /nfs/data/Hennighausen/02_chipseq/L1/bwa/mergedLibrary/WT_L1_CONTROL_R1.mLb.clN.sorted.bam
)
The reformatting via the reformat_bam.sh
script has to be applied to this file as well.
The scripts bed2gff.sh
and reformat_bam.sh
should be run inside the normal ubuntu
container already used for other tasks in this project.
The reformat_gff.py
script can be run with the tfprio-python
container and requires pandas
and argparse
to be installed.
ROSE itself can be installed via their GitHub, but comes also shipped with its own Docker container: ghcr.io/stjude/rose:latest
Within the container ROSE can be called the following:
ROSE_main.py -g <input genome from 1.> -I <input GFF from 2.> -r <Ranking BAM from 3.> -c <Control BAM from 4.> -o <Output directory> -s 12500 -t 2500
The ROSE directory is included into the PATH variable of the container, so there is no need to call it with the python3
keyword.
The last two parameters are needed for the internal stitching and are hardcoded with those values.
The most important output from ROSE are the stitched enhancers, which can be found in the file *_Stitched_withSuper.bed
. It is used as input in the downstream pipeline and replaces the earlier used open chromatin regions. The asterisk is chosen based on the file name of the input GFF (from 2.) called *.gff
More information about ROSE in the ROSE GitHub repository.
In parallel with ehmm, we execute a second enhancer prediction tool (ChromHMM) and use the results of both of them in a consensus vote. The plan is to only proceed with enhancers that were predicted by both tools.
ChromHMM is java based and runs on any system supporting java 1.6 or later. It can be downloaded as a zip directory from the ChromHMM website (download link version 1.24). ChromHMM uses BAM files of ChIP-Seq data to predict the enhancer positions. Here, it currently might be best to run it on all BAM files we have (the user inputs) and later test if it is possible to run it only on certain BAM files to speed up runtime.
As ROSE needs the 'chrX' notation instead of just 'X' for chromosomes in its input BAM files, we first have to reformat them.
./reformat_bam.sh <path input BAM> <path output BAM>
After reheadering and sorting the BAM files they need to be reindexed. The index is necessary for the execution of ChromHMM. This can either be done by a simple call of samtools index <BAM file>
or with the script index_bam.sh
(I added it to the bin
directory, but feel free to remove it anytime if you prefer the actual command).
./index_bam.sh <path input BAM>
ChromHMM uses a so-called cellmarkfiletable which is a specific format used by the tool to list the input BAMs, their states and corresponding control BAMs. The format is a tsv file and looks like the following:
cell1 mark1 cell1_mark1.bed cell1_control.bed
cell1 mark2 cell1_mark2.bed cell1_control.bed
cell2 mark1 cell2_mark1.bed cell2_control.bed
cell2 mark2 cell2_mark2.bed cell2_control.bed
For the creation of this file, all the BAM files have to be in directories corresponding to the different stages (in our case L1, L10, P6, P13). Those four directories are then in the input directory which is passed to the script. The python script is called the following:
python3 make_cellmarkfiletable.py --input_dir <path to input dir> --output <path to output file> [--markers <path to markers file>] [--control <control name>]
input_dir
hosts the stage directories which are filled with the BAM files.output
is the path to the resulting cellmarkfiletable.markers
is a path to a text file, having one marker ("H3K27ac", "H3K27me3", "H3K4me1", ...) in each line. This is however optional, some markers are already predefined. Markers in the marker file will be added to those predefined markers.control
sets the name of the control marker. This is defaulted to "CONTROL" which fits for our test data.Execution of the first ChromHMM method which preprocesses the BAM files for the enhancer prediction. It is called by:
java -jar ChromHMM.jar BinarizeBam <chromsome length file> <path to input dir> <cellmarkfiletable> <path to output dir>
chromosome length file
is a precomputed file which is shipped with ChromHMM for several model organisms. In our case we use CHROMSIZES/mm10.txt
cellmarkfiletable
is the file we computed in the previous step and lists the BAM filespath to input dir
is the path to the BAM files (Attention: Now all files of all stages in one single directory. I solved that by linking them, maybe we need to discuss how that works with nextflow and adapt the make_cellmarkfiletable.py
script)path to output dir
is the path where the binarized BAMs are storedExecution of the actual enhancer prediction. Called by:
java -jar ChromHMM.jar LearnModel <path to input dir> <path to output dir> <number of states> <assembly>
path to input dir
contains the input which is the output dir of the last steppath to output dir
will contain the output of the predictionnumber of states
is so far only tested with 10, which is recommended in their vignetteassembly
comes precomputed by ChromHMM for several model organisms. In our case "mm10".After execution of this steps we get the full prediction of ChromHMM, which can then be combined with the ehmm output. However, the consensus vote still has to be implemented (will do that in the next days).
Completed via #68
Please add the prepared scripts to the
bin
directory on thenextflow
branch and add a short description how they should be put together in this issue.