daisybio / TF-Prioritizer

Bioinformatics pipeline to identify differentially active transcription factors between conditions using expression and epigenetic data
GNU General Public License v3.0
13 stars 0 forks source link

Add scripts for enhancer processing #63

Closed nictru closed 7 months ago

nictru commented 11 months ago

Please add the prepared scripts to the bin directory on the nextflow branch and add a short description how they should be put together in this issue.

LeonHafner commented 11 months ago

Current overview of the INSPECT pipeline:

Screenshot 2023-09-27 at 09 53 56 First step uses the nf-core/RNA-seq and nf-core/ChIP-seq pipelines and stays the same as in the base version of TFPrioritizer (TFPrio) / INSPECT. However, their output is then no longer piped directly into TRAP but first used by our enhancer prediction tools.

Overview of the enhancer prediction pipeline: The first step of the enhancer prediction module is the execution of eHMM (implemented by Leon Schwartz and already in the pipeline). This was executed as the last step in the previous pipeline version, but is now being run in the beginning. In the longer term, a parallel execution together with ChromHMM and a subsequent consensus vote regarding the predicted enhancer regions is planned, but there are still problems with the execution here. This means we skip the consensus vote and continue with only the predicted enhancers of eHMM.

The predicted enhancer regions are output by eHMM as a BED file and serve directly as input for ROSE. We use ROSE to stitch the enhancers contained in the BED and also to predict super-enhancers. The enhancer regions that are output and stitched by ROSE are then used as input for the further TFPrio/INSPECT pipeline. So far, the further pipeline has been run on all open chromatin regions. By limiting this to the enhancers we become much more specific.

ROSE uses the following data as input:

  1. Input genome (MM10 in our case, shipped with ROSE or here: /nfs/data/COM2POSE/inspect_leon/rose/annotation) The input genome (annotation directory) has to be in the current working directory where ROSE is executed.

  2. GFF file with enhancer regions (output from eHMM, converted with bed2gff script) The eHMM BED output file has to be converted into a GFF file.

    ./bed2gff.sh <path input BED> <path output BED>

    The GFF is then further converted into the ROSE GFF format, which uses different columns:

    ./reformat_gff.py --input <path input GFF> --output <path output GFF> 
  3. Ranking BAM file from ChIP-Seq MED1 data (Example: /nfs/data/Hennighausen/02_chipseq/L1_broadPeak/bwa/mergedLibrary/WT_L1_MED1_R1.mLb.clN.sorted.bam) As ROSE needs the 'chrX' notation instead of just 'X' for chromosomes in its input BAM files, we first have to reformat them.

    ./reformat_bam.sh <path input BAM> <path output BAM>
  4. Control BAM file from ChIP-Seq WCE/Control data (Example: /nfs/data/Hennighausen/02_chipseq/L1/bwa/mergedLibrary/WT_L1_CONTROL_R1.mLb.clN.sorted.bam) The reformatting via the reformat_bam.shscript has to be applied to this file as well.

The scripts bed2gff.sh and reformat_bam.sh should be run inside the normal ubuntu container already used for other tasks in this project. The reformat_gff.py script can be run with the tfprio-python container and requires pandas and argparse to be installed.

ROSE itself can be installed via their GitHub, but comes also shipped with its own Docker container: ghcr.io/stjude/rose:latest

Within the container ROSE can be called the following:

ROSE_main.py -g <input genome from 1.> -I <input GFF from 2.> -r <Ranking BAM from 3.> -c <Control BAM from 4.> -o <Output directory> -s 12500 -t 2500

The ROSE directory is included into the PATH variable of the container, so there is no need to call it with the python3 keyword. The last two parameters are needed for the internal stitching and are hardcoded with those values.

The most important output from ROSE are the stitched enhancers, which can be found in the file *_Stitched_withSuper.bed. It is used as input in the downstream pipeline and replaces the earlier used open chromatin regions. The asterisk is chosen based on the file name of the input GFF (from 2.) called *.gff

More information about ROSE in the ROSE GitHub repository.

LeonHafner commented 10 months ago

In parallel with ehmm, we execute a second enhancer prediction tool (ChromHMM) and use the results of both of them in a consensus vote. The plan is to only proceed with enhancers that were predicted by both tools.

ChromHMM

ChromHMM is java based and runs on any system supporting java 1.6 or later. It can be downloaded as a zip directory from the ChromHMM website (download link version 1.24). ChromHMM uses BAM files of ChIP-Seq data to predict the enhancer positions. Here, it currently might be best to run it on all BAM files we have (the user inputs) and later test if it is possible to run it only on certain BAM files to speed up runtime.

  1. Bam files might have the wrong chromosome format and need to be converted (reheadered) first. This uses the same scripts as the reheadering for the ROSE algorithm. See here:

As ROSE needs the 'chrX' notation instead of just 'X' for chromosomes in its input BAM files, we first have to reformat them.

./reformat_bam.sh <path input BAM> <path output BAM>
  1. After reheadering and sorting the BAM files they need to be reindexed. The index is necessary for the execution of ChromHMM. This can either be done by a simple call of samtools index <BAM file> or with the script index_bam.sh (I added it to the bin directory, but feel free to remove it anytime if you prefer the actual command).

    ./index_bam.sh <path input BAM>
  2. ChromHMM uses a so-called cellmarkfiletable which is a specific format used by the tool to list the input BAMs, their states and corresponding control BAMs. The format is a tsv file and looks like the following:

    cell1   mark1   cell1_mark1.bed cell1_control.bed
    cell1   mark2   cell1_mark2.bed cell1_control.bed
    cell2   mark1   cell2_mark1.bed cell2_control.bed
    cell2   mark2   cell2_mark2.bed cell2_control.bed

    For the creation of this file, all the BAM files have to be in directories corresponding to the different stages (in our case L1, L10, P6, P13). Those four directories are then in the input directory which is passed to the script. The python script is called the following:

    python3 make_cellmarkfiletable.py --input_dir <path to input dir> --output <path to output file> [--markers <path to markers file>] [--control <control name>]
    • input_dir hosts the stage directories which are filled with the BAM files.
    • output is the path to the resulting cellmarkfiletable.
    • markers is a path to a text file, having one marker ("H3K27ac", "H3K27me3", "H3K4me1", ...) in each line. This is however optional, some markers are already predefined. Markers in the marker file will be added to those predefined markers.
    • control sets the name of the control marker. This is defaulted to "CONTROL" which fits for our test data.
  3. Execution of the first ChromHMM method which preprocesses the BAM files for the enhancer prediction. It is called by:

    java -jar ChromHMM.jar BinarizeBam <chromsome length file> <path to input dir> <cellmarkfiletable> <path to output dir>
    • chromosome length file is a precomputed file which is shipped with ChromHMM for several model organisms. In our case we use CHROMSIZES/mm10.txt
    • cellmarkfiletable is the file we computed in the previous step and lists the BAM files
    • path to input dir is the path to the BAM files (Attention: Now all files of all stages in one single directory. I solved that by linking them, maybe we need to discuss how that works with nextflow and adapt the make_cellmarkfiletable.py script)
    • path to output dir is the path where the binarized BAMs are stored
  4. Execution of the actual enhancer prediction. Called by:

    java -jar ChromHMM.jar LearnModel <path to input dir> <path to output dir> <number of states> <assembly>
    • path to input dir contains the input which is the output dir of the last step
    • path to output dir will contain the output of the prediction
    • number of states is so far only tested with 10, which is recommended in their vignette
    • assembly comes precomputed by ChromHMM for several model organisms. In our case "mm10".

After execution of this steps we get the full prediction of ChromHMM, which can then be combined with the ehmm output. However, the consensus vote still has to be implemented (will do that in the next days).

nictru commented 7 months ago

Completed via #68