MIT License
finemo_gpu

FiNeMo (Finding Neural network Motifs) is a GPU-accelerated hit caller that identifies occurrences of TF-MoDISco motifs within contribution scores generated by machine learning models.

Installation

Note: This software is currently in development and will be available on PyPI once mature. For now, we suggest installing it from source.

Installing from Source

Clone the GitHub Repository

git clone https://github.com/austintwang/finemo_gpu.git
cd finemo_gpu

Create a Conda Environment with Dependencies

This step is optional but recommended. Set $ENV_NAME to the name you want for the environment.

conda env create -f environment.yml -n $ENV_NAME
conda activate $ENV_NAME

Install the Python Package

pip install --editable .

Update an Existing Installation

Because the package is installed in editable mode, pulling the latest changes from the GitHub repository is enough to update it.

git pull

If needed, update the conda environment with the latest dependencies.

conda env update -f environment.yml -n $ENV_NAME --prune

Data Inputs

Required:

Recommended:

Usage

FiNeMo includes a command-line utility named finemo. Here, we describe basic usage for each subcommand. For detailed usage information, run finemo <subcommand> -h.

Preprocessing

The following commands transform input contributions and sequences into a compressed .npz file for quick loading. This file contains:

Preprocessing commands do not require a GPU.
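To check what a preprocessing command actually wrote, the archive can be inspected without loading any array data. The following is a minimal sketch, not part of FiNeMo: it relies only on the fact that an .npz file is a zip archive containing one `<name>.npy` member per array. The function name `list_npz_arrays` is hypothetical.

```python
import zipfile
from pathlib import Path


def list_npz_arrays(npz_path):
    """Return the names of the arrays stored in an .npz archive.

    An .npz file is a zip archive with one `<name>.npy` member per array,
    so the names can be read without decompressing the arrays themselves.
    """
    with zipfile.ZipFile(npz_path) as archive:
        return [
            Path(member).stem
            for member in archive.namelist()
            if member.endswith(".npy")
        ]
```

Calling `list_npz_arrays` on the `<out_path>` given to any `finemo extract-regions-*` command returns the stored array names. To work with the arrays themselves, `numpy.load` on the same path is the usual route.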

Shared arguments

finemo extract-regions-bw

Extract sequences and contributions from FASTA and bigWig files.

Note: bigWig files provide only projected contribution scores, so the output supports only analyses based on projected contributions.

Usage: finemo extract-regions-bw -p <peaks> -f <fasta> -b <bigwigs> -o <out_path> [-w <region_width>]

finemo extract-regions-chrombpnet-h5

Extract sequences and contributions from ChromBPNet H5 files.

Usage: finemo extract-regions-chrombpnet-h5 -c <h5s> -o <out_path> [-w <region_width>]

finemo extract-regions-bpnet-h5

Extract sequences and contributions from BPNet H5 files.

Usage: finemo extract-regions-bpnet-h5 -c <h5s> -o <out_path> [-w <region_width>]

finemo extract-regions-modisco-fmt

Extract sequences and contributions from tfmodisco-lite input .npy/.npz files.

Usage: finemo extract-regions-modisco-fmt -s <sequences> -a <attributions> -o <out_path> [-w <region_width>]

Hit Calling

finemo call-hits

Identify hits in input regions using TF-MoDISco CWMs.

Usage: finemo call-hits -r <regions> -m <modisco_h5> -o <out_dir> [-p <peaks>] [-t <cwm_trim_threshold>] [-a <alpha>] [-b <batch_size>] [-J]

Additional notes

Outputs

hits.tsv: The full list of coordinate-sorted hits with the following fields:

hits_unique.tsv: A deduplicated list of hits in the same format as hits.tsv. In cases where peak regions overlap, hits.tsv may list multiple instances of a hit, each linked to a different peak. hits_unique.tsv arbitrarily selects one instance per duplicated hit. This file is generated only if -p/--peaks is specified.
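The deduplication described above can be illustrated with a short sketch. It is not FiNeMo's implementation: it keeps the first row seen for each key, whereas FiNeMo's choice is described only as arbitrary. The column names (`chr`, `start`, `end`, `motif_name`, `strand`, `peak_id`) are assumptions made for illustration, so check the header of your own hits.tsv before reusing them.

```python
import csv


def read_tsv(path):
    """Load a tab-separated file with a header line into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def deduplicate_hits(rows, key_columns):
    """Keep the first row seen for each distinct combination of key_columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Two overlapping peaks report the same hit; only the peak column differs.
# All column names here are assumed, not taken from FiNeMo's output.
example = [
    {"chr": "chr1", "start": "100", "end": "110", "motif_name": "m1", "strand": "+", "peak_id": "0"},
    {"chr": "chr1", "start": "100", "end": "110", "motif_name": "m1", "strand": "+", "peak_id": "1"},
]
unique = deduplicate_hits(example, ("chr", "start", "end", "motif_name", "strand"))
print(len(unique))  # 1
```

Passing the key columns explicitly keeps the sketch usable whatever the real header turns out to be. `read_tsv("hits.tsv")` supplies the rows when working from a file.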

hits.bed: A coordinate-sorted BED file of unique hits, generated only if -p/--peaks is provided. It includes:

peaks_qc.tsv: Per-peak statistics. It includes:

params.json: The parameters used for hit calling.

Output Reporting

finemo report

Generate an HTML report (report.html) visualizing TF-MoDISco seqlet recall and hit distributions. The input regions must have genomic coordinates. Unless -n/--no-recall is set, peaks must exactly match those used during the TF-MoDISco motif discovery process. This command does not use the GPU.

Usage: finemo report -r <regions> -H <hits> -p <peaks> -m <modisco_h5> -o <out_dir> [-W <modisco_region_width>] [-t <cwm_trim_threshold>] [-n] [-s]
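Read together, the usage strings above suggest a three-step workflow: extract regions, call hits, then generate the report. The sketch below assembles those commands from the documented flags only and prints them rather than running them. It is an illustration under stated assumptions: all file names are placeholders, the helper `build_pipeline` is hypothetical, and it assumes that the `<out_path>` of the extraction step is what `-r` expects and that hits.tsv is written into the `<out_dir>` of `call-hits`.

```python
import shlex
from pathlib import Path


def build_pipeline(contrib_h5, modisco_h5, peaks, out_dir):
    """Return the three finemo commands as argument lists, in run order."""
    out_dir = Path(out_dir)
    regions = out_dir / "regions.npz"  # placeholder name for the preprocessed file
    hits = out_dir / "hits.tsv"        # assumed location of the call-hits output
    return [
        ["finemo", "extract-regions-chrombpnet-h5",
         "-c", str(contrib_h5), "-o", str(regions)],
        ["finemo", "call-hits",
         "-r", str(regions), "-m", str(modisco_h5),
         "-o", str(out_dir), "-p", str(peaks)],
        ["finemo", "report",
         "-r", str(regions), "-H", str(hits), "-p", str(peaks),
         "-m", str(modisco_h5), "-o", str(out_dir)],
    ]


# Print each command; to execute one, pass its list to subprocess.run(argv, check=True).
for argv in build_pipeline("contribs.h5", "modisco.h5", "peaks.bed", "finemo_out"):
    print(shlex.join(argv))
```

Keeping each command as a list avoids shell-quoting problems with paths that contain spaces. Run `finemo <subcommand> -h` to confirm the flags against your installed version before executing anything.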