finemo_gpu

FiNeMo (Finding Neural network Motifs) is a GPU-accelerated hit caller for identifying occurrences of TFMoDISCo motifs within contribution scores generated by machine learning models.

Installation

Note: This software is currently in development and will be available on PyPI once mature. For now, we suggest installing it from source.

Installing from Source

Clone the GitHub Repository

git clone https://github.com/austintwang/finemo_gpu.git
cd finemo_gpu

Create a Conda Environment with Dependencies

This step is optional but recommended.

conda env create -f environment.yml -n $ENV_NAME
conda activate $ENV_NAME

Install the Python Package

pip install --editable .

Update an Existing Installation

To update, simply fetch the latest changes from the GitHub repository.

git pull

If needed, update the conda environment with the latest dependencies.

conda env update -f environment.yml -n $ENV_NAME --prune

Data Inputs

Required:

Contribution scores for peak sequences in bigWig format, ChromBPNet H5 format, BPNet H5 format, or tfmodisco-lite input format.
Motif CWMs in tfmodisco-lite H5 output format.

Recommended:

Peak region coordinates in uncompressed ENCODE NarrowPeak format.

Usage

FiNeMo includes a command-line utility named finemo. Here, we describe basic usage for each subcommand. For detailed usage information, run finemo <subcommand> -h.

Preprocessing

The following commands transform input contributions and sequences into a compressed .npz file for quick loading. This file contains:

sequences: A one-hot-encoded sequence array (np.int8) with dimensions (n, 4, w), where n is the number of regions, and w is the width of each region. Bases are ordered as ACGT.
contribs: A contribution score array (np.float16) with dimensions (n, 4, w) for hypothetical scores or (n, w) for projected scores only.
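
As a rough illustration of this layout, the sketch below writes and reloads a toy file with the two arrays described above. The array names, dtypes, shapes, and base order come from the description; the toy sequences, the random scores, and the file name are made up for the example, and real files produced by finemo may differ in details not covered here.

```python
import os
import tempfile

import numpy as np

BASES = "ACGT"  # base order stated in the description above


def one_hot(seq):
    """One-hot encode a DNA string as a (4, w) int8 array; non-ACGT bases stay all-zero."""
    arr = np.zeros((4, len(seq)), dtype=np.int8)
    for pos, base in enumerate(seq.upper()):
        idx = BASES.find(base)
        if idx >= 0:
            arr[idx, pos] = 1
    return arr


# Toy data: n = 2 regions of width w = 8 (real regions are far wider).
sequences = np.stack([one_hot("ACGTACGT"), one_hot("TTGACNAA")])
rng = np.random.default_rng(0)
contribs = rng.normal(size=sequences.shape).astype(np.float16)  # hypothetical scores, (n, 4, w)

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_regions.npz")
np.savez_compressed(path, sequences=sequences, contribs=contribs)

with np.load(path) as data:
    loaded_seqs = data["sequences"]
    loaded_contribs = data["contribs"]

print(loaded_seqs.shape, loaded_seqs.dtype)
print(loaded_contribs.shape, loaded_contribs.dtype)
```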

Preprocessing commands do not require a GPU.

Shared arguments

-o/--out-path: The path to the output .npz file.
-w/--region-width: The width of the input region centered around each peak summit. Default is 1000.
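
For commands that take a peaks file, the region width can be pictured as a window centered on each peak summit. The sketch below is an assumption about how such a window might be derived from a NarrowPeak line (column 10 holds the summit offset relative to the peak start); the helper name summit_window is hypothetical, and finemo's exact rounding and edge handling may differ.

```python
def summit_window(narrowpeak_line, region_width=1000):
    """Return (chrom, start, end) for a window centered on the peak summit.

    Coordinates are zero-indexed with an exclusive end. Assumes the summit
    offset in column 10 is present (not -1).
    """
    fields = narrowpeak_line.rstrip("\n").split("\t")
    chrom = fields[0]
    peak_start = int(fields[1])
    summit_offset = int(fields[9])  # column 10: summit offset from peak start
    summit = peak_start + summit_offset
    start = summit - region_width // 2
    return chrom, start, start + region_width


# Toy NarrowPeak line (values are illustrative only).
line = "chr1\t1000\t1500\tpeak1\t0\t.\t5.0\t-1\t-1\t250"
window = summit_window(line, region_width=1000)
print(window)
```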

`finemo extract-regions-bw`

Extract sequences and contributions from FASTA and bigWig files.

Note: bigWig files provide only projected contribution scores, so the resulting output supports only analyses based on projected contributions.

Usage: finemo extract-regions-bw -p <peaks> -f <fasta> -b <bigwigs> -o <out_path> [-w <region_width>]

-p/--peaks: A peak regions file in ENCODE NarrowPeak format.
-f/--fasta: A genome FASTA file. If an .fai index file doesn't exist in the same directory, it will be created.
-b/--bigwigs: One or more bigWig files of contribution scores, with paths delimited by whitespace. Scores are averaged across files.
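
Averaging across files can be thought of as an element-wise mean over per-file score arrays of identical shape. The following is a minimal sketch of that idea with toy projected scores of shape (n, w); the function name average_tracks is hypothetical and this is not taken from finemo's implementation.

```python
import numpy as np


def average_tracks(tracks):
    """Element-wise mean of equally shaped score arrays, one array per input file."""
    stacked = np.stack([np.asarray(t, dtype=np.float32) for t in tracks])
    return stacked.mean(axis=0)


# Toy projected scores from two files, each with n = 2 regions of width w = 3.
rep1 = np.array([[0.0, 0.2, 0.4], [1.0, 1.0, 1.0]])
rep2 = np.array([[0.2, 0.4, 0.0], [3.0, 1.0, 0.0]])
mean_scores = average_tracks([rep1, rep2])
print(mean_scores)
```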

`finemo extract-regions-chrombpnet-h5`

Extract sequences and contributions from ChromBPNet H5 files.

Usage: finemo extract-regions-chrombpnet-h5 -c <h5s> -o <out_path> [-w <region_width>]

-c/--h5s: One or more H5 files of contribution scores, with paths delimited by whitespace. Scores are averaged across files.

`finemo extract-regions-bpnet-h5`

Extract sequences and contributions from BPNet H5 files.

Usage: finemo extract-regions-bpnet-h5 -c <h5s> -o <out_path> [-w <region_width>]

-c/--h5s: One or more H5 files of contribution scores, with paths delimited by whitespace. Scores are averaged across files.

`finemo extract-regions-modisco-fmt`

Extract sequences and contributions from tfmodisco-lite input .npy/.npz files.

Usage: finemo extract-regions-modisco-fmt -s <sequences> -a <attributions> -o <out_path> [-w <region_width>]

-s/--sequences: A .npy or .npz file containing one-hot encoded sequences.
-a/--attributions: One or more .npy or .npz files of hypothetical contribution scores, with paths delimited by whitespace. Scores are averaged across files.
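
The sketch below shows one way to prepare toy inputs for this command. It assumes both arrays have shape (n, 4, w) with bases ordered ACGT, and the file names are hypothetical. It also shows the usual relationship between hypothetical and projected scores: the projected score at a position is the hypothetical score of the base actually present.

```python
import os
import tempfile

import numpy as np

n, w = 3, 10
rng = np.random.default_rng(1)

# Random toy sequences, one-hot encoded as (n, 4, w).
base_idx = rng.integers(0, 4, size=(n, w))
sequences = np.zeros((n, 4, w), dtype=np.int8)
sequences[np.arange(n)[:, None], base_idx, np.arange(w)[None, :]] = 1

# Toy hypothetical contribution scores with the same shape.
hyp = rng.normal(size=(n, 4, w)).astype(np.float32)

out_dir = tempfile.mkdtemp()
seq_path = os.path.join(out_dir, "sequences.npy")        # hypothetical name
attr_path = os.path.join(out_dir, "attributions.npy")    # hypothetical name
np.save(seq_path, sequences)
np.save(attr_path, hyp)

# Projected scores: keep only the score of the observed base at each position.
projected = (hyp * sequences).sum(axis=1)
print(projected.shape)
```

The two saved paths would then be passed to -s/--sequences and -a/--attributions, respectively.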

Hit Calling

`finemo call-hits`

Identify hits in input regions using TFMoDISCo CWMs.

Usage: finemo call-hits -r <regions> -m <modisco_h5> -o <out_dir> [-p <peaks>] [-t <cwm_trim_threshold>] [-a <alpha>] [-b <batch_size>] [-J]

-r/--regions: A .npz file of input sequences and contributions. Created with a finemo extract-regions-* command.
-m/--modisco-h5: A tfmodisco-lite output H5 file of motif patterns.
-o/--out-dir: The path to the output directory.
-p/--peaks: A peak regions file in ENCODE NarrowPeak format, exactly matching the regions specified in --regions.
-t/--cwm-trim-threshold: The threshold to determine motif start and end positions within the full CWMs. Default is 0.3.
-a/--alpha: The L1 regularization weight. Default is 0.7.
-b/--batch-size: The batch size used for optimization. Default is 2000.
-J/--compile: Enable JIT compilation for faster execution. This option may not work on older GPUs.
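
When scripting runs from Python, it can help to assemble the command line programmatically. The helper below is a hypothetical convenience wrapper, not part of finemo; it only uses the flags listed above, and the file names in the example are placeholders.

```python
import shlex


def build_call_hits_cmd(regions, modisco_h5, out_dir, peaks=None,
                        cwm_trim_threshold=None, alpha=None,
                        batch_size=None, compile_jit=False):
    """Assemble an argument list for `finemo call-hits` from the documented flags."""
    cmd = ["finemo", "call-hits", "-r", regions, "-m", modisco_h5, "-o", out_dir]
    if peaks is not None:
        cmd += ["-p", peaks]
    if cwm_trim_threshold is not None:
        cmd += ["-t", str(cwm_trim_threshold)]
    if alpha is not None:
        cmd += ["-a", str(alpha)]
    if batch_size is not None:
        cmd += ["-b", str(batch_size)]
    if compile_jit:
        cmd.append("-J")
    return cmd


# Placeholder paths; omitted options fall back to finemo's defaults.
cmd = build_call_hits_cmd("regions.npz", "modisco.h5", "hits_out",
                          peaks="peaks.narrowPeak", alpha=0.6)
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```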

Additional notes

The -a/--alpha parameter controls the sensitivity of the hit-calling algorithm, with higher values resulting in fewer but more confident hits. This parameter represents the minimum correlation between a query contribution score window and a CWM to be considered a hit. The default value of 0.7 typically works well for chromatin accessibility data. ChIP-seq data may require a lower value (e.g. 0.6).
The -t/--cwm-trim-threshold parameter sets the maximum relative contribution score in trimmed-out CWM flanks. If you find that motif flanks are being trimmed too aggressively, consider lowering this value. However, a too-low value may result in closely-spaced motif instances being missed.
Set -b/--batch-size to the largest value your GPU memory can accommodate. If you encounter GPU out-of-memory errors, try lowering this value.
Legacy TFMoDISCo H5 files can be updated to the newer TFMoDISCo-lite format with the modisco convert command found in the tfmodisco-lite package.
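
The effect of -t/--cwm-trim-threshold can be illustrated with a small sketch. It assumes trimming keeps the span between the first and last positions whose summed absolute contribution reaches the threshold times the maximum over positions; the function name trim_cwm is hypothetical and finemo's exact rule may differ.

```python
import numpy as np


def trim_cwm(cwm, trim_threshold=0.3):
    """Return (start, end), end exclusive, of the retained span of a (4, L) CWM."""
    per_pos = np.abs(cwm).sum(axis=0)          # total magnitude at each position
    cutoff = trim_threshold * per_pos.max()    # relative cutoff
    keep = np.flatnonzero(per_pos >= cutoff)   # positions at or above the cutoff
    return int(keep[0]), int(keep[-1]) + 1


# Toy CWM of length 8 with a strong core and weak flanks (all weight on one base).
magnitudes = np.array([0.05, 0.15, 0.5, 1.0, 0.8, 0.4, 0.15, 0.02])
cwm = np.zeros((4, 8))
cwm[0] = magnitudes

default_span = trim_cwm(cwm, trim_threshold=0.3)
wider_span = trim_cwm(cwm, trim_threshold=0.1)
print(default_span, wider_span)
```

In this toy case, lowering the threshold retains more of the flanks, producing a wider trimmed motif.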

Outputs

hits.tsv: The full list of coordinate-sorted hits with the following fields:

chr: Chromosome name. NA if peak coordinates (-p/--peaks) are not provided.
start: Hit start coordinate from trimmed CWM, zero-indexed. Absolute if peak coordinates are provided, otherwise relative to the input region.
end: Hit end coordinate from trimmed CWM, zero-indexed, exclusive. Absolute if peak coordinates are provided, otherwise relative to the input region.
start_untrimmed: Hit start coordinate from untrimmed CWM, zero-indexed. Absolute if peak coordinates are provided, otherwise relative to the input region.
end_untrimmed: Hit end coordinate from untrimmed CWM, zero-indexed, exclusive. Absolute if peak coordinates are provided, otherwise relative to the input region.
motif_name: The hit motif name as specified in the provided tfmodisco H5 file.
hit_coefficient: The regression coefficient for the hit, normalized per peak region.
hit_coefficient_global: The regression coefficient for the hit, scaled by the overall importance of the region. This is the primary hit score.
hit_correlation: The correlation between the untrimmed CWM and the contribution score of the motif hit.
hit_importance: The total absolute contribution score within the motif hit.
strand: The orientation of the hit (+ or -).
peak_name: The name of the peak region containing the hit, taken from the name field of the input peak data. NA if -p/--peaks is not provided.
peak_id: The numerical index of the peak region containing the hit.
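
A minimal sketch of reading this table with the standard library follows. It assumes the file is tab-delimited with a header row that matches the field names above; the toy rows, motif names, and the score cutoff of 1.0 are invented for illustration only.

```python
import csv
import io
from collections import Counter

FIELDS = ["chr", "start", "end", "start_untrimmed", "end_untrimmed", "motif_name",
          "hit_coefficient", "hit_coefficient_global", "hit_correlation",
          "hit_importance", "strand", "peak_name", "peak_id"]
INT_FIELDS = ("start", "end", "start_untrimmed", "end_untrimmed", "peak_id")
FLOAT_FIELDS = ("hit_coefficient", "hit_coefficient_global",
                "hit_correlation", "hit_importance")

# Toy rows standing in for a real hits.tsv (all values are made up).
toy_rows = [
    ("chr1", 100, 110, 95, 125, "pos_patterns.pattern_0", 0.9, 1.8, 0.92, 3.1, "+", "peak_a", 0),
    ("chr1", 300, 312, 290, 320, "pos_patterns.pattern_1", 0.2, 0.1, 0.71, 0.4, "-", "peak_a", 0),
    ("chr2", 50, 60, 45, 75, "pos_patterns.pattern_0", 0.7, 1.2, 0.85, 2.2, "+", "peak_b", 1),
]
toy_tsv = "\n".join(["\t".join(FIELDS)] +
                    ["\t".join(map(str, row)) for row in toy_rows]) + "\n"


def load_hits(handle):
    """Parse a hits table into a list of dicts with numeric fields converted."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in INT_FIELDS:
            row[key] = int(row[key])
        for key in FLOAT_FIELDS:
            row[key] = float(row[key])
        hits.append(row)
    return hits


hits = load_hits(io.StringIO(toy_tsv))
# Filter on the primary hit score using an arbitrary illustrative cutoff.
strong = [h for h in hits if h["hit_coefficient_global"] >= 1.0]
per_motif = Counter(h["motif_name"] for h in strong)
print(per_motif)
```

For a real file, replace the io.StringIO object with open("hits.tsv", newline="").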

hits_unique.tsv: A deduplicated list of hits in the same format as hits.tsv. In cases where peak regions overlap, hits.tsv may list multiple instances of a hit, each linked to a different peak. hits_unique.tsv arbitrarily selects one instance per duplicated hit. This file is generated only if -p/--peaks is specified.
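
The deduplication can be pictured as follows. This sketch assumes two hits are duplicates when they share chromosome, coordinates, motif name, and strand, and it keeps the first instance encountered; the key used by finemo and which instance it retains are not specified here, so treat both as assumptions.

```python
def deduplicate_hits(hits):
    """Keep one hit per (chr, start, end, motif_name, strand) combination."""
    seen = set()
    unique = []
    for hit in hits:
        key = (hit["chr"], hit["start"], hit["end"], hit["motif_name"], hit["strand"])
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique


# Toy hits: the first two are the same hit reported for two overlapping peaks.
toy_hits = [
    {"chr": "chr1", "start": 100, "end": 110, "motif_name": "m0", "strand": "+", "peak_id": 0},
    {"chr": "chr1", "start": 100, "end": 110, "motif_name": "m0", "strand": "+", "peak_id": 1},
    {"chr": "chr1", "start": 100, "end": 110, "motif_name": "m0", "strand": "-", "peak_id": 1},
]
unique_hits = deduplicate_hits(toy_hits)
print(len(unique_hits))
```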

hits.bed: A coordinate-sorted BED file of unique hits, generated only if -p/--peaks is provided. It includes:

chr: Chromosome name.
start: Hit start coordinate from trimmed CWM, zero-indexed.
end: Hit end coordinate from trimmed CWM, zero-indexed, exclusive.
motif_name: Hit motif name, taken from the provided tfmodisco H5 file.
score: The hit_correlation score, multiplied by 1000 and cast to an integer.
strand: The orientation of the hit (+ or -).
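
As a sketch of how one such BED line relates to a hit record, the helper below formats the six fields listed above. It assumes the integer cast truncates toward zero, as Python's int() does; the function name and the toy hit are hypothetical.

```python
def hit_to_bed_line(hit):
    """Format a hit dict as a six-column, tab-delimited BED line."""
    score = int(hit["hit_correlation"] * 1000)  # assumed truncating cast
    columns = [hit["chr"], hit["start"], hit["end"],
               hit["motif_name"], score, hit["strand"]]
    return "\t".join(str(col) for col in columns)


toy_hit = {"chr": "chr1", "start": 100, "end": 110, "motif_name": "m0",
           "hit_correlation": 0.875, "strand": "+"}
bed_line = hit_to_bed_line(toy_hit)
print(bed_line)
```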

peaks_qc.tsv: Per-peak statistics. It includes:

peak_id: The numerical index of the peak region.
nll: The final regression negative log likelihood, proportional to the mean squared error (MSE).
dual_gap: The final duality gap.
num_steps: The number of optimization steps taken.
step_size: The optimization step size.
global_scale: The peak-level scaling factor, used to normalize by overall importance.
chr: The chromosome name. Omitted if -p/--peaks is not provided.
peak_region_start: The start coordinate of the peak region, zero-indexed. Omitted if -p/--peaks is not provided.
peak_name: The name of the peak region, derived from the input peak data's name field. Omitted if -p/--peaks is not provided.
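
One possible use of this table is to flag peaks whose optimization ended with a comparatively large duality gap. The sketch below assumes a tab-delimited file with a header row matching the field names above; the toy values and the tolerance GAP_TOL are arbitrary choices for illustration, not recommended settings.

```python
import csv
import io

# Toy stand-in for peaks_qc.tsv without peak coordinates (values are made up).
toy_qc = (
    "peak_id\tnll\tdual_gap\tnum_steps\tstep_size\tglobal_scale\n"
    "0\t0.12\t0.0001\t350\t0.5\t2.3\n"
    "1\t0.40\t0.02\t10000\t0.1\t0.7\n"
)

GAP_TOL = 1e-3  # hypothetical tolerance, chosen only for this example

rows = list(csv.DictReader(io.StringIO(toy_qc), delimiter="\t"))
flagged = [int(r["peak_id"]) for r in rows if float(r["dual_gap"]) > GAP_TOL]
print(flagged)
```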

params.json: The parameters used for hit calling.

Output reporting

`finemo report`

Generate an HTML report (report.html) visualizing TF-MoDISCo seqlet recall and hit distributions. The input regions must have genomic coordinates. If -n/--no-recall is not set, peaks must exactly match those used during the TF-MoDISCo motif discovery process. This command does not utilize the GPU.

Usage: finemo report -r <regions> -H <hits> -p <peaks> -m <modisco_h5> -o <out_dir> [-W <modisco_region_width>] [-t <cwm_trim_threshold>] [-n] [-s]

-r/--regions: A .npz file containing input sequences and contributions. Must be the same as those used for motif discovery and hit calling.
-H/--hits: The hits.tsv output file generated by the finemo call-hits command on the regions specified in --regions.
-p/--peaks: A file of peak regions in ENCODE NarrowPeak format, exactly matching the regions specified in --regions.
-m/--modisco-h5: The tfmodisco-lite output H5 file of motif patterns. Must be the same as that used for hit calling unless --no-recall is set.
-o/--out-dir: The path to the output directory.
-W/--modisco-region-width: The width of the region around each peak summit used by tfmodisco-lite. Default is 400.
-t/--cwm-trim-threshold: The threshold to determine motif start and end positions within the full CWMs. This should match the value used in finemo call-hits. Default is 0.3.
-n/--no-recall: Do not compute motif recall metrics. Default is False.
-s/--no-seqlets: Do not generate seqlet visualizations. Must be set in conjunction with --no-recall. Default is False.
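
The constraint that --no-seqlets must accompany --no-recall can be checked before launching the command. The helper below is a hypothetical wrapper, not part of finemo, that builds the argument list from the flags listed above and rejects the invalid combination; the file names are placeholders.

```python
def build_report_cmd(regions, hits, peaks, modisco_h5, out_dir,
                     modisco_region_width=None, cwm_trim_threshold=None,
                     no_recall=False, no_seqlets=False):
    """Assemble an argument list for `finemo report`, validating flag combinations."""
    if no_seqlets and not no_recall:
        raise ValueError("--no-seqlets must be set together with --no-recall")
    cmd = ["finemo", "report", "-r", regions, "-H", hits, "-p", peaks,
           "-m", modisco_h5, "-o", out_dir]
    if modisco_region_width is not None:
        cmd += ["-W", str(modisco_region_width)]
    if cwm_trim_threshold is not None:
        cmd += ["-t", str(cwm_trim_threshold)]
    if no_recall:
        cmd.append("-n")
    if no_seqlets:
        cmd.append("-s")
    return cmd


report_cmd = build_report_cmd("regions.npz", "hits_out/hits.tsv", "peaks.narrowPeak",
                              "modisco.h5", "report_out", no_recall=True, no_seqlets=True)
print(" ".join(report_cmd))
# To execute: subprocess.run(report_cmd, check=True)
```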
