MiraldiLab / maxATAC

Transcription Factor Binding Prediction from ATAC-seq and scATAC-seq with Deep Neural Networks
Apache License 2.0
25 stars 8 forks source link

Improve `maxatac prepare` #107

Closed tacazares closed 1 year ago

tacazares commented 2 years ago

The maxatac prepare function was initially created as a convenience function for filtering, inferring Tn5 sites, and converting cut -site level coverage to min-max normalized bigwig tracks in one step. This function calls on the bash scripts that were used by our snakemake/cwl/bash workflows for ATAC-seq data processing. This will most likely be how most users prepare data as opposed to going through each step individually, so we should think about improving the user experience.

The problem is that it appears to the user that the run completes correctly, despite having encountered some error during running the shell script. We should add more logging information during processing. We should also look into whether we should use python to execute the shell commands as opposed to just running a shell script from python with the commands internally. We could also add code to the shell script to catch problems with execution.

diamremk commented 2 years ago

Hey, I gave maxatac prepare this input:

`#!/bin/bash

BSUB -W 12:00

BSUB -M 200000

BSUB -n 24

BSUB -J kessi_maxATAC_prepare

BSUB -o /data/miraldiLab/team/kessi/out_files/maxATAC_prepare_test_071522.out

BSUB -e /data/miraldiLab/team/kessi/err_files/maxATAC_prepare_test_071522.err

module load anaconda

module load anaconda (I don't remember the exact number but I think it is 3 you can check with module avail)

module load anaconda3/1.0.0

conda activate your env

conda activate YOUR_CONDA_ENV_NAME # This env should have maxatac installed

source activate maxatac_test

No need for the sh infront of maxatac

maxatac prepare \ -i /data/miraldiLab/team/kessi/snakeATAC/output_dir/Th0_48h_r1/alignments/Th0_48h_r1.bam \ -o /data/miraldiLab/team/kessi/maxATAC \ -prefix Th0_48h_r1 \ --blacklist_bed /users/dia6sx/opt/maxatac/data/mm10/mm10_maxatac_blacklist.bed \ --blacklist_bw /users/dia6sx/opt/maxatac/data/mm10/mm10_maxatac_blacklist.bw \ --chrom_sizes /users/dia6sx/opt/maxatac/data/mm10/mm10.chrom.sizes \ -chroms chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19`

`[2022-07-15 16:28:03,370] Input file: /data/miraldiLab/team/kessi/snakeATAC/output_dir/Th0_48h_r1/alignments/Th0_48h_r1.bam Input chromosome sizes file: /users/dia6sx/opt/maxatac/data/mm10/mm10.chrom.sizes Tn5 cut sites will be slopped 20 bps on each side Input blacklist file: /users/dia6sx/opt/maxatac/data/hg38/hg38_maxatac_blacklist.bw Output filename: Th0_48h_r1 Output directory: /data/miraldiLab/team/kessi/maxATAC Using a millions factor of: 20000000 Using 12 threads to run job. [2022-07-15 16:28:03,471] Generate the normalized signal tracks. [2022-07-15 16:28:03,471] Working on a bulk ATAC-seq BAM file Getting the number of reads in the BAM file [2022-07-15 16:28:10,236] Processing BAM to bigwig. Running eduplication fixmate: invalid option -- '@' Usage: samtools fixmate Options: -r Remove unmapped reads and secondary alignments -p Disable FR proper pair check -c Add template cigar ct tag --input-fmt-option OPT[=VAL] Specify a single input file format option in the form of OPTION or OPTION=VALUE -O, --output-fmt FORMAT[,OPT[=VAL]]... Specify output format (SAM, BAM, CRAM) --output-fmt-option OPT[=VAL] Specify a single output file format option in the form of OPTION or OPTION=VALUE --reference FILE Reference sequence FASTA FILE [null]

As elsewhere in samtools, use '-' as the filename for stdin/stdout. The input file must be grouped by read name (e.g. sorted by name). Coordinated sorted input is not accepted. index: invalid option -- '@' Usage: samtools index [-bc] [-m INT] [out.index] Options: -b Generate BAI-format index for BAM files [default] -c Generate CSI-format index for BAM files -m INT Set minimum interval size for CSI indices to 2^INT [14] [main] unrecognized command 'markdup' index: invalid option -- '@' Usage: samtools index [-bc] [-m INT] [out.index] Options: -b Generate BAI-format index for BAM files [default] -c Generate CSI-format index for BAM files -m INT Set minimum interval size for CSI indices to 2^INT [14] rm: cannot remove ‘Th0_48h_r1_fixmate.bam.bai’: No such file or directory [main_samview] random alignment retrieval only works for indexed BAM or CRAM files. index: invalid option -- '@' Usage: samtools index [-bc] [-m INT] [out.index] Options: -b Generate BAI-format index for BAM files [default] -c Generate CSI-format index for BAM files -m INT Set minimum interval size for CSI indices to 2^INT [14] rm: cannot remove ‘Th0_48h_r1_deduped.bam.bai’: No such file or directory bedGraphToBigWig: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory [2022-07-15 16:29:36,562] Min-max normalize signal tracks [2022-07-15 16:29:36,565] Normalization Input bigwig file: /data/miraldiLab/team/kessi/maxATAC/Th0_48h_r1_IS_slop20_RP20M.bw Output filename: /data/miraldiLab/team/kessi/maxATAC/Th0_48h_r1_IS_slop20_RP20M_minmax01.bw Output directory: /data/miraldiLab/team/kessi/maxATAC Using min-max normalization [2022-07-15 16:29:36,583] Calculating stats per chromosome [urlOpen] Couldn't open /data/miraldiLab/team/kessi/maxATAC/Th0_48h_r1_IS_slop20_RP20M.bw for reading [urlOpen] Couldn't open /data/miraldiLab/team/kessi/maxATAC/Th0_48h_r1_IS_slop20_RP20M.bw for reading [pyBwOpen] bw is NULL! Traceback (most recent call last): File "/users/dia6sx/.conda/envs/maxatac_test/bin/maxatac", line 24, in sys.exit(main(sys.argv[1:])) File "/users/dia6sx/.conda/envs/maxatac_test/bin/maxatac", line 20, in main args.func(args) File "/users/dia6sx/.conda/envs/maxatac_test/lib/python3.9/site-packages/maxatac/analyses/prepare.py", line 142, in run_prepare run_normalization(args) File "/users/dia6sx/.conda/envs/maxatac_test/lib/python3.9/site-packages/maxatac/analyses/normalize.py", line 64, in run_normalization min_value, max_value, median, mad, mean_value, std_value = get_genomic_stats(bigwig_path=args.signal, File "/users/dia6sx/.conda/envs/maxatac_test/lib/python3.9/site-packages/maxatac/utilities/normalization_tools.py", line 24, in get_genomic_stats with pyBigWig.open(bigwig_path) as input_bigwig: RuntimeError: Received an error during file opening!`

There seems to be a problem when preparing a bam file from snakeATAC. Note I am able to bypass the prepare step by using the bw generated.