cbg-ethz / PredictHaplo

This software aims at reconstructing haplotypes from next-generation sequencing data.
GNU General Public License v3.0

Indefinite Run-time #33

Open Al313 opened 8 months ago

Al313 commented 8 months ago

Dear developers of PredictHaplo,

I have been trying to use your tool on a set of NGS samples from an HIV-1 long-term evolution experiment. The tool finishes in a reasonable time for earlier samples in the experiment (where there is evidently less variation), but it never finishes for samples from later in the experiment. My current guess is that the higher variation is causing the problem, though I am not sure. In the log output, the program stalls at some point without reporting any error. I let it run for 6 days with 25 GB of RAM on an HPC cluster, but to no avail.

Below I share the config file and the SLURM script that I use. An example SAM file and the reference sequence are available here: https://drive.google.com/drive/folders/1A-JbndsDgAelVpiPOLB4glOehNI9wlNg?usp=sharing. Please let me know if you need further information to examine this issue. I'd appreciate any thoughts on what is causing the indefinite runtime and how I can address it, e.g., by changing parameters in the config file or by altering the input file (downsampling).
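For the downsampling option, a minimal sketch is given here. It assumes `awk` is available; `downsample_sam` is a hypothetical helper and not part of PredictHaplo. It keeps all header lines and makes one random keep/drop decision per read name, so both mates of a pair are kept or dropped together. Whether downsampling actually shortens the runtime is untested.

```shell
#!/bin/bash
# Hypothetical helper (not part of PredictHaplo): keep roughly a fraction
# of the read names in a SAM file; records sharing a name stay together.
# Usage: downsample_sam IN.sam OUT.sam FRACTION [SEED]
downsample_sam() {
    local in_sam=$1 out_sam=$2 frac=$3 seed=${4:-42}
    awk -v frac="$frac" -v seed="$seed" '
        BEGIN { FS = "\t"; srand(seed) }
        /^@/ { print; next }    # SAM header lines are always kept
        {
            # Decide once per read name (column 1, QNAME) so that
            # both mates of a pair share the same decision.
            if (!($1 in keep)) keep[$1] = (rand() < frac)
            if (keep[$1]) print
        }
    ' "$in_sam" > "$out_sam"
}
```

For example, `downsample_sam input.sam input.10pct.sam 0.1` would keep about 10% of the read names (the file names are placeholders). The fixed default seed makes repeated runs on the same `awk` implementation select the same reads.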

Thanks in advance, Ali Movasati

config file:

% configuration file for the HIVhaplotyper
% prefix
./test-my/13_550_
% filename of reference sequence (FASTA)
/home/amovas/data/genome-evo-proj/data/reference/plasmid/hiv_plasmid_ref_genome.fasta
% do_visualize (1 = true, 0 = false)
1
% filname of the aligned reads (sam format)
13MT2EXPIIIVP550seq13072023_S4_L001_R1_001.sam
% have_true_haplotypes (1 = true, 0 = false)
0
% filname of the true haplotypes (MSA in FASTA format) (fill in any dummy filename if there is no "true" haplotypes)
dummy
% do_local_analysis (1 = true, 0 = false) (must be 1 in the first run)
1
% max_reads_in_window;
10000
% entropy_threshold
4e-2
%reconstruction_start
454
%reconstruction_stop
9626
%min_mapping_qual
30
%min_readlength
220
%max_gap_fraction (relative to alignment length)
0.05
%min_align_score_fraction (relative to read length)
0.35
%alpha_MN_local (prior parameter for multinomial tables over the nucleotides)
25
%min_overlap_factor (reads must have an overlap with the local reconstruction window of at least this factor times the window size)
0.85
%local_window_size_factor (size of local reconstruction window relative to the median of the read lengths)
0.7
% max number of clusters (in the truncated Dirichlet process)
25
% MCMC iterations
501
% include deletions (0 = no, 1 = yes)
1
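To get a rough idea of how many reads the config thresholds above would let through, something like the following sketch could be used. It is only an approximation: `count_passing_reads` is a hypothetical helper, it uses the MAPQ column and the length of the SEQ column as stand-ins for `min_mapping_qual` (30) and `min_readlength` (220), and it does not reproduce PredictHaplo's own filtering logic.

```shell
#!/bin/bash
# Hypothetical helper (not part of PredictHaplo): count SAM records whose
# MAPQ (column 5) and SEQ length (column 10) meet the given thresholds.
# Usage: count_passing_reads FILE.sam [MIN_MAPQ] [MIN_LEN]
count_passing_reads() {
    local sam=$1 min_mapq=${2:-30} min_len=${3:-220}
    awk -v q="$min_mapq" -v len="$min_len" '
        BEGIN { FS = "\t" }
        /^@/ { next }           # skip SAM header lines
        {
            total++
            if ($5 >= q && length($10) >= len) pass++
        }
        END { printf "total=%d passing=%d\n", total, pass }
    ' "$sam"
}
```

Comparing the counts for an early and a late sample might show whether the later samples simply contain many more usable reads, which would be one thing to rule out before changing other parameters.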

slurm script:

```
#!/bin/bash

timestamp=$(date +%F_%T)

job_dir="/home/amovas/scratch/.slurm/jobs/${timestamp}"
if [ ! -d "${job_dir}" ]; then mkdir -p "${job_dir}"; fi
job_file="${job_dir}/snakemake.job"

output_dir="/home/amovas/scratch/.slurm/outs/${timestamp}"
if [ ! -d "${output_dir}" ]; then mkdir -p "${output_dir}"; fi

echo "#!/bin/bash
#SBATCH -J test_phaplo
#SBATCH -c 1
#SBATCH --time=8-24:00:00
#SBATCH --mem=25G
#SBATCH --output /home/amovas/scratch/.slurm/outs/${timestamp}/%j.out

./PredictHaplo-Paired config_my
" > "${job_file}"

sbatch "${job_file}"
```

LaraFuhrmann commented 8 months ago

Hi Ali,

Which version of the tool are you running? Since you are using a config file, I guess it is not the most recent one.

One thing you could try is the most recent version from this GitHub repository. I know that memory allocation problems have been addressed in this version, but I am not sure whether this will solve your problem.

Here is how you can install it:

  1. Create a conda environment with the following packages: cxx-compiler = 1.4.1, make = 4.3, cmake = 3.22.1, liblapack = 3.9.0, gtest = 1.11.0
  2. Then run git clone https://github.com/cbg-ethz/PredictHaplo.git
  3. Go into the directory PredictHaplo
  4. Run cmake --build build
  5. The executable is then in PredictHaplo/build/predicthaplo
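The steps above can be sketched as a shell function. Several details are assumptions and not taken from this thread: the environment name `predicthaplo`, the `conda-forge` channel, the use of `conda run`, and the explicit configure step (`cmake -S . -B build`) before the build, which CMake normally needs when no `build` directory has been configured yet.

```shell
#!/bin/bash
# Sketch of the install steps; the function body runs in a subshell so that
# "cd" and "set -e" do not affect the calling shell.
install_predicthaplo() (
    set -e
    # Assumption: packages come from conda-forge; env name is arbitrary.
    conda create -y -n predicthaplo -c conda-forge \
        cxx-compiler=1.4.1 make=4.3 cmake=3.22.1 liblapack=3.9.0 gtest=1.11.0
    git clone https://github.com/cbg-ethz/PredictHaplo.git
    cd PredictHaplo
    # Assumption: configure first, since a fresh clone has no build tree.
    conda run -n predicthaplo cmake -S . -B build
    conda run -n predicthaplo cmake --build build
    # Step 5: the executable is expected at build/predicthaplo.
    ls -l build/predicthaplo
)

# Nothing is installed until the function is called:
# install_predicthaplo
```

If the repository's README describes a different build procedure, that should take precedence over this sketch.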

I am no expert in getting C++ code to run, but this has worked in the past.

Al313 commented 7 months ago

Dear Lara,

Thanks for getting back to me with your suggestion. Indeed, I was running an older version of the tool. After installing and running the current version from the GitHub repository, however, I encountered a "Segmentation Fault" error, which I was unable to resolve on my own. This error has been reported by others too, so I searched the Issues history of this repository, but could not find a helpful solution. I have already shared an example SAM file and the reference sequence from my dataset in the original post; I'd appreciate it if you or other developers could run the tool on that sample and share any insight into what could be causing the "Segmentation Fault" error.

Thanks for your consideration, Ali