chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License

issues with memory usage #686

Closed LaurenHuet closed 3 weeks ago

LaurenHuet commented 1 month ago

Hello,

I have been running HiFi + Hi-C assemblies with hifiasm, and over the past few weeks I have been seeing higher-than-usual memory usage. I am trying to determine whether the cause is something going wrong in hifiasm or a broader problem on my HPC cluster.

I have tested this with both hifiasm 0.19.9 and 0.19.8. I run hifiasm from Singularity containers and have tested with both the new Singularity version (4.1) and the old one (3.11). The cluster uses Slurm.

I have tested this with older data that previously assembled successfully with hifiasm, to rule out a problem with our new data.

hifiasm_memory_useage.xlsx

I have tracked 14 assembly attempts and attached a spreadsheet recording, for each one, the date of assembly, the hifiasm version, the Singularity version, the estimated genome size, the HiFi and Hi-C coverage, the memory used, the wall time, and whether the assembly succeeded.
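The memory figures in a spreadsheet like this can be cross-checked against Slurm's own accounting, for example with `sacct -j <jobid> --format=JobID,MaxRSS,ReqMem,Elapsed,State`. The helper below is a minimal sketch for putting such values on one scale. The function name `to_gib` is hypothetical, and it assumes each value is a number with at most one trailing K, M, G, or T suffix; treating an unsuffixed value as bytes is also an assumption.

```shell
# Convert a memory value such as 241172480K or 230G to GiB (two decimals).
# Assumption: at most one trailing unit letter (K, M, G or T).
# Assumption: a value with no suffix is a byte count.
to_gib() {
    printf '%s\n' "$1" | awk '{
        unit  = toupper(substr($0, length($0)))  # last character of the value
        value = $0 + 0                           # awk keeps the leading numeric part
        if (unit == "K")      gib = value / (1024 * 1024)
        else if (unit == "M") gib = value / 1024
        else if (unit == "G") gib = value
        else if (unit == "T") gib = value * 1024
        else                  gib = value / (1024 * 1024 * 1024)
        printf "%.2f\n", gib
    }'
}
```

For example, `to_gib 241172480K` gives the peak memory of a job in the same unit as the `--mem=230G` request, which makes it easier to see how close a run came to the limit.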

The spreadsheet includes 2 earlier assemblies where hifiasm behaved as expected: one for a shark genome of over 4GB and one for a small fish genome of less than 1GB. These are highlighted in blue. I have attempted to reassemble the smaller fish genome twice, with different software versions; these attempts are highlighted in orange.

hifiasm_memory_usage_screenshot

This is the script I used for the large shark genome, which ran successfully:

#!/bin/bash --login

#---------------
#hifiasm.sh: runs HiFiasm - a fast haplotype-resolved de novo assembler for PacBio HiFi reads, with Hi-C integration

#---------------
#Requested resources:
#SBATCH --account=######
#SBATCH --job-name=hifiasm
#SBATCH --partition=highmem
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --time=72:00:00
#SBATCH --mem=750G
#SBATCH --export=ALL
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
#SBATCH --mail-type=BEGIN,END

date=$(date +%y%m%d)

echo "========================================="
echo "SLURM_JOB_ID = $SLURM_JOB_ID"
echo "SLURM_NODELIST = $SLURM_NODELIST"
echo "DATE: $date"
echo "========================================="

#---------------
# Define variables from config file
sample=$(grep '^sample=' ../../config.ini | cut -d '=' -f2)
seq_date="v$(grep '^seq_date=' ../../config.ini | cut -d '=' -f2)"
asm_ver=$(grep '^asm_ver=' ../../config.ini | cut -d '=' -f2)
ver="${seq_date}.${asm_ver}"
out="${sample}_${seq_date}"

# Define paths
full_path=$(pwd)
processed_dir="$(dirname "$(dirname "$full_path")")/02-assembly"
hic_dir="$(dirname "$full_path")/raw/hic/"

echo "sample: $sample"
echo "seq_date: $seq_date"
echo "ver: $ver"
echo "output prefix: $out"
echo "Full path: $full_path"
echo "Target directory: $processed_dir"

# Define Hi-C files 
H1="${hic_dir}"OG706L-7_HICL1_S5_R1.fastq.gz
for file in $H1; do
    echo "Hi-C forward: $file"
done

H2="${hic_dir}"OG706L-7_HICL1_S5_R2.fastq.gz
for file in $H2; do
    echo "Hi-C reverse: $file"
done

#---------------
# Run hifiasm, then move output files to target dir
singularity run $SING/hifiasm:0.19.8.sif hifiasm -t 128 -o $out --primary --h1 ${H1} --h2 ${H2} *fastq.gz \
&& find . -type f \( -name "*.gfa" -o -name "*.bed" -o -name "*.bin" \) -exec mv {} "$processed_dir" \;
echo "Assembly output files moved to the 02-assembly directory."

#---------------
#Successfully finished
echo "Done"
exit 0

This is the script run with the new hifiasm container for the smaller fish genome. It has been run twice: the run in June was successful, but every run in the past 3 weeks has failed.

#!/bin/bash --login

#---------------
#hifiasm.sh: runs HiFiasm - a fast haplotype-resolved de novo assembler for PacBio HiFi reads, with Hi-C integration

#---------------
#Requested resources:
#SBATCH --account=#####
#SBATCH --job-name=hifiasm
#SBATCH --partition=work
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --time=12:00:00
#SBATCH --mem=230G
#SBATCH --export=ALL
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

date=$(date +%y%m%d)

echo "========================================="
echo "SLURM_JOB_ID = $SLURM_JOB_ID"
echo "SLURM_NODELIST = $SLURM_NODELIST"
echo "DATE: $date"
echo "========================================="

module load pawseyenv/2023.08
module load singularity/3.11.4-slurm

#---------------
# Define variables from config file
sample=$(grep '^sample=' ../../config.ini | cut -d '=' -f2)
seq_date="v$(grep '^seq_date=' ../../config.ini | cut -d '=' -f2)"
asm_ver=$(grep '^asm_ver=' ../../config.ini | cut -d '=' -f2)
ver="${seq_date}.${asm_ver}"
out="${sample}_${seq_date}"

# Define paths
full_path=$(pwd)
processed_dir="$(dirname "$(dirname "$full_path")")/02-assembly"
hic_dir="$(dirname "$full_path")/raw/hic/"

echo "sample: $sample"
echo "seq_date: $seq_date"
echo "ver: $ver"
echo "output prefix: $out"
echo "Full path: $full_path"
echo "Target directory: $processed_dir"

# Define Hi-C files 
H1="${hic_dir}"*R1*.fastq.gz
for file in $H1; do
    echo "Hi-C forward: $file"
done

H2="${hic_dir}"*R2*.fastq.gz
for file in $H2; do
    echo "Hi-C reverse: $file"
done

#---------------
# Run hifiasm, then move output files to target dir
singularity run $SING/hifiasm:0.19.9.sif hifiasm -t 128 -o $out --primary --h1 ${H1} --h2 ${H2} *fastq.gz \
&& find . -type f \( -name "*.gfa" -o -name "*.bed" -o -name "*.bin" \) -exec mv {} "$processed_dir" \;
echo "Assembly output files moved to the 02-assembly directory."
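One difference between the two scripts is that the second selects the Hi-C files with a glob (`*R1*.fastq.gz`) instead of a single file name. As far as hifiasm's usage text describes it, `--h1` and `--h2` each take a comma-separated list of file names, so if the glob matched more than one file the shell would pass the extra files as separate arguments. Whether that happened here is not known from the post. The sketch below shows one way to build a comma-separated list from a glob; the function name `hic_list` is hypothetical.

```shell
# Print all files in directory $1 that match glob $2 as one comma-separated string.
# The body runs in a subshell, so its variables do not leak into the caller.
hic_list() (
    dir=${1%/}          # drop a trailing slash, if any
    pattern=$2
    list=""
    for f in "$dir"/$pattern; do
        [ -e "$f" ] || continue        # skip the literal pattern when nothing matches
        list="${list:+$list,}$f"       # append with a comma unless the list is empty
    done
    printf '%s\n' "$list"
)

# Intended use with the variables from the script above (not executed here):
#   H1=$(hic_list "$hic_dir" '*R1*.fastq.gz')
#   H2=$(hic_list "$hic_dir" '*R2*.fastq.gz')
#   ... hifiasm -t 128 -o "$out" --primary --h1 "$H1" --h2 "$H2" *fastq.gz
```

Because the glob expands in sorted order, R1 and R2 files whose names differ only in the read number end up in the same order in both lists. Echoing `$H1` and `$H2` before the run shows exactly what hifiasm will receive.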

Have there been any reports of memory leaks? I cannot find any open issues that are purely memory-related at the moment.