RolandFaure / Hairsplitter

Software that separates very close sequences that have been collapsed during assembly. Uses only long reads.
GNU General Public License v3.0

High RAM Usage during "separate_reads" & GraphUnzip Output Fragmented #4

Closed: FrostFlow13 closed this issue 1 year ago

FrostFlow13 commented 1 year ago

EDIT:

RAM usage was solved by my decision to remove all reads below 25 kb - it now uses far, far more reasonable amounts of RAM (I think it peaked at ~5-10 GB at one point), meaning the shorter reads were absolutely the issue on that front. The GraphUnzip step, unfortunately, still results in fragmented assemblies, even when using the 25+ kb read file (the unzipped .gfa also still looks great).
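For anyone hitting the same wall, the length filtering can be done with something along these lines (a sketch using seqkit's minimum-length filter; the filenames and cutoff are just placeholders for whatever fits your data):

# Keep only reads >= 25 kb before handing the file to HairSplitter
seqkit seq -m 25000 BC16.fastq > BC16-25kbmin.fastq

# Quick check of what survived the filter
seqkit stats BC16-25kbmin.fastq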


I've been looking at some of the outputs now that I've managed to get HairSplitter to run all the way to the end, and there are a few more things I've noticed or have questions about.


First, and this is minor because I'm able to get the information via the job output log from the supercomputing cluster I use, the hairsplitter.log file in my output directory (not the one in /tmp/) only includes information from Stage 1. For example:

STAGE 1 : Deleting contigs that look like parts of haplotypes but that are disconnected from the rest of the assembly
Suppressing contig edge_34 because it is disconnected from the rest of the graph and contained in edge_32
Suppressing contig edge_54 because it is disconnected from the rest of the graph and contained in edge_51

Again, it's not that big of an issue (for me at least) because I have an output log elsewhere from my job script, but it's just something I've noticed.


Second, during the "separate_reads" step, it seems to have incredibly high RAM usage with large long read files, making it hard to predict how much RAM/memory you will need to request/allocate for Hairsplitter. Running this job script:

#!/bin/bash
#SBATCH --time=08:00:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=48 --gpus-per-node=2 --partition=gpuserial-48core
#SBATCH --account=PAS1802
#SBATCH --job-name=1739-hairsplitter-DualCore-semiLONG
#SBATCH --export=ALL
#SBATCH --output=1739-hairsplitter-DualCore-semiLONG.out.%j
module load cmake/3.25.2
module load gnu/11.2.0
source /users/PAS1802/woodruff207/miniconda3/bin/activate
conda activate hairsplitter_env
cd /fs/ess/PAS1802/ALW/2023_06_15-MAY1376_TLOKOs_LongRead/1739/2_flye_assembly-keephap-2/
python /users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../1_demul_adtrim/BC16.fastq -i assembly_graph.gfa -x ont -o ../8_hairsplitter-DualCore-semiLONG
fails during Stage 5, where the separate_reads step is killed for running out of memory:

===== STAGE 5: Separating reads by haplotype of origin   [ 2023-08-17 03:41:59.941568 ]

 - Separating reads by haplotype of origin
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/separate_reads ../8_hairsplitter-DualCore-semiLONG/tmp/filtered_variants.col 1 0.0059046 0 ../8_hairsplitter-DualCore-semiLONG/tmp/reads_haplo.gro
ERROR: separate_reads failed. Was trying to run: /users/PAS1802/woodruff207/Hairsplitter/src/build/separate_reads ../8_hairsplitter-DualCore-semiLONG/tmp/filtered_variants.col 1 0.0059046 0 ../8_hairsplitter-DualCore-semiLONG/tmp/reads_haplo.gro
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=23613304.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

The BC16.fastq file is 8.6 GB, and the input contigs for this run came from a .gfa file whose largest contigs were 1,886,358 bp; 1,780,637 bp; 1,381,840 bp; and 1,243,546 bp (the rest were below 1,000,000 bp).
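As a side note, SLURM's job accounting at least shows how much memory the step actually used before it was killed, which helps when guessing the next allocation (assuming sacct, and optionally seff, are available on the cluster; the job ID is the one from the slurmstepd error above):

# Peak resident memory (MaxRSS) and final state of each step of the job
sacct -j 23613304 --format=JobID,JobName,MaxRSS,Elapsed,State

# Or, if the seff wrapper is installed, a one-line efficiency summary
seff 23613304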

Subsampling the file to reduce its size worked for me previously, but I'm concerned that doing so might remove larger reads that could help improve detection or implementation of heterozygous deletions or inversions (assuming that's something HairSplitter is capable of during the generation of new contigs). Additionally, given how long it takes to run, experimenting with settings is hard: there's a wait of a few hours for a job slot to open up, and then another wait to see whether or not there is enough memory for it.
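For what it's worth, random downsampling of the whole file (as opposed to the length filtering in the edit above) would look roughly like this with seqkit (a sketch only; the proportion and seed are arbitrary):

# Randomly keep ~50% of reads, short and long alike (-s fixes the seed for reproducibility)
seqkit sample -p 0.5 -s 11 BC16.fastq > BC16-sub50.fastq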


Third, for my two runs that have worked (one using my original haploid assembly, the second using a .gfa file from Flye output; both used the subsampled long read file), it seems like GraphUnzip (I think it's GraphUnzip at least) is acting oddly compared to your poster/presentation. In your poster, the before and after images look great and the phased chromosomes remain adjacent/connected: [image]

However, when I ran it myself, the end results (hairsplitter_final_assembly.gfa and hairsplitter_final_assembly.fasta) for both the haploid assembly and the .gfa Flye assembly were heavily fragmented (the following is from my .gfa assembly, but it closely mimics what I saw for the haploid assembly): [image]

What's most confusing to me is that when I checked the /tmp/ directory, there were two big things of note. First, the logGrapUnzip.txt file is empty, so there's no information on what exactly it was doing. Second, the zipped_assembly.gfa file looks like it had actually done a great job of phasing portions of the genome: [image]

When I check the hairsplitter_final_assembly.gfa and hairsplitter_final_assembly.fasta files, they have done an EXCELLENT job at phasing SNPs (relative to a current phased assembly in the field, and apart from missing some heterozygous deletions/inversions) - as I mentioned, it just seems like GraphUnzip somehow heavily fragmented the resulting assembly, rather than maintaining the previous connections like your poster seems to show. My long read file does have a number of shorter reads (shorter meaning between ~300-1000 bp) - could those possibly be interfering with the unzipping step somehow? Instead of randomly subsampling long reads, should I selectively remove reads shorter than ~2000-2500 bp, since Hairsplitter/GraphUnzip seems to be using ~2000 bp segments (plus or minus a few bases) for phasing?


UPDATE:

I'm going to experiment with a pipeline that drops all reads below either 10 kb or 25 kb, runs Flye to assemble a draft genome, then runs that through Hairsplitter. Looking at the stats of the .fastq files using seqkit stats [NAME].fastq makes me fairly confident that the shorter reads could be choking Hairsplitter as it tries to draw upon them:

$ seqkit stats BC15.fastq 
file        format  type   num_seqs        sum_len  min_len  avg_len  max_len
BC15.fastq  FASTQ   DNA   1,738,400  6,814,432,181        1  3,919.9  200,922

$ seqkit stats BC15-10kbmin.fastq 
file                format  type  num_seqs        sum_len  min_len  avg_len  max_len
BC15-10kbmin.fastq  FASTQ   DNA    169,356  4,732,147,553   10,000   27,942  200,922

$ seqkit stats BC15-25kbmin.fastq 
file                format  type  num_seqs        sum_len  min_len  avg_len  max_len
BC15-25kbmin.fastq  FASTQ   DNA     80,732  3,294,751,518   25,000   40,811  200,922

Hopefully, removing all of those shorter reads will solve a few more problems.
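Roughly, the pipeline I have in mind looks like this (just a sketch; the paths, thread count, length cutoff, and Flye read-type flag would all need adjusting for the actual data):

# 1) Drop reads below the length cutoff (10 kb or 25 kb)
seqkit seq -m 25000 BC15.fastq > BC15-25kbmin.fastq

# 2) Re-assemble a draft genome with Flye from the filtered reads
flye --nano-raw BC15-25kbmin.fastq --out-dir flye_25kbmin --threads 48

# 3) Run HairSplitter on the Flye graph, using the same filtered reads
python hairsplitter.py -f BC15-25kbmin.fastq -i flye_25kbmin/assembly_graph.gfa -x ont -o hairsplitter_25kbmin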

RolandFaure commented 1 year ago

Hello, thank you very much for your feedback. I'm very happy the phasing seems to be working well.

FrostFlow13 commented 1 year ago

Thank you for helping!

RAM Usage & Filtering Short Reads: removing all of the shorter reads definitely helped A LOT.

GraphUnzip: thank you for the extra information on the output from GraphUnzip! I'll definitely give --multiploid a try - I know the exact ploidy of the strain, and haplotypes SHOULD have the same coverage (barring a few repetitive regions that are known to expand and contract independently between haplotypes).
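In case it helps anyone else landing on this issue: I'm assuming --multiploid is passed straight to hairsplitter.py, something like the line below, but check hairsplitter.py --help for where the flag actually belongs:

# Assumed invocation only - the placement of --multiploid is a guess, verify with hairsplitter.py --help
python hairsplitter.py -f BC16-25kbmin.fastq -i assembly_graph.gfa -x ont --multiploid -o hairsplitter_multiploid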