RolandFaure / Hairsplitter

Software that separates very close sequences that have been collapsed during assembly. Uses only long reads.
GNU General Public License v3.0

High RAM Usage during "separate_reads" & GraphUnzip Output Fragmented #4

Closed: FrostFlow13 closed this issue 1 year ago

FrostFlow13 commented 1 year ago

EDIT:

RAM usage was solved by my decision to remove all reads below 25 kb - it now uses far, far more reasonable amounts of RAM (I think it peaked at ~5-10 GB at one point), meaning the shorter reads were absolutely the issue on that front. The GraphUnzip step, unfortunately, still results in fragmented assemblies, even when using the 25+ kb read file (the unzipped .gfa also still looks great).
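For anyone hitting the same wall, the length filtering can be done with something along these lines (a sketch using seqkit's minimum-length filter; the filenames and cutoff are just placeholders for whatever fits your data):

# Keep only reads >= 25 kb before handing the file to HairSplitter
seqkit seq -m 25000 BC16.fastq > BC16-25kbmin.fastq

# Quick check of what survived the filter
seqkit stats BC16-25kbmin.fastq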


I've been looking at some of the outputs now that I've managed to get HairSplitter to run all the way to the end, and there are a few more things I've noticed or have questions about.


First, and this is minor because I'm able to get the information via the job output log from the supercomputing cluster I use, the hairsplitter.log file in my output directory (not the one in /tmp/) only includes information from Stage 1. For example:

STAGE 1 : Deleting contigs that look like parts of haplotypes but that are disconnected from the rest of the assembly
Suppressing contig edge_34 because it is disconnected from the rest of the graph and contained in edge_32
Suppressing contig edge_54 because it is disconnected from the rest of the graph and contained in edge_51

Again, it's not that big of an issue (for me at least) because I have an output log elsewhere from my job script, but it's just something I've noticed.


Second, during the "separate_reads" step, it seems to have incredibly high RAM usage with large long read files, making it hard to predict how much RAM/memory you will need to request/allocate for Hairsplitter. Running this job script:

#!/bin/bash
#SBATCH --time=08:00:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=48 --gpus-per-node=2 --partition=gpuserial-48core
#SBATCH --account=PAS1802
#SBATCH --job-name=1739-hairsplitter-DualCore-semiLONG
#SBATCH --export=ALL
#SBATCH --output=1739-hairsplitter-DualCore-semiLONG.out.%j
module load cmake/3.25.2
module load gnu/11.2.0
source /users/PAS1802/woodruff207/miniconda3/bin/activate
conda activate hairsplitter_env
cd /fs/ess/PAS1802/ALW/2023_06_15-MAY1376_TLOKOs_LongRead/1739/2_flye_assembly-keephap-2/
python /users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../1_demul_adtrim/BC16.fastq -i assembly_graph.gfa -x ont -o ../8_hairsplitter-DualCore-semiLONG
fails during Stage 5, where the separate_reads step is killed for running out of memory:

===== STAGE 5: Separating reads by haplotype of origin   [ 2023-08-17 03:41:59.941568 ]

 - Separating reads by haplotype of origin
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/separate_reads ../8_hairsplitter-DualCore-semiLONG/tmp/filtered_variants.col 1 0.0059046 0 ../8_hairsplitter-DualCore-semiLONG/tmp/reads_haplo.gro
ERROR: separate_reads failed. Was trying to run: /users/PAS1802/woodruff207/Hairsplitter/src/build/separate_reads ../8_hairsplitter-DualCore-semiLONG/tmp/filtered_variants.col 1 0.0059046 0 ../8_hairsplitter-DualCore-semiLONG/tmp/reads_haplo.gro
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=23613304.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

The BC16.fastq file is 8.6 GB, and the input contigs for this run came from a .gfa file whose largest contigs were 1,886,358 bp; 1,780,637 bp; 1,381,840 bp; and 1,243,546 bp (the rest were below 1,000,000 bp).
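As a side note, SLURM's job accounting at least shows how much memory the step actually used before it was killed, which helps when guessing the next allocation (assuming sacct, and optionally seff, are available on the cluster; the job ID is the one from the slurmstepd error above):

# Peak resident memory (MaxRSS) and final state of each step of the job
sacct -j 23613304 --format=JobID,JobName,MaxRSS,Elapsed,State

# Or, if the seff wrapper is installed, a one-line efficiency summary
seff 23613304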

Subsampling the file to reduce its size worked for me previously, but I'm concerned that doing so might remove larger reads that could help improve detection or implementation of heterozygous deletions or inversions (assuming that's something HairSplitter is capable of during the generation of new contigs). Additionally, given how long it takes to run, experimenting with settings is hard: there's a wait of a few hours for a job slot to open up, and then another wait to see whether or not there is enough memory for it.
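For what it's worth, random downsampling of the whole file (as opposed to the length filtering in the edit above) would look roughly like this with seqkit (a sketch only; the proportion and seed are arbitrary):

# Randomly keep ~50% of reads, short and long alike (-s fixes the seed for reproducibility)
seqkit sample -p 0.5 -s 11 BC16.fastq > BC16-sub50.fastq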


Third, for my two runs that have worked (one using my original haploid assembly, the second using a .gfa file from Flye output; both used the subsampled long read file), it seems like GraphUnzip (I think it's GraphUnzip at least) is acting oddly compared to your poster/presentation. In your poster, the before and after images look great and the phased chromosomes remain adjacent/connected: [image]

However, when I ran it myself, the end results (hairsplitter_final_assembly.gfa and hairsplitter_final_assembly.fasta) for both the haploid assembly and the .gfa Flye assembly were heavily fragmented (the following is from my .gfa assembly, but it closely mimics what I saw for the haploid assembly): [image]

What's most confusing to me is that when I checked the /tmp/ directory, there were two big things of note. First, the logGrapUnzip.txt file is empty, so there's no information on what exactly it was doing. Second, the zipped_assembly.gfa file looks like it had actually done a great job of phasing portions of the genome: [image]

When I check the hairsplitter_final_assembly.gfa and hairsplitter_final_assembly.fasta files, they have done an EXCELLENT job at phasing SNPs (relative to a current phased assembly in the field, and apart from missing some heterozygous deletions/inversions) - as I mentioned, it just seems like GraphUnzip somehow heavily fragmented the resulting assembly, rather than maintaining the previous connections like your poster seems to show. My long read file does have a number of shorter reads (shorter meaning between ~300-1000 bp) - could those possibly be interfering with the unzipping step somehow? Instead of randomly subsampling long reads, should I selectively remove reads shorter than ~2000-2500 bp, since Hairsplitter/GraphUnzip seems to be using ~2000 bp segments (plus or minus a few bases) for phasing?


UPDATE:

I'm going to experiment with a pipeline that drops all reads below either 10 kb or 25 kb, runs Flye to assemble a draft genome, then runs that through Hairsplitter. Looking at the stats of the .fastq files using seqkit stats [NAME].fastq makes me fairly confident that the shorter reads could be choking Hairsplitter as it tries to draw upon them:

$ seqkit stats BC15.fastq 
file        format  type   num_seqs        sum_len  min_len  avg_len  max_len
BC15.fastq  FASTQ   DNA   1,738,400  6,814,432,181        1  3,919.9  200,922

$ seqkit stats BC15-10kbmin.fastq 
file                format  type  num_seqs        sum_len  min_len  avg_len  max_len
BC15-10kbmin.fastq  FASTQ   DNA    169,356  4,732,147,553   10,000   27,942  200,922

$ seqkit stats BC15-25kbmin.fastq 
file                format  type  num_seqs        sum_len  min_len  avg_len  max_len
BC15-25kbmin.fastq  FASTQ   DNA     80,732  3,294,751,518   25,000   40,811  200,922

Hopefully, removing all of those shorter reads will solve a few more problems.
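Roughly, the pipeline I have in mind looks like this (just a sketch; the paths, thread count, length cutoff, and Flye read-type flag would all need adjusting for the actual data):

# 1) Drop reads below the length cutoff (10 kb or 25 kb)
seqkit seq -m 25000 BC15.fastq > BC15-25kbmin.fastq

# 2) Re-assemble a draft genome with Flye from the filtered reads
flye --nano-raw BC15-25kbmin.fastq --out-dir flye_25kbmin --threads 48

# 3) Run HairSplitter on the Flye graph, using the same filtered reads
python hairsplitter.py -f BC15-25kbmin.fastq -i flye_25kbmin/assembly_graph.gfa -x ont -o hairsplitter_25kbmin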

RolandFaure commented 1 year ago

Hello, thank you very much for your feedback. I'm very happy the phasing seems to be working well.

FrostFlow13 commented 1 year ago

Thank you for helping!

RAM Usage & Filtering Short Reads: removing all of the shorter reads definitely helped A LOT.

GraphUnzip: thank you for the extra information on the output from GraphUnzip! I'll definitely give --multiploid a try - I know the exact ploidy of the strain, and haplotypes SHOULD have the same coverage (barring a few repetitive regions that are known to expand and contract independently between haplotypes).
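In case it helps anyone else landing on this issue: I'm assuming --multiploid is passed straight to hairsplitter.py, something like the line below, but check hairsplitter.py --help for where the flag actually belongs:

# Assumed invocation only - the placement of --multiploid is a guess, verify with hairsplitter.py --help
python hairsplitter.py -f BC16-25kbmin.fastq -i assembly_graph.gfa -x ont --multiploid -o hairsplitter_multiploid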