mmokrejs opened this issue 6 years ago
This part is normally I/O bound, so multiple threads would make the situation even worse.
We have a parallel filesystem (LustreFS) served by, I think, 54 working slave machines, with InfiniBand in between. How the data are laid out over the many hosts and drives is user-configurable per directory or even per file. The stripe size is currently 1MB, I think.
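For illustration, striping can be inspected and changed per directory with the lfs tool (just a sketch; the stripe count of 4 below is an arbitrary example, not our actual setting):

# show the current striping of a directory
lfs getstripe mygenome__SPAdes3.11.1_noecc/
# make new files in this directory stripe over 4 OSTs with a 1MB stripe size
lfs setstripe -c 4 -S 1M mygenome__SPAdes3.11.1_noecc/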
And if I could be sure the data fit into memory, I would use a ramdisk for the actual processing and then move the resulting files to the storage filesystem (see the sketch below). Oh yes, it does fit:
$ du -sh mygenome__SPAdes3.11.1_noecc/.bin_reads/
56G mygenome__SPAdes3.11.1_noecc/.bin_reads/
$
The input uncompressed FASTQ files occupied 435.86GB.
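So the converted binary reads (56G) would comfortably fit into a node-local ramdisk; a rough sketch of the workflow I have in mind (paths are made up, SPAdes input options are passed through "$@"):

# hypothetical wrapper: run the heavy processing against a node-local ramdisk
# and only move the final results back onto LustreFS at the end
TMP=/ramdisk/$PBS_JOBID
mkdir -p "$TMP"
spades.py --tmp-dir "$TMP" -o "$TMP"/assembly "$@" || exit 255
mv "$TMP"/assembly /lustre/projects/mygenome/    # final storage path is made up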
Here you can see that the "disc" traffic is 102MB/s on average, with more reading than writing.
112 x86_64 CPU cores (Intel(R) Xeon(R) CPU E5-4627 v2 @ 3.30GHz) are available, with 3.2TB of physical, local RAM.
Here you can see that the "disc" traffic is 104MB/s on average, with more reading than writing.
This is how it should be. We're reading FASTQ (a text format) and converting it to the internal binary format. The 9:1 read:write ratio is very close to the text FASTQ : SPAdes binary format file size ratio (435.86GB vs. 56GB, i.e. roughly 7.8:1).
Here is what the filesystem can handle when applications are properly written to read/write in large chunks. A very efficient alternative: bamsort, which comes from https://github.com/gt1/biobambam2
# samtools sort of a 149GB BAM file takes 1.2TB RAM and uses only a single thread despite '-@ 15' argument
# samtools sort -@ $xthreads -m "$gb_mem_per_thread"G -O bam -T "$1" -o "$2".sorted.bam "$2".bam || exit 255
#
# bamsort comes from https://github.com/gt1/biobambam2
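# presumably this overrides libmaus2's POSIX fd input block size to 1MB (matching the 1MB Lustre stripe size)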
LIBMAUS2_POSIXFDINPUT_BLOCKSIZE_OVERRIDE=1m
export LIBMAUS2_POSIXFDINPUT_BLOCKSIZE_OVERRIDE
bamsort SO=coordinate blockmb="$take_memory" inputthreads="$input_threads" outputthreads="$output_threads" level=9 index=1 I="$2".bam O="$2".sorted.bam
The currently running SPAdes process (executing read_converter.hpp / binary_converter.hpp) supposedly overloaded the LustreFS metadata servers, and after 40 minutes of attempts to flush buffers the kernel gave up (see the high system CPU load in red in the figures below). I see similar issues when applications append many too-small chunks to existing files. Running truss, strace, or a similar profiling tool should reveal the actual write sizes of the SPAdes binaries.
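Something along these lines should show it (just a sketch; $SPADES_PID is a placeholder for the converter's PID):

# attach to the running process and log every write; strace prints to stderr
strace -f -tt -e trace=write,pwrite64 -p "$SPADES_PID" 2> spades_writes.log
# summarize: number of writes and their average size in bytes
# (the syscall return value, i.e. bytes written, is the last field on each line)
awk '$NF ~ /^[0-9]+$/ { sum += $NF; cnt++ } END { if (cnt) print cnt, "writes, avg", sum/cnt, "bytes" }' spades_writes.log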
I cannot log in to the cluster node to verify this, but although I am running spades.py --tmp-dir /ramdisk/$PBS_JOBID
it seems it is still reading and writing to LustreFS at the same pace (~100 kB/s). Also, I do not see any improvement in how quickly spades.py moves on to processing the many input FASTA files.
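If I could get a shell on the node, a quick check would be to look at the process's open file descriptors and at the growth of the ramdisk tmp dir (again a sketch; the PID is a placeholder):

# which files does the converter actually have open -- ramdisk or LustreFS paths?
ls -l /proc/"$SPADES_PID"/fd
# does the ramdisk tmp dir actually grow while library #6 is being converted?
watch -n 10 du -sh /ramdisk/"$PBS_JOBID"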
And, while the log now says:
0:46:19.694 12M / 700M INFO General (read_converter.hpp : 84) Converting reads to binary format for library #6 (takes a while)
I should not see the paired_6_*.seq files on the networked filesystem until this step is complete, right? They should still be in --tmp-dir.
These files will be in the output dir since they are reused across iterations (i.e., they are long-lived). Everything else will be on scratch.
I don't understand. The paired_6_*.seq files have the same modification timestamp because they were continually updated for a while during processing of library #6 of the input files. This should have happened in --tmp-dir, and only afterwards should the paired_6_*.seq files have been moved to tt_16D1C3L12__SPAdes3.11.1_noecc_ramdisk/.bin_reads/. But these files should not have existed in tt_16D1C3L12__SPAdes3.11.1_noecc_ramdisk/.bin_reads/ until library #7 processing started, so what am I missing?
This is not how it is done currently. We may consider doing this in some future SPAdes version. Patches are always welcome, though.
Hi, although I provided 19 input files, the code ran in a single thread. To scale further, could it also do the conversion in multiple chunks per file?
This probably won't happen soon, but let me open a feature request for it. The current version is SPAdes 3.11.1. Thank you.