citiususc / BigSeqKit

BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale
GNU General Public License v3.0

seqkit stats in single machine with 20-32 threads #1

Closed avilella closed 1 year ago

avilella commented 1 year ago

Hi, would BigSeqKit improve the speed of seqkit stats on a single machine with 20-32 threads? Thanks.

cesarpomar commented 1 year ago

Hi,

In the specific case of bigseqkit stats, the task requires very little computation per sequence, so parallelizing it within a single machine doesn't make much sense. Execution time will be limited primarily by the machine's disk read speed. To reduce it, consider using a computing cluster, where multiple machines can read the file in parallel.
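
To illustrate how little work a stats pass does per sequence, here is a minimal sketch in plain Python. It is not BigSeqKit's or seqkit's implementation; the function name `fasta_stats` and the returned fields are assumptions chosen for illustration, and only a few basic counters are computed.

```python
import io


def fasta_stats(handle):
    """Single pass over FASTA text.

    Returns (num_seqs, sum_len, min_len, max_len).
    Hypothetical helper for illustration, not part of any library.
    """
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header closes the previous record and opens a new one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            # Sequence line: the only per-base work is counting characters.
            current += len(line)
    if current is not None:
        lengths.append(current)
    if not lengths:
        return (0, 0, 0, 0)
    return (len(lengths), sum(lengths), min(lengths), max(lengths))


example = io.StringIO(">a\nACGT\nAC\n>b\nGG\n")
print(fasta_stats(example))  # (2, 8, 2, 6)
```

The loop does little more than count characters, so on a large file the time spent reading the data is likely to dominate.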

Best regards.

jcpichel commented 1 year ago

To be more specific, "stats" performs only a few operations per sequence and is therefore limited by memory/disk bandwidth rather than by the CPU. Once the threads in use saturate that bandwidth, adding more threads will not improve performance. This is not the case when "stats" is parallelized across multiple nodes, because the aggregate bandwidth of several nodes is higher than that of a single one.
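
The saturation argument can be sketched as a simple back-of-the-envelope model. Everything here is an assumption for illustration: the function name `estimated_seconds`, the per-thread rate, the per-node bandwidth limit, and the file size are made-up figures, not measurements of seqkit or BigSeqKit. The model also assumes the file is split evenly and that each node reads its share independently.

```python
def estimated_seconds(file_gb, threads, per_thread_gbps, bus_gbps, nodes=1):
    """Rough read-bound time estimate (hypothetical model).

    Each node's throughput grows with the thread count until it reaches
    the node's bandwidth limit; nodes add their throughput together.
    """
    per_node = min(threads * per_thread_gbps, bus_gbps)
    return file_gb / (per_node * nodes)


# Assumed figures: 100 GB file, 0.5 GB/s per thread, 2 GB/s limit per node.
for threads in (2, 4, 20, 32):
    print(threads, "threads, 1 node:", estimated_seconds(100, threads, 0.5, 2.0), "s")

print("20 threads, 4 nodes:", estimated_seconds(100, 20, 0.5, 2.0, nodes=4), "s")
```

With these assumed numbers the estimate stops improving at 4 threads on one node (50 s for 4, 20, and 32 threads), while 4 nodes bring it down to 12.5 s.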