Fasta parsing experiment

zachcp commented 1 year ago

Hi @fubark ,

Thanks again for your awesome language. I played around with Cyber a bit today for FASTA parsing to see how it might fare against some other languages (inspiration here). My results are here if you are interested in taking a look. Right now Python is ahead by about two orders of magnitude. I know Cyber is designed for embedded systems, but I thought I might get lucky with some fast I/O as well :).

This is a really promising language that's been fun to use; thank you. zach cp

time python3 readfq.py < GCA_013297495.1_ASM1329749v1_genomic.fna
real    0m1.065s

time ./cyber readfq.cy <  GCA_013297495.1_ASM1329749v1_genomic.fna
real    2m24.335s

time ./cyber readfq2.cy <  GCA_013297495.1_ASM1329749v1_genomic.fna
real    2m30.641s

fubark commented 1 year ago

Thanks for providing readfq2. It helped me narrow down the perf bottleneck quickly. readLine was meant for getting user input from the command line, not for bulk reads from stdin. For that reason, I deprecated readLine in favor of getInput. For bulk reads from stdin, you can now do the following in readfq2:

import os 'os'

--- minimal parse. don't use object or fastq
---  '@+>' is  64 / 43 / 62
func is_fastx(chr) bool:
    if chr == 64:
        return true
    if chr == 62:
        return true
    return false
n     = 0
slen  = 0
qlen  = 0
for os.stdin.streamLines() as line:
    if is_fastx(line.charAt(0)):
        n    += 1
    else:
        slen += line.len()

print 'There are {slen} bases from {n} records in this file.'

On my Linux machine, this is now twice as fast as the python3 version (there is still much room for improvement, but it is now a fairer comparison with regard to reading lines from stdin). That said, the Python script seems to be doing more work than this one. I'm going to see what functions are missing and also flesh out more of the new File API.
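For a closer like-for-like comparison, the Python side could do the same minimal counting as the Cyber snippet above. The following is only a sketch, not the readfq.py used in the timings: `count_fastx` is a hypothetical name, and stripping the line terminator before counting is an assumption about what should count as a base.

```python
from typing import BinaryIO, Tuple


def count_fastx(stream: BinaryIO) -> Tuple[int, int]:
    """Return (records, bases) for a FASTA/FASTQ-like byte stream.

    Mirrors the minimal Cyber loop: a line starting with '>' or '@'
    counts as a record header, every other line adds to the base total.
    """
    n = 0      # number of header lines seen
    slen = 0   # total length of all non-header lines
    for line in stream:
        # Assumption: terminators are not bases, so drop them first.
        line = line.rstrip(b"\r\n")
        if not line:
            continue  # skip blank lines instead of indexing into them
        if line[0] in (62, 64):  # ASCII codes for '>' and '@'
            n += 1
        else:
            slen += len(line)
    return n, slen
```

To run it against a file on stdin, one could call `count_fastx(sys.stdin.buffer)` after `import sys` and print the two numbers.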

zachcp commented 1 year ago

Boom shakalaka! Amazing work.

Note: if Cyber can compete favorably on these benchmarks, I think you might unlock a bioinformatics market segment.

# Same for me on MacOS!

time python3 readfq.py < GCA_013297495.1_ASM1329749v1_genomic.fna
 There are 341540 records and 161512289 bases

real    0m0.898s
user    0m0.794s
sys     0m0.072s

time ./cyber readfq3.cy <  GCA_013297495.1_ASM1329749v1_genomic.fna
There are 163709211 bases from 341540 records in this file.

real    0m0.393s
user    0m0.323s
sys     0m0.062s
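The two runs above agree on 341540 records but report different base totals: 161512289 from Python and 163709211 from Cyber, a gap of 2196922. The thread does not explain the gap. One possible cause, stated here purely as an assumption, is that one script counts the line terminator of each sequence line and the other does not. The sketch below shows how that choice changes the total on a tiny in-memory sample; `total_length` and `sample` are hypothetical names for illustration.

```python
def total_length(lines, keep_terminator):
    """Sum the lengths of non-header lines, with or without terminators."""
    total = 0
    for line in lines:
        if line.startswith((b">", b"@")):
            continue  # header line, not sequence data
        if keep_terminator:
            total += len(line)                   # counts the newline byte too
        else:
            total += len(line.rstrip(b"\r\n"))   # counts bases only
    return total


# Two records, three sequence lines, eight bases in total.
sample = [b">r1\n", b"ACGT\n", b"AC\n", b">r2\n", b"GG\n"]

with_terminator = total_length(sample, keep_terminator=True)
without_terminator = total_length(sample, keep_terminator=False)
```

On this sample the two totals differ by exactly the number of sequence lines, so a similar check on the real file (comparing the gap with the count of non-header lines) would confirm or rule out this explanation.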

fubark commented 1 year ago

I just made the same script even faster by using SIMD to find the newline character. You can also now pass a read buffer size to streamLines(). It defaults to 4096 bytes, but I've found that 4 MB works well for larger files. Between this and SIMD (mostly SIMD), I'm seeing almost another 2x performance gain.
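The interaction between the read buffer size and the newline search can be sketched in Python. This is an analogy only, not Cyber's implementation: `stream_lines` and `buf_size` are hypothetical names, and `bytes.find` stands in for the SIMD scan by handing the search to CPython's native routine (whether that routine uses SIMD depends on the platform and is not guaranteed here).

```python
import io
from typing import BinaryIO, Iterator


def stream_lines(stream: BinaryIO, buf_size: int = 4096) -> Iterator[bytes]:
    """Yield lines without the trailing newline byte, using fixed-size reads."""
    pending = b""  # bytes read so far that do not yet end in a newline
    while True:
        chunk = stream.read(buf_size)
        if not chunk:
            break  # end of input
        pending += chunk
        start = 0
        while True:
            # Native search for the next newline in the buffered data.
            end = pending.find(b"\n", start)
            if end == -1:
                break  # incomplete line; wait for the next chunk
            yield pending[start:end]
            start = end + 1
        pending = pending[start:]  # keep only the unfinished tail
    if pending:
        yield pending  # last line had no terminating newline


# Example: a larger buffer means fewer read() calls for the same lines.
lines = list(stream_lines(io.BytesIO(b">a\nACGT\nAC\n"), buf_size=4 * 1024 * 1024))
```

A larger buffer reduces the number of read calls and lets each search cover more data at once, which is the general reason a multi-megabyte buffer can help on large files.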

Also worth mentioning: the same SIMD technique is now available in string.indexChar().

fubark / cyber

Fasta parsing experiment #5