BioJulia / FASTX.jl

Parse and process FASTA and FASTQ formatted files of biological sequences.
https://biojulia.dev
MIT License

Performance comparison with Rust #40

Closed jonathanBieler closed 3 years ago

jonathanBieler commented 3 years ago

I was wondering how fast this package is, so I made a small comparison with Rust's fastq library. On v1.6, for FASTQ files Rust is about 3-4 times faster, while for fastq.gz it's only 1.6x (I guess decompression becomes more of a bottleneck). I think I've used every trick to make the Julia code as fast as possible; have I missed something?

While I think the performance is pretty good (1.6x doesn't matter that much), improvements are always welcome. It seems the Julia version is still allocating a lot of memory (1.07 M allocations: 225.188 MiB, 0.11% gc time, for a 100 MB fastq.gz file). To me it seems the Rust library is not really validating or converting anything; it just loads the data into memory and goes through it. While having nicely converted records is nice, maybe a more "bare-bones" API could be provided when performance is really crucial. What do you think?

My rust code (ends up being simpler than the Julia code):

extern crate fastq;

use fastq::{parse_path, Record};

fn main() {
    let filename = String::from("test.fastq.gz");

    let path = Some(filename);

    let mut total: u64 = 0;
    let mut tot_qual: u64 = 0; 
    parse_path(path, |parser| {
        parser.each(|record| {

            let s = record.seq();
            let q = record.qual();

            for (is, iq) in s.iter().zip(q.iter()) {
                if *is == b'A' {
                    total += 1;
                    tot_qual += (*iq - 0x21) as u64; // subtract the Phred+33 offset
                }
            }
            true
        }).expect("Invalid fastq file");
    }).expect("Invalid compression");
    println!("{:.15}", tot_qual as f64 / total as f64);
}

And Julia:

using FASTX, CodecZlib, BioSymbols, BioSequences

function count_reads(fasta)
    #reader = FASTQ.Reader(open(fasta))
    reader = FASTQ.Reader(GzipDecompressorStream(open(fasta)))
    record = FASTQ.Record()
    N, tot_qual = 0, 0
    seq = LongDNASeq(200)  # preallocated once; overwritten in place by copyto! below

    while !eof(reader)
        read!(reader, record)
        q = FASTQ.quality(record)
        copyto!(seq, record)

        @inbounds for i=1:length(q)
            if seq[i] == DNA_A
                N += 1
                tot_qual += q[i]
            end
        end
    end
    close(reader)
    tot_qual/N
end

fasta = "test.fastq.gz"

@time count_reads(fasta)
jakobnissen commented 3 years ago

Dear @jonathanBieler

I've spent a lot of time optimizing this library and looking at where the bottlenecks are. The library is very, very efficient. It's hard, no matter the language, to get meaningful (say 2x) performance improvements on it. All comparisons to other libraries I've seen so far tend to reveal that FASTX does much more rigorous input validation.

Nonetheless, let's not pat our backs too hard here. There are still many inefficiencies. No big ones - those would have been optimized away by now - but lots of small ones that each perhaps add 5% to the running time, and they accumulate. Let me try to break it down.

Direct and indirect libraries in use

There are several places where inefficiencies hide, in your code and in the libraries it uses:

For a baseline, I timed your Rust and Julia code against a 1.1 GB FASTQ file on my computer. Rust is timed by taking the minimum time reported by hyperfine, Julia by taking the minimum time reported by BenchmarkTools (excluding compilation).

Rust:  1.62s
Julia: 2.96s

So 83% slower. Where does that come from?

Your code

Changing your code to operate on views of the Record's data lowers the time from 2.96s to 2.30s. Now 42% slower than Rust.
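For illustration, here is a hedged sketch of what "operating on views" might look like. It reaches into `Record` internals - `record.data` plus `record.sequence` and `record.quality` index ranges (the field layout here is an assumption, not a stable public API) - and compares raw bytes instead of converting into a `LongDNASeq`:

```julia
using FASTX, CodecZlib

# Sketch only: relies on assumed FASTQ.Record internals (byte buffer plus
# sequence/quality index ranges); not a supported public API.
function count_reads_views(path)
    reader = FASTQ.Reader(GzipDecompressorStream(open(path)))
    record = FASTQ.Record()
    N, tot_qual = 0, 0
    while !eof(reader)
        read!(reader, record)
        seq  = view(record.data, record.sequence)  # sequence bytes, no copy
        qual = view(record.data, record.quality)   # quality bytes, no copy
        @inbounds for i in eachindex(seq)
            if seq[i] == UInt8('A')                # compare raw bytes directly
                N += 1
                tot_qual += qual[i] - 0x21         # Phred+33 offset
            end
        end
    end
    close(reader)
    return tot_qual / N
end
```

The key difference from the original code is that nothing per-record is allocated: no quality vector and no sequence conversion, only views into the record's existing buffer.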

I'll skip BioSequences and FASTX, because both of these are highly optimized. I don't think large gains are to be had there.

Automa

This is where the actual parsing happens. Automa is insanely optimized, but there are still losses in a few sections:

One problem is that the original author of Automa has been unreachable for six months or so.

CodecZlib

In my experience, CodecZlib tends to be a bit slower than other gzip implementations. I'm not sure why, though, since it basically just calls a C function directly; I haven't dug too deeply into it. It must be either a problem with how the C code is compiled, or with basic inefficiencies in Julia's IO.

Basic IO functions

I think quite a bit of time is wasted in the buffering library underneath both Automa and CodecZlib, namely TranscodingStreams.jl. This is a little harder to analyse, but in practical usage it appears to add a layer of indirection which is not 100% optimized.

Furthermore, it seems that a few Base Julia IO functions are not really 100% optimized either.

The losses here are on the margins - but hey, 10 places with 5% loss makes a library 50% slower.
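The compounding claim above is easy to sanity-check: ten independent 5% slowdowns multiply rather than add, so the total is actually a bit worse than 50%:

```julia
# Ten stacked 5% slowdowns compound multiplicatively:
total = 1.05^10   # ≈ 1.63, i.e. about 63% slower end to end
```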

Base Julia

Finally, Base Julia itself is not 100% optimized. Again, it's always small things where one could reasonably argue that each individually doesn't actually matter.

Conclusion

You can optimize your code to make it faster, and SIMD capabilities could be added to Automa. Together, these cut the performance difference down to roughly 30% relative to Rust.

As usual, performance dies by a thousand cuts. When TranscodingStreams.jl was implemented, I bet no one thought people would notice, or care about, a 5% performance loss. The Julia core devs certainly thought using Ints for array lengths was fine enough - after all, the branch is perfectly predicted, so it's a small cost, right? Views need to be generic, so they should probably have stride information and such, but now it's suddenly harder to have views that are lightweight and can omit boundschecks. And explicit SIMD - people probably think it's not THAT important to get right. The autovectorizer is good enough in most cases, right?

In contrast, if you take the view that creating a single unnecessary branch or instruction in a low-level function is unacceptable, this attitude compounds across the entire stack, through all the packages. And then, you're no longer looking at a few percent here and there, but differences in tens of percent, or even nearing 100%.

jakobnissen commented 3 years ago

Oh, and to answer your question: no, I don't think it makes sense to create a bare-bones API. There are no fundamental design mistakes in FASTX or BioSequences that make them inefficient. It's downstream packages, and tiny inefficiencies scattered across the entire ecosystem. We should be able to make the existing packages fast; there's no use creating new ones.

jonathanBieler commented 3 years ago

Thanks for the very nice answer.

> When calling `FASTQ.quality`, you allocate a new array for every record. This, too, is strictly speaking not necessary.

As far as I can tell, `FASTQ.quality` is the only way the API provides to access qualities, no? So you have to do all these allocations even though you don't really need them. To me it seems you should be able to access the data directly, maybe with some iterators, e.g. `FASTQ.raw_quality(record) = (record.data[i] - 0x21 for i in record.quality)` (this gives a decent speed-up on my computer).
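Spelled out as a block, the suggested accessor and its use might look like this (`raw_quality` is a hypothetical name; `record.data` and `record.quality` are FASTX internals as referenced above, and 0x21 is the Phred+33 offset):

```julia
# Hypothetical zero-allocation accessor: a lazy generator over the
# record's raw byte buffer (internals, not public API).
raw_quality(record) = (record.data[i] - 0x21 for i in record.quality)

# Inside the read loop this would replace q = FASTQ.quality(record);
# e.g. summing qualities without allocating a vector per record:
# tot_qual += sum(raw_quality(record))
```

Since the generator is lazy, it allocates nothing per record; the trade-off, as discussed below in the thread, is that it silently aliases the record's buffer.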

That said, my example is a bit artificial (although some QC stats are pretty close to this), so I'm not sure it would be super useful in practice. Also, I've been using this package and others on rather large datasets and they do the job just fine, but sometimes it seems like tools written in C++ are faster.

Last question: I was thinking of maybe doing the same thing for BAM files; do you know if such a comparison has been done already?

jakobnissen commented 3 years ago

That's a good point; we could implement quality like that. One issue would be that at the next iteration, the underlying data of this iterator would be overwritten without any warning. Rust doesn't have that issue, obviously. So it might need to be an opt-in function, like quality_iter. We could have something similar for the sequence. That's a good idea.

I do think these small benchmarks, like the one you've posted here, are quite useful. They help keep development focused on performance. Without them, we could convince ourselves that parsing FASTQ at 250 MB/s was excellent performance. Right now, I don't think FASTX and its underlying functionality is too far off from being the fastest accurate parser around. Maybe we are 50% slower. Those 50% still matter and should be focused on.

Speed is not the most important aspect of BioJulia, of course. Its main purpose is to be a Swiss army knife that you can always reach for. So flexibility, modularity and integration with other packages are more important. But we do still need to be fast enough that people won't re-implement BioJulia functionality in order to make it faster.

I've done comparisons with BAM myself, but I've not seen other people do them. XAM.jl is in a sad state of affairs. It's not spec-compliant and has a number of sub-optimal design decisions. You'll see I have quite a few issues and PRs on the repo. But it turned out to be a little hard to fix, because once I started fixing some issues I ended up rewriting large parts of the package, and then I confused myself and became unfocused. I'm still unsure whether it's best to rewrite it all from scratch, or to make a long series of breaking PRs and a new release.

One issue I did work on was the performance of XAM.jl's underlying BGZF de/compression library, so I created LibDeflate.jl and CodecBGZF.jl. But XAM.jl hasn't yet migrated to CodecBGZF.jl; that should improve its performance significantly.

jakobnissen commented 3 years ago

Closing as resolved. You're welcome to comment to re-open if you disagree.