BioJulia / FASTX.jl

Parse and process FASTA and FASTQ formatted files of biological sequences.
https://biojulia.dev
MIT License

Performance comparison with Rust #40

Closed jonathanBieler closed 3 years ago

jonathanBieler commented 3 years ago

I was wondering how fast this package is, so I made a small comparison with Rust's fastq library. On v1.6, for FASTQ files Rust is about 3-4 times faster, while for fastq.gz it's only 1.6x (I guess decompression becomes more of a bottleneck). I think I've used every trick to make the Julia code as fast as possible; have I missed something?

While I think the performance is pretty good (1.6x doesn't matter that much), improvements are always welcome. It seems the Julia version is still allocating a lot of memory (1.07 M allocations: 225.188 MiB, 0.11% gc time, for a 100 MB fastq.gz file). To me it seems the Rust library is not really validating or converting anything; it just loads the data into memory and goes through it. While having nicely converted records is nice, maybe a more "bare-bones" API could be provided when performance is really crucial. What do you think?

My rust code (ends up being simpler than the Julia code):

extern crate fastq;

use fastq::{parse_path, Record};

fn main() {
    let filename = String::from("test.fastq.gz");

    let path = Some(filename);

    let mut total: u64 = 0;
    let mut tot_qual: u64 = 0; 
    parse_path(path, |parser| {
        parser.each(|record| {

            let s = record.seq();
            let q = record.qual();

            for (is, iq) in s.iter().zip(q.iter()) {
                if *is == b'A' {
                    total += 1;
                    tot_qual += (*iq - 0x21) as u64; // subtract the Phred+33 offset
                }
            }
            true
        }).expect("Invalid fastq file");
    }).expect("Invalid compression");
    println!("{:.15}", tot_qual as f64 / total as f64);
}

And Julia:

using FASTX, CodecZlib, BioSymbols, BioSequences

function count_reads(fasta)
    #reader = FASTQ.Reader(open(fasta))
    reader = FASTQ.Reader(GzipDecompressorStream(open(fasta)))
    record = FASTQ.Record()
    N, tot_qual = 0, 0
    seq = LongDNASeq(200)  # preallocated once; overwritten in place by copyto! below

    while !eof(reader)
        read!(reader, record)
        q = FASTQ.quality(record)
        copyto!(seq, record)

        @inbounds for i=1:length(q)
            if seq[i] == DNA_A
                N += 1
                tot_qual += q[i]
            end
        end
    end
    close(reader)
    tot_qual/N
end

fasta = "test.fastq.gz"

@time count_reads(fasta)
jakobnissen commented 3 years ago

Dear @jonathanBieler

I've spent a lot of time optimizing this library and looking at where the bottlenecks are. The library is very, very efficient. It's hard, no matter the language, to get meaningful (say 2x) performance improvements on it. All comparisons to other libraries I've seen so far tend to reveal that FASTX does much more rigorous input validation.

Nonetheless, let's not pat our backs too hard here. There are still many inefficiencies. No big ones - those would have been optimized away by now - but lots of small ones that each perhaps add 5% to the running time, and they accumulate. Let me try to break it down.

Direct and indirect libraries in use

There are several places where inefficiencies hide, in your code and in the libraries it uses:

For a baseline, I timed your Rust and Julia code against a 1.1 GB FASTQ file on my computer. Rust is timed by taking the minimum time reported by hyperfine, Julia by taking the minimum time reported by BenchmarkTools (excluding compilation).

Rust:  1.62s
Julia: 2.96s

So 83% slower. Where does that come from?

Your code

Changing your code to operate on views of the Record's data lowers the time from 2.96s to 2.30s. Now 42% slower than Rust.
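For illustration, here is a hedged sketch of what "operating on views" might look like. It reaches into `Record` internals - `record.data` plus `record.sequence` and `record.quality` index ranges (the field layout here is an assumption, not a stable public API) - and compares raw bytes instead of converting into a `LongDNASeq`:

```julia
using FASTX, CodecZlib

# Sketch only: relies on assumed FASTQ.Record internals (byte buffer plus
# sequence/quality index ranges); not a supported public API.
function count_reads_views(path)
    reader = FASTQ.Reader(GzipDecompressorStream(open(path)))
    record = FASTQ.Record()
    N, tot_qual = 0, 0
    while !eof(reader)
        read!(reader, record)
        seq  = view(record.data, record.sequence)  # sequence bytes, no copy
        qual = view(record.data, record.quality)   # quality bytes, no copy
        @inbounds for i in eachindex(seq)
            if seq[i] == UInt8('A')                # compare raw bytes directly
                N += 1
                tot_qual += qual[i] - 0x21         # Phred+33 offset
            end
        end
    end
    close(reader)
    return tot_qual / N
end
```

The key difference from the original code is that nothing per-record is allocated: no quality vector and no sequence conversion, only views into the record's existing buffer.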

I'll skip BioSequences and FASTX, because both of these are highly optimized. I don't think large gains are to be had there.

Automa

This is where the actual parsing happens. Automa is insanely optimized, but there are still losses in a few sections:

One problem is that the original author of Automa has been unreachable for six months or so.

CodecZlib

In my experience, CodecZlib tends to be a bit slower than other gzip implementations. I'm not sure why, though, since it basically just calls a C function directly; I haven't dug too deeply into it. It must be either a problem with how the C code is compiled, or with basic inefficiencies in Julia's IO.

Basic IO functions

I think quite a bit of time is wasted in the buffering library underneath both Automa and CodecZlib, namely TranscodingStreams.jl. This is a little harder to analyse, but in practical usage it appears to add a layer of indirection which is not 100% optimized.

Furthermore, it seems that a few Base Julia IO functions are not really 100% optimized either.

The losses here are on the margins - but hey, 10 places with 5% loss makes a library 50% slower.
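The compounding claim above is easy to sanity-check: ten independent 5% slowdowns multiply rather than add, so the total is actually a bit worse than 50%:

```julia
# Ten stacked 5% slowdowns compound multiplicatively:
total = 1.05^10   # ≈ 1.63, i.e. about 63% slower end to end
```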

Base Julia

Finally, Base Julia itself is not 100% optimized. Again, it's always small things where one could reasonably argue that each individually doesn't actually matter.

Conclusion

You can optimize your code to make it faster, and SIMD capabilities could be added to Automa. Together, these cut the performance difference down to roughly 30% relative to Rust.

As usual, performance dies by a thousand cuts. When TranscodingStreams.jl was implemented, I bet no one thought people would notice, or care about, a 5% performance loss. The Julia core devs certainly thought using Ints for array lengths was fine enough - after all, the branch is perfectly predicted, so it's a small cost, right? Views need to be generic, so they should probably have stride information and such, but now it's suddenly harder to have views that are lightweight and can omit boundschecks. And explicit SIMD - people probably think it's not THAT important to get right. The autovectorizer is good enough in most cases, right?

In contrast, if you take the view that creating a single unnecessary branch or instruction in a low-level function is unacceptable, this attitude compounds across the entire stack, through all the packages. And then, you're no longer looking at a few percent here and there, but differences in tens of percent, or even nearing 100%.

jakobnissen commented 3 years ago

Oh, and to answer your question: no, I don't think it makes sense to create a bare-bones API. There are no fundamental design mistakes in FASTX or BioSequences that make them inefficient. It's downstream packages, and tiny inefficiencies scattered across the entire ecosystem. We should be able to make the existing packages fast; there's no use creating new ones.

jonathanBieler commented 3 years ago

Thanks for the very nice answer.

> When calling `FASTQ.quality`, you allocate a new array for every record. This, too, is strictly speaking not necessary.

As far as I can tell, `FASTQ.quality` is the only way the API provides to access qualities, no? So you have to do all these allocations even though you don't really need them. To me it seems you should be able to access the data directly, maybe with some iterators, e.g. `FASTQ.raw_quality(record) = (record.data[i] - 0x21 for i in record.quality)` (this gives a decent speed-up on my computer).
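Spelled out as a block, the suggested accessor and its use might look like this (`raw_quality` is a hypothetical name; `record.data` and `record.quality` are FASTX internals as referenced above, and 0x21 is the Phred+33 offset):

```julia
# Hypothetical zero-allocation accessor: a lazy generator over the
# record's raw byte buffer (internals, not public API).
raw_quality(record) = (record.data[i] - 0x21 for i in record.quality)

# Inside the read loop this would replace q = FASTQ.quality(record);
# e.g. summing qualities without allocating a vector per record:
# tot_qual += sum(raw_quality(record))
```

Since the generator is lazy, it allocates nothing per record; the trade-off, as discussed below in the thread, is that it silently aliases the record's buffer.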

That said, my example is a bit artificial (although some QC stats are pretty close to this), so I'm not sure it would be super useful in practice. Also, I've been using this package and others on rather large datasets and they do the job just fine, but sometimes it seems like tools written in C++ are faster.

Last question: I was thinking of maybe doing the same thing for BAM files; do you know if such a comparison has been done already?

jakobnissen commented 3 years ago

That's a good point; we could implement quality like that. One issue would be that at the next iteration, the underlying data of this iterator would be overwritten without any warning. Rust doesn't have that issue, obviously. So it might need to be an opt-in function, like quality_iter. We could have something similar for the sequence. That's a good idea.

I do think these small benchmarks, like the one you've posted here, are quite useful. They help keep development focused on performance. Without them, we could convince ourselves that parsing FASTQ at 250 MB/s was excellent performance. Right now, I don't think FASTX and its underlying functionality is too far off from being the fastest accurate parser around. Maybe we are 50% slower. Those 50% still matter and should be focused on.

Speed is not the most important aspect of BioJulia, of course. Its main purpose is to be a Swiss army knife that you can always reach for. So flexibility, modularity and integration with other packages are more important. But we do still need to be fast enough that people won't re-implement BioJulia functionality in order to make it faster.

I've done comparisons with BAM myself, but I've not seen other people do them. XAM.jl is in a sad state of affairs. It's not spec-compliant and has a number of sub-optimal design decisions. You'll see I have quite a few issues and PRs on the repo. But it turned out to be a little hard to fix, because once I started fixing some issues I ended up rewriting large parts of the package, and then I confused myself and became unfocused. I'm still unsure whether it's best to rewrite it all from scratch, or to make a long series of breaking PRs and a new release.

One issue I did work on was the performance of XAM.jl's underlying BGZF de/compression library, so I created LibDeflate.jl and CodecBGZF.jl. But XAM.jl hasn't yet migrated to CodecBGZF.jl; that should improve its performance significantly.

jakobnissen commented 3 years ago

Closing as resolved. You're welcome to comment to re-open if you disagree.