BioJulia / FASTX.jl

Parse and process FASTA and FASTQ formatted files of biological sequences.
https://biojulia.dev
MIT License
61 stars 20 forks source link

Consider despecializing AbstractFormattedIOs #91

Open jakobnissen opened 1 year ago

jakobnissen commented 1 year ago

With Julia 1.9, package image caching significantly reduces latency, as entire function can now be completely cached. FASTX's readers and writers are parameterized by the underlying IO type. This means different underlying IO types causes the entire Automa-generated code to be recompiled, needlessly. This takes about half a second.

We might consider somehow despecializing the readers and writers (AbstractFormattedIOs, AFIOs) on their underlying IO. This will cause a dynamic dispatch whenever the AFIOs run out of buffer and need to query their underlying IOs, but these operations are already slow, so I suspect impact would be minimal (earlier tests of mine showed insignificant slowdown when removing the type parameter of FASTAReader completely). This will make the code type unstable, but allow precompilation.

CiaranOMara commented 1 year ago

I think this is related. I was doing some work on XAM last year following @jonathanBieler PR to improve index handling, in that work I too found myself questioning the Reader parametrisation.

In the case of XAM.BAM, the Reader needs to know that it has a BGZFStream, but at the Reader level, having an awareness or parametrisation of the underlying stream is not useful when writing methods. And I thought, as I think you are saying here, in part, is that BGZFStream should specialise based on the underlying IO that it is aware of, not the Reader.

Similarly, for the XAM.SAM.Reader, following your suggestion here, I suppose it only needs to know it has an IO stream, and there is no need to parametrise the Reader on the IO stream type.

As an aside, I think the parametrisation that is useful at the Reader level is the type of indexer (BAI, FAI, .etc...) used.

CiaranOMara commented 1 year ago

@jakobnissen, could you give us/lay out a few quick structs to show/confirm the implementation you have in mind?

jakobnissen commented 1 year ago

I'll need to look through the BioJulia stack to figure out where this change is suitable. Maybe in Automa - maybe in BioGenerics - or maybe in the individual parser packages, like FASTX. When I get some free time I'll look at it.