Rethinking about the parsing interface

bicycle1885 commented 9 years ago

Thanks to the great effort of @dcjones, we're getting a powerful framework to parse various data formats at the speed of light. But I think that the interface of the parser can be more flexible and idiomatic in the context of Julia. The current interface is the read method only, and its behavior is significantly different from the methods defined in the standard library. In the Base module, read(io, type) always returns an object of type but read(io, FASTA) returns an iterator of biological sequences; this is counterintuitive to me.

I propose the following functions as public APIs on the top of the parser framework:

open(filename, format): open filename and return a stream indexed by format
read(stream, type): read a type object from stream
read!(stream, object): read a record into the preallocated object from stream
each(stream, type): return an iterator of type objects from stream
close(stream): close stream.

open, read, read! and close behave like methods in the standard library, respectively. each is something like Base.eachline. A stream object is indexed by a file format: defined as Stream{F<:FileFromat}, and Stream{FASTA} is, for example, a stream type of FASTA format..

The rationale behind this idea is that file format and data type are not completely dependent. For example, we can read DNA sequences from several formats (including FASTA, FASTQ, SAM, etc.) and a SAM file can generate several kinds of data (including DNA sequence, alignment, quality of sequence, etc.).

The next code may illustrate my idea clearly:

s = open("hg38.fa", FASTA)
while !eof(s)
    seq = read(s, DNASequence)
    ...
end
close(s)

or

open("hg38.fa", FASTA) do s
    for seq in each(s, DNASequence)
        ...
    end
end

If there is numerous number of records in a file, you can save memory allocation using read! as follows:

s = open("sample1234.bam", BAM)
seq = DNASequence()
while !eof(s)
    read!(s, seq)
    ...
end
close(s)

The multiple dispatching in Julia would enable this interface very easily; a method read(::Stream{Format}, ::Type{Record}) will be defined only if a Record object can be extracted from a Format stream.

I'd like to know your opinions. Do you think this is reasonable?

TransGirlCodes commented 9 years ago

This sounds quite reasonable to me. An API consistent with the rest of the language is only sensible, and one of our guidelines is good Julian style. I'm not actually overly familiar with Julia's Streams, indeed most work done in Julia currently works with whole chunks in memory, and AFAIK all input and output operations use an asynchronous 'innards', but are effectually synchronous in use. The third thing I have been thinking about contributing to BioJulia (after Phylo and what I have planned for Align) is Node-like streams and an event system analogous to Node's. How easy is it for us to just generate any stream we like using Julia's current API, I see from the code clips above they are defined by some FileFormat object?

bicycle1885 commented 9 years ago

Julia has several data types behave like a stream. When you call the open function defined in the Base module, you will get an IOStream object, which is a wrapper of the ios_t struct. The IOStream type seems to be synchronous by nature. Julia has other stream-like type called File defined in the FS module (https://github.com/JuliaLang/julia/blob/master/base/fs.jl), which is a wrapper of libuv and hence asynchronous. I don't know why Julia has both of these similar types. A stream type indexed by file format, I meant, is a thin wrapper of one of these stream type or a wrapper of a state type defined in the Ragel module.

dcjones commented 9 years ago

Thanks for the comments @bicycle1885. I'm generally on board with this idea, but there are a few details to work out.

While you can read a DNASequence from a many formats, each has a different set of a metadata (FASTA with name and description, FASTQ with name and quality scores, SAM with alignment info and basically anything with tags). So in some sense each file format has a natural type, and reading anything else involves some conversion. I think your read(stream, type) syntax is convenient, but we should also have just read(stream) that returns on object of the default type.

We could implement read(stream, type) as just read(stream, type) = convert(type, read(stream)), but I wonder if there's a more clever way to do this. If you are are reading SAM and only interested in the alignments (which is often the case), in theory you could make the parsing faster by not saving or validating sequences or quality scores.

I very much agree with the need for a read! function to do in-place non-allocating reading. With something like FASTQ or SAM/BAM allocation and GC is going to be bottleneck (I think that's already the case with FASTQ), so to be competitive with C we need to provide that low-level escape hatch.

The tricky part of this is that Julia strings and Bio.jl sequences are both immutable by convention. So we can overwrite the underlying arrays and avoid allocation, but it could lead to some very surprising results. I guess that's fine, as long we have the appropriate warnings and documentation. I'm tempted to call these functions unsafe_read!, just so people really know what they're getting into.

bicycle1885 commented 9 years ago

I agree. Natural return type for each data format makes sense.

Defining as read(stream, type) = convert(type, read(stream)) would be the cheapest way to implement it. I think it is sufficient for the first round, but since the convert function has semantic of "lossless conversion", I believe we should select more appropriate name. Composability of Ragel machines may make it possible to generate specialized parsers for each data type.

bicycle1885 commented 9 years ago

The exclamation mark of read! itself suggests destructive nature of the operation. So, the unsafe_ prefix looks a duplication to me.

dcjones commented 9 years ago

The exclamation mark of read! itself suggests destructive nature of the operation. So, the unsafe_ prefix looks a duplication to me.

True, but there are different levels of destructiveness. If I have a type like this

type Record
    name::ASCIIString
end

and I want to write a destructive function that shortens the name by one character, I could do this:

function shorten!(record::Record)
    record.name = record.name[1:end-1]
end

but even more efficiently, I could do this:

function shorten!(record::Record)
    resize!(record.name.data, length(record.name) - 1)
end

They are both modifying record in place, but the second is doing something very unusual by modifying the "immutable" string. That can cause serious problems if you do something like:

name = "bar"
a = Record(name)
b = Record(name)
shorten!(a)

println(b.name) # prints "ba"

unsafe_ maybe isn't quite the right prefix, because that's usually used to indicate that no bounds checking is performed. Mutating immutable strings won't segfault, but it can lead to serious bugs for anyone who doesn't completely understand what it's doing. I'm not sure how best to guard against that besides giving it a scary name and maybe not exporting it.

TransGirlCodes commented 9 years ago

I would say ! generally suggest modification of the original arguments or side effects. Take two functions from Phylo, set parent and set children for nodes (and delete parent and children for nodes) of a phylogeny. They are prefixed with unsafe (and not exported) because they really really really should not be used - they are internal to build or break one way connections between nodes, full connections between nodes are two way. So they have ! and unsafe_ as they modify the nodes, and are unsafe - if you used them you could seriously screw up the structure of the phylogeny and thus all computations on that tree later.

bicycle1885 commented 9 years ago

I understand your point. I think unsafe_ functions should not be exported from the module, so the read! method would have to be restricted to mutable sequences like #37.

By the way, I've found a package which already implements a formatted stream type: https://github.com/JuliaIO/FileIO.jl. Though this package loads a whole file in one shot, this is what I think of. If this works well without sacrificing flexibility and performance, there is no reason not to use it.

dcjones commented 9 years ago

Mutable sequences partially solves this, but a lot of the data (ids, descriptions, etc) is still regular strings. So we either have to reallocate them, unsafely modify them in place, or invent our own string-like types with different semantics.

I don't think the last option would be crazy. It could just be like this,

type Description
    data::Vector{UInt8}
end

Then define a convert function for strings that insulates the string from mutation by copying the data.

convert(::Type{String}, description::Description) = UTF8String(copy(description.data))

https://github.com/JuliaIO/FileIO.jl

We should consider hooking into FileIO, it does look pretty similar to what we want. Some changes might be required. Most notably, we need to be able to incrementally read and write data, which seems at odds with save and load API. I'll investigate this option, though.

TransGirlCodes commented 9 years ago

Do Bioinformatics formats have magic bytes? I don't think many evo and phylo related file formats do.

dcjones commented 9 years ago

BAM is the only one I know of that does. There are some HDF5 based formats, that are at least identifiable as HDF5 from the magic bytes.

breichholf commented 9 years ago

While there are some other Affymetrix specific binary file types, such as CDF I expect they're in the scope of BioJulia any time soon. Either way, Stephen Turner has a nice overview of bio-filetypes.

More on affymetrix file formats can be found on their developer homepage.

TransGirlCodes commented 9 years ago

I reckon magic-bytes would have really benefitted phylo formats... especially ones like Nexus where a program might have it's own blocks of data like Mr. Bayes.

bicycle1885 commented 9 years ago

So we either have to reallocate them, unsafely modify them in place, or invent our own string-like types with different semantics.

+1 for introducing mutable string-like types. I thinks immutability of string types is not an indispensable property.

In order to incrementally parse records, we may need to define a new type indexed by a format type. This is because there is no room to inject parsing state into the Stream type of FileIO.jl package:

immutable Stream{F<:DataFormat,IOtype<:IO} <: Formatted{F}
    io::IOtype
    filename::Nullable{UTF8String}
end

https://github.com/JuliaIO/FileIO.jl/blob/cfacc9918f07fb2f373b04ae4180f2828d08d963/src/query.jl#L170-L173

So, slightly changing the Stream type, the following ParserIO type would be more appropriate:

abstract AbstractParser

immutable ParserIO{F<:DataFormat,P<:AbstractParser}
    parser::P
    filename::Nullable{UTF8String}
end

function Base.read(p::ParserIO{format"FASTA"}, ::Type{DNASequence})
    ...
end

# and other methods

type FASTAParser <: AbstractParser
    state::Ragel.state
    ...
end

# and other parsers

Note that the ParserIO does not a subtype of IO. That means there is no responsibility to implement the read methods that read byte by byte from a stream.

dcjones commented 9 years ago

I'm having some success with this. I have have a in-place, non-allocating fastq parser working using mutable sequences, and a simple mutable string type. It can parse 1million 100nt reads (about 250MB) in 1.5 seconds (for comparison wc -l is 0.26 seconds), which is good considering we're encoding sequences, decoding quality scores, and validating everything.

One complication we need to figure out is what to do about Nullable. I was using it in the BED parser for optional fields, but we need everything to be mutable. I guess we could define our own MutableNullable type, but that's a little unfortunate. If anyone has a bright idea about how to handle typed, optional fields, let me know.

TransGirlCodes commented 9 years ago

I'm not sure I understand, Nullable variables are mutable aren't they? I would have thought then even if they were the field of an immutable type, the field binding is immutable, but the Nullable variable itself could be mutated?

With Phylo nodes, I couldn't easily predict the different fields and annotations a user might apply to a node. So I decided that rather than store such fields in the node variable, it could be done the other way, have the additional information point to a node on the tree. For such purposes a Dict{Node, T} was fine.

dcjones commented 8 years ago

I think this issue is mostly addressed now. Working with FileIO is still something we should consider, but not urgent. Are there other design points we need to think about, or can this be closed?

bicycle1885 commented 8 years ago

The current interface is satisfactory (thank you @dcjones !). If we need more functionality we can add them one by one in the future. So let's close this one.

BioJulia / Bio.jl

Rethinking about the parsing interface #79