Closed. bicycle1885 closed this issue 8 years ago.
This sounds quite reasonable to me. An API consistent with the rest of the language is only sensible, and one of our guidelines is good Julian style. I'm not actually overly familiar with Julia's streams; most work done in Julia currently operates on whole chunks in memory, and AFAIK all input and output operations use asynchronous innards but are effectively synchronous in use. The third thing I have been thinking about contributing to BioJulia (after Phylo and what I have planned for Align) is Node-like streams and an event system analogous to Node's. How easy is it for us to generate any stream we like using Julia's current API? I see from the code snippets above that they are defined by some `FileFormat` object.
Julia has several data types that behave like a stream. When you call the `open` function defined in the `Base` module, you get an `IOStream` object, which is a wrapper of the `ios_t` struct. The `IOStream` type seems to be synchronous by nature. Julia has another stream-like type called `File`, defined in the `FS` module (https://github.com/JuliaLang/julia/blob/master/base/fs.jl), which is a wrapper of libuv and hence asynchronous. I don't know why Julia has both of these similar types. What I mean by a stream type indexed by file format is a thin wrapper of one of these stream types, or a wrapper of a state type defined in the `Ragel` module.
Thanks for the comments @bicycle1885. I'm generally on board with this idea, but there are a few details to work out.
While you can read a `DNASequence` from many formats, each has a different set of metadata (FASTA with name and description, FASTQ with name and quality scores, SAM with alignment info, and basically anything with tags). So in some sense each file format has a natural type, and reading anything else involves some conversion. I think your `read(stream, type)` syntax is convenient, but we should also have just `read(stream)`, which returns an object of the default type.

We could implement `read(stream, type)` as just `read(stream, type) = convert(type, read(stream))`, but I wonder if there's a more clever way to do this. If you are reading SAM and only interested in the alignments (which is often the case), in theory you could make the parsing faster by not saving or validating sequences or quality scores.
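For instance, the generic fallback plus a hand-specialized method might look like this. This is only a sketch; `SAMStream` and `Alignment` stand in for whatever the real types end up being:

```julia
# Generic fallback: parse the format's natural record type, then convert.
Base.read{T}(stream::SAMStream, ::Type{T}) = convert(T, read(stream))

# Hand-specialized method that skips work the caller doesn't need,
# e.g. not saving or validating sequences and quality scores when only
# the alignments are requested. (Body is a sketch.)
function Base.read(stream::SAMStream, ::Type{Alignment})
    # parse only the alignment fields of the next record
    # ...
end
```

Because the specialized method is more specific, Julia's dispatch would pick it over the `convert`-based fallback automatically.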
I very much agree with the need for a `read!` function to do in-place, non-allocating reading. With something like FASTQ or SAM/BAM, allocation and GC are going to be a bottleneck (I think that's already the case with FASTQ), so to be competitive with C we need to provide that low-level escape hatch.

The tricky part of this is that Julia strings and Bio.jl sequences are both immutable by convention. So we can overwrite the underlying arrays and avoid allocation, but it could lead to some very surprising results. I guess that's fine, as long as we have the appropriate warnings and documentation. I'm tempted to call these functions `unsafe_read!`, just so people really know what they're getting into.
I agree. A natural return type for each data format makes sense.
Defining `read(stream, type) = convert(type, read(stream))` would be the cheapest way to implement it. I think it is sufficient for the first round, but since the `convert` function has the semantics of "lossless conversion", I believe we should choose a more appropriate name. The composability of Ragel machines may make it possible to generate specialized parsers for each data type.
The exclamation mark of `read!` itself suggests the destructive nature of the operation, so the `unsafe_` prefix looks like a duplication to me.
> The exclamation mark of `read!` itself suggests the destructive nature of the operation, so the `unsafe_` prefix looks like a duplication to me.
True, but there are different levels of destructiveness. If I have a type like this:

```julia
type Record
    name::ASCIIString
end
```

and I want to write a destructive function that shortens the name by one character, I could do this:

```julia
function shorten!(record::Record)
    record.name = record.name[1:end-1]
end
```

but even more efficiently, I could do this:

```julia
function shorten!(record::Record)
    resize!(record.name.data, length(record.name) - 1)
end
```
They are both modifying `record` in place, but the second is doing something very unusual by modifying the "immutable" string. That can cause serious problems if you do something like:

```julia
name = "bar"
a = Record(name)
b = Record(name)
shorten!(a)
println(b.name) # prints "ba"
```
`unsafe_` maybe isn't quite the right prefix, because that's usually used to indicate that no bounds checking is performed. Mutating immutable strings won't segfault, but it can lead to serious bugs for anyone who doesn't completely understand what it's doing. I'm not sure how best to guard against that besides giving it a scary name and maybe not exporting it.
I would say `!` generally suggests modification of the original arguments, or side effects. Take two pairs of functions from Phylo: those that set the parent and the children of nodes (and those that delete the parent and children of nodes) of a phylogeny. They are prefixed with `unsafe_` (and not exported) because they really, really, really should not be used: they are internal functions that build or break one-way connections between nodes, whereas full connections between nodes are two-way. So they have both `!` and `unsafe_`, as they modify the nodes and are unsafe; if you used them you could seriously screw up the structure of the phylogeny, and thus all computations on that tree later.
I understand your point.
I think `unsafe_` functions should not be exported from the module, so the `read!` method would have to be restricted to mutable sequences like #37.
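One way to arrange that, sketched with hypothetical names (`SeqIO`, `MutableSequence`, `unsafe_read!` are all invented for illustration), is to export only the safe method and keep the aliasing variant internal:

```julia
# Sketch: module layout keeping the unsafe variant internal.
module SeqIO

export read!   # public, safe in-place read restricted to mutable sequences

# Safe: only accepts a mutable sequence type, so overwriting is expected.
read!(stream, seq::MutableSequence) = unsafe_read!(stream, seq)

# Unsafe and unexported: may overwrite buffers that "immutable" values alias.
function unsafe_read!(stream, seq)
    # ... fill seq's underlying buffers in place ...
end

end
```

Anyone who really wants the unsafe path can still call `SeqIO.unsafe_read!` explicitly, which documents the risk at the call site.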
By the way, I've found a package that already implements a formatted stream type: https://github.com/JuliaIO/FileIO.jl. Though this package loads a whole file in one shot, it is the kind of thing I have in mind. If it works well without sacrificing flexibility and performance, there is no reason not to use it.
Mutable sequences partially solve this, but a lot of the data (ids, descriptions, etc.) is still regular strings. So we either have to reallocate them, unsafely modify them in place, or invent our own string-like types with different semantics.
I don't think the last option would be crazy. It could just be like this:

```julia
type Description
    data::Vector{UInt8}
end
```

Then define a `convert` method for strings that insulates the string from mutation by copying the data:

```julia
convert(::Type{String}, description::Description) = UTF8String(copy(description.data))
```
We should consider hooking into FileIO; it does look pretty similar to what we want. Some changes might be required. Most notably, we need to be able to incrementally read and write data, which seems at odds with the `save` and `load` API. I'll investigate this option, though.
Do bioinformatics formats have magic bytes? I don't think many evo- and phylo-related file formats do.
BAM is the only one I know of that does. There are some HDF5-based formats that are at least identifiable as HDF5 from their magic bytes.
While there are some other Affymetrix-specific binary file types, such as CDF, I don't expect they'll be in the scope of BioJulia any time soon. Either way, Stephen Turner has a nice overview of bioinformatics file types. More on Affymetrix file formats can be found on their developer homepage.
I reckon magic bytes would have really benefitted phylo formats, especially ones like Nexus, where a program such as MrBayes might have its own blocks of data.
> So we either have to reallocate them, unsafely modify them in place, or invent our own string-like types with different semantics.
+1 for introducing mutable string-like types. I think immutability of string types is not an indispensable property.
In order to incrementally parse records, we may need to define a new type indexed by a format type. This is because there is no room to inject parsing state into the `Stream` type of the FileIO.jl package:

```julia
immutable Stream{F<:DataFormat,IOtype<:IO} <: Formatted{F}
    io::IOtype
    filename::Nullable{UTF8String}
end
```
So, slightly changing the `Stream` type, the following `ParserIO` type would be more appropriate:

```julia
abstract AbstractParser

immutable ParserIO{F<:DataFormat,P<:AbstractParser}
    parser::P
    filename::Nullable{UTF8String}
end

function Base.read(p::ParserIO{format"FASTA"}, ::Type{DNASequence})
    ...
end
# and other methods

type FASTAParser <: AbstractParser
    state::Ragel.State
    ...
end
# and other parsers
```

Note that `ParserIO` is not a subtype of `IO`. That means there is no obligation to implement the `read` methods that read byte by byte from a stream.
I'm having some success with this. I have an in-place, non-allocating FASTQ parser working using mutable sequences and a simple mutable string type. It can parse 1 million 100nt reads (about 250MB) in 1.5 seconds (for comparison, `wc -l` takes 0.26 seconds), which is good considering we're encoding sequences, decoding quality scores, and validating everything.
One complication we need to figure out is what to do about `Nullable`. I was using it in the BED parser for optional fields, but we need everything to be mutable. I guess we could define our own `MutableNullable` type, but that's a little unfortunate. If anyone has a bright idea about how to handle typed, optional fields, let me know.
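For what it's worth, such a `MutableNullable` could be as small as the following sketch (the type and its fields are hypothetical, not a settled design):

```julia
# Sketch of a mutable counterpart to Base.Nullable.
type MutableNullable{T}
    isnull::Bool
    value::T

    MutableNullable() = new(true)           # starts empty
    MutableNullable(value::T) = new(false, value)
end

# A parser could then set or clear an optional field in place:
#   record.score.value  = 42
#   record.score.isnull = false
```

The cost relative to `Base.Nullable` is that each field becomes a heap-allocated box, so the record itself would allocate once up front rather than on every parse.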
I'm not sure I understand; `Nullable` variables are mutable, aren't they? I would have thought that even if they were the field of an immutable type, the field binding would be immutable, but the `Nullable` variable itself could be mutated.
With Phylo nodes, I couldn't easily predict the different fields and annotations a user might apply to a node. So I decided that rather than storing such fields in the node variable, it could be done the other way around: have the additional information point to a node on the tree. For such purposes a `Dict{Node, T}` was fine.
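As a sketch of that pattern, with a `Node` type invented purely for illustration:

```julia
# Sketch: annotations point at nodes, rather than living inside them.
type Node
    label::ASCIIString
end

node = Node("taxonA")

# Any extra, user-defined information is keyed by node in an external Dict,
# so the Node type itself never needs to anticipate these fields.
bootstrap = Dict{Node,Float64}()
bootstrap[node] = 0.95
```

This keeps the node type closed and lets each user attach whatever typed side tables they need.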
I think this issue is mostly addressed now. Working with FileIO is still something we should consider, but not urgent. Are there other design points we need to think about, or can this be closed?
The current interface is satisfactory (thank you @dcjones!). If we need more functionality, we can add it piece by piece in the future. So let's close this one.
Thanks to the great effort of @dcjones, we're getting a powerful framework to parse various data formats at the speed of light. But I think that the interface of the parser can be more flexible and idiomatic in the context of Julia. The current interface is the `read` method only, and its behavior is significantly different from the methods defined in the standard library. In the `Base` module, `read(io, type)` always returns an object of `type`, but `read(io, FASTA)` returns an iterator of biological sequences; this is counterintuitive to me.

I propose the following functions as public APIs on top of the parser framework:

- `open(filename, format)`: open `filename` and return a stream indexed by `format`
- `read(stream, type)`: read a `type` object from `stream`
- `read!(stream, object)`: read a record into the preallocated `object` from `stream`
- `each(stream, type)`: return an iterator of `type` objects from `stream`
- `close(stream)`: close `stream`

`open`, `read`, `read!` and `close` behave like the corresponding methods in the standard library, and `each` is something like `Base.eachline`. A stream object is indexed by a file format: defined as `Stream{F<:FileFormat}`, `Stream{FASTA}`, for example, is a stream type for the FASTA format.

The rationale behind this idea is that file format and data type are not completely dependent. For example, we can read DNA sequences from several formats (including FASTA, FASTQ, SAM, etc.), and a SAM file can generate several kinds of data (including DNA sequences, alignments, sequence qualities, etc.).
The next code may illustrate my idea clearly:
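A minimal sketch of what such usage could look like (the file name is illustrative, and the API is the one proposed above, which does not exist yet):

```julia
# Hypothetical usage of the proposed format-indexed stream API.
stream = open("sequences.fasta", FASTA)
seq = read(stream, DNASequence)   # read one record as a DNASequence
close(stream)
```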
or
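Iterating over every record with the proposed `each` would look something like this (again a sketch with an illustrative file name):

```julia
# Hypothetical iteration over all records via the proposed each().
stream = open("sequences.fasta", FASTA)
for seq in each(stream, DNASequence)
    println(length(seq))   # process each record as it is parsed
end
close(stream)
```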
If there are numerous records in a file, you can save memory allocation by using `read!`.

Multiple dispatch in Julia would enable this interface very easily; a method `read(::Stream{Format}, ::Type{Record})` will be defined only if a `Record` object can be extracted from a `Format` stream.

I'd like to know your opinions. Do you think this is reasonable?
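A hedged sketch of the in-place `read!` pattern proposed above (the file name, the `eof` check, and the record type are illustrative assumptions, not settled API):

```julia
# Hypothetical in-place reading with a single preallocated record.
stream = open("reads.fastq", FASTQ)
seq = DNASequence()
while !eof(stream)
    read!(stream, seq)   # overwrite seq with the next record
    # ... process seq before the next iteration reuses its buffers ...
end
close(stream)
```

Because one record object is reused for the whole file, the loop allocates almost nothing regardless of how many records the file contains.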