Closed harryscholes closed 4 years ago
This is entirely intended behaviour. The manual page here: https://biojulia.net/FASTX.jl/stable/manual/fasta/ shows users how to iterate through the records in a file in the usual way, and then notes that you can also use `read!` to overwrite a single record, the point being to reduce allocations. But maybe we should elaborate a bit and show what can happen if you're not careful.
Filtering a file is also something FASTX should provide out of the box. I'll make sure it's in the next release.
Yep, I think a note or an example of the gotcha and how to avoid it could be added to the docs.
I can give this a shot. I'm thinking a good approach would be to create a `FastaIterator` object that wraps a `Reader` and closes the stream when it's done. The filtering could then be achieved with simply:
```julia
filtered = open(path) do file
    [rec for rec in FastaIterator(file) if my_predicate(rec)]
end
```
I think this is more flexible, since we can rely on all the filtering etc. in `Base`. The `FastaIterator` could optionally operate in place on a single sequence.
I think the `iterate` method supplied by BioGenerics.jl, used here, performs in-place iteration. What about coupling `iterate` with Julia's `Iterators.filter`?
```julia
filtered = open(FASTA.Reader, filepath) do reader
    records = Vector{FASTA.Record}()
    for rec in Iterators.filter(my_predicate, reader)
        push!(records, rec)  # note: BioGenerics.jl's iterate returns a copy
        # do stuff ...
    end
    return records
end
```
```julia
filtered = open(reader -> collect(Iterators.filter(my_predicate, reader)), FASTA.Reader, filepath)
```
Both of these `open` forms close the stream.
Yes, this is a better approach than creating a new iterator object. Better to just rely on the already-existing functionality.
Setup
Expected Behavior
Using the `read!(reader, record)` way of reading FASTA files (https://biojulia.net/FASTX.jl/stable/manual/fasta/), I would expect each element of the resulting array to be a distinct record from the file:
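A minimal sketch of the pattern in question (the file path and the push-into-a-vector loop are illustrative reconstructions, not the exact original code):

```julia
using FASTX

# Reuse a single record buffer and `read!` into it, pushing after each read.
# Expectation: `records` ends up holding every record in the file.
records = FASTA.Record[]
open(FASTA.Reader, "seqs.fasta") do reader  # "seqs.fasta" is a placeholder path
    record = FASTA.Record()
    while !eof(reader)
        read!(reader, record)
        push!(records, record)  # every element aliases the same buffer
    end
end
```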
Current Behavior
However, all entries in the resulting array are for the final record in the file.
Possible Solution / Implementation
I gather that this might be the 'correct' behaviour, but it is a massive gotcha. One way I've found to make this work is to `copy` the record within the loop. If this is the correct behaviour, maybe we could add a note to the docs showing how the overwriting can be avoided.
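A sketch of the copy workaround, assuming `Base.copy` is defined for `FASTA.Record` (the path is a placeholder):

```julia
using FASTX

records = FASTA.Record[]
open(FASTA.Reader, "seqs.fasta") do reader
    record = FASTA.Record()
    while !eof(reader)
        read!(reader, record)
        push!(records, copy(record))  # copy, so the next read! cannot overwrite it
    end
end
```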
NB this problem is not encountered if you 'do work' with the record, then push it to an array, e.g.:
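One way to read this, sketched under the assumption that the 'work' extracts the values you need (here the identifiers) before the next `read!` overwrites the shared buffer:

```julia
using FASTX

ids = String[]
open(FASTA.Reader, "seqs.fasta") do reader
    record = FASTA.Record()
    while !eof(reader)
        read!(reader, record)
        # do work: pull out what we need as a plain String,
        # then push the extracted value rather than the buffer itself
        push!(ids, FASTA.identifier(record))
    end
end
```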
Context
Reading through very large FASTA files and selecting records that meet some condition, e.g. the identifier is in some set of IDs that I want to keep:
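A sketch of that use case (the ID set and path are made up), iterating the reader directly since `iterate` yields fresh records:

```julia
using FASTX

keep = Set(["id1", "id2", "id3"])  # hypothetical identifiers to retain
filtered = open(FASTA.Reader, "very_large.fasta") do reader
    collect(Iterators.filter(rec -> FASTA.identifier(rec) in keep, reader))
end
```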
Your Environment