BioJulia / FASTX.jl

Parse and process FASTA and FASTQ formatted files of biological sequences.
https://biojulia.dev
MIT License
61 stars 20 forks source link

Unexpected behaviour when reading FASTA files with read!(reader, record) #13

Closed harryscholes closed 4 years ago

harryscholes commented 5 years ago

Setup

using FASTX

write("seqs.fa",
"""
>a b
VFQTFCRRTCCFARYDRFIR
>c d
SCHKIKTFTYPTMKRFQGEH
""")

Expected Behavior

Using the read!(reader, record) way of reading FASTA files (https://biojulia.net/FASTX.jl/stable/manual/fasta/)

function readfasta(fasta)
    records = FASTA.Record[]
    open(FASTA.Reader, fasta) do reader
        record = FASTA.Record()
        while !eof(reader)
            read!(reader, record)
            push!(records, record)
        end
    end
    return records
end

I would expect:

julia> readfasta("seqs.fa")
2-element Array{FASTX.FASTA.Record,1}:
 FASTX.FASTA.Record:
   identifier: a
  description: b
     sequence: VFQTFCRRTCCFARYDRFIR
 FASTX.FASTA.Record:
   identifier: c
  description: d
     sequence: SCHKIKTFTYPTMKRFQGEH

Current Behavior

However, all entries in the resulting array are for the final record in the file.

julia> readfasta("seqs.fa")
2-element Array{FASTX.FASTA.Record,1}:
 FASTX.FASTA.Record:
   identifier: c
  description: d
     sequence: SCHKIKTFTYPTMKRFQGEH
 FASTX.FASTA.Record:
   identifier: c
  description: d
     sequence: SCHKIKTFTYPTMKRFQGEH

Possible Solution / Implementation

I gather that this might be the 'correct' behaviour, but it is a massive gotcha. One way I've found to get this to work is to copy the record within the loop:

function readfasta(fasta)
    records = FASTA.Record[]
    open(FASTA.Reader, fasta) do reader
        record = FASTA.Record()
        while !eof(reader)
            read!(reader, record)
            push!(records, copy(record))
        end
    end
    return records
end

If this is the correct behaviour, maybe we could add a note in the docs to show how the overwriting can be avoided.

NB this problem is not encountered if you 'do work' with the record, then push it to an array, e.g.:

function readfasta(fasta)
    records = FASTA.Record[]
    open(FASTA.Reader, fasta) do reader
        record = FASTA.Record()
        while !eof(reader)
            read!(reader, record)
            record_no_desc = FASTA.Record(FASTA.identifier(record), FASTA.sequence(record))
            push!(records, record_no_desc)
        end
    end
    return records
end

Context

Reading through very large FASTA files and selecting records that meet some condition e.g. the identifier is in some set of ids that I want to keep e.g.:

function filterfasta(fasta, ids)
    records = FASTA.Record[]
    open(FASTA.Reader, fasta) do reader
        record = FASTA.Record()
        while !eof(reader)
            read!(reader, record)
            if FASTA.identifier(record) in ids
                push!(records, copy(record))
            end
        end
    end
    return records
end

Your Environment

(FASTARecordReaderBug.jl) pkg> st
    Status `~/Dropbox/Projects/FASTARecordReaderBug.jl/Project.toml`
  [c2308a5c] FASTX v1.1.0 #fd09319 (https://github.com/BioJulia/FASTX.jl.git)

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-8569U CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = atom  -a
  JULIA_NUM_THREADS = 4
TransGirlCodes commented 4 years ago

This is entirely intended behaviour. The manpage here: https://biojulia.net/FASTX.jl/stable/manual/fasta/ shows users how to iterate through the records in a file regularly, and then says you can also use read! to overwrite a single record - point being to reduce allocations. But maybe we should elaborate a bit and show what can happen if you're not being too careful..

TransGirlCodes commented 4 years ago

Filtering a file is also something FASTX should provide out of the box. I'll make sure it's in the next release.

harryscholes commented 4 years ago

Yep, I think maybe just a note or an example of a gocha and how to avoid it could be added to the docs.

jakobnissen commented 4 years ago

I can give this a shot. I'm thinking a good approach would be to create a FastaIterator object that wraps a Reader, and closes the stream when it's done. Then the filtering could be achieved with simply:

filtered = open(path) do file
    [rec for rec in FastaIterator(file) if my_predicate(rec)]
end

I think this is more flexible, since we can rely on all the filtering etc. of Base. The FastaIterator could optionally operate in-place on a single sequence.

CiaranOMara commented 4 years ago

I think the iterate method supplied by BioGenerics.jl, used here, performs in-place iteration. What about coupling iterate with julia's Iterator.filter?

filered = open(FASTA.Reader, filepath) do reader
    records = Vector{FASTA.Record}()
    for rec in Iterators.filter(my_predicate, reader))
        push!(records, rec) # Note: BioGenerics.jl's iterate returns a copy.
        # do stuff ...
    end
    return records
end
filtered = open((reader)->collect(Iterators.filter(my_predicate, reader)), FASTA.Reader, filepath)

Both of these open close the stream.

jakobnissen commented 4 years ago

Yes, this is a better approach than to create a new iterator object. Better to just rely on the already-existing functionality