BioJulia / FASTX.jl

Parse and process FASTA and FASTQ formatted files of biological sequences.
https://biojulia.dev
MIT License
61 stars 20 forks source link

Non-indexed FASTA: access sequence by genomic region #110

Closed apraga closed 1 year ago

apraga commented 1 year ago

Hi,

Indexed FASTA have extract to quickly get a sequence. I could not find a way to do the same without an index. Could such a method be added ?

Expected Behavior

For example, with sequence("chr1", 10, 1000, reader)

Possible Solution / Implementation

My attempt was

function sequence(chrom, x, y, reader)
    r= first(reader)
    seq = nothing
    while iterate(reader) != nothing
        if identifier(r) == chrom
            seq = FASTX.sequence(r)[x:y]
            break
        end
    end
    return seq
end

Your Environment

Thanks,

jakobnissen commented 1 year ago

Dear @apraga

Thanks for making the issue and showing interest in FASTX. I don't think this is a good idea for a few reasons:

  1. It's easy to implement yourself:
    julia> sequence(first(i for i in reader if identifier(i) == "S6C294"))[100:200]
    "CGACTTCGGGCCGATGTTGTTTCAATGCCGTCTGAATCGCTTCCATCCGGATTTCGGCAAGCAGGTTCACGCCTTCTGTGGGCAGTTCCAAACGTTGCGCG"
  2. The code above generalizes better - you can pick the first sequence that fulfills any arbitrary function, not just from its identifier.
  3. In the above code, it's clear that the operation is O(N), whereas extract is O(1), and it's also clear it's not random access which extract is.