BioJulia / BED.jl

MIT License

Reading whole file or FileIO integration ? #17

Open jonathanBieler opened 1 year ago

jonathanBieler commented 1 year ago

In 99% of my use cases I just want to read the whole BED file and get a vector of records. Doing so requires quite a bit of boilerplate:

# Import the BED module.
using BED

# Open a BED file.
reader = open(BED.Reader, "data.bed")

# Iterate over records.
for record in reader
    # Do something on record (see Accessors section).
    chrom = BED.chrom(record)
    # ...
end

# Finally, close the reader.
close(reader)

Every user has to write this boilerplate, possibly several times. In comparison, in Python you can do pr.read_bed(path). This seems like an important usability issue.

The solution would be either to add an internal BED.load("file.bed") or to integrate the FileIO interface. I don't have a strong preference, but I would also do the same for other "small" file formats (ones that typically fit in memory) like VCF, so it would be better to be consistent about it. Note that FileIO also has a streaming interface for large files, so it could also be used for BAM and FASTQ files.

kescobo commented 1 year ago

There was a lot of discussion of a similar nature over at FASTX.jl (see e.g. https://github.com/BioJulia/FASTX.jl/issues/76), and I think @jakobnissen has started putting in some work on that in BioGenerics.jl.

In short, you are completely correct :wink:

CiaranOMara commented 1 year ago

I'm for FileIO integration, but think it should be done in a new BEDFiles.jl package.

As a result of @jakobnissen's work, it's possible to load all records with the following.

records = open(collect, BED.Reader, "data.bed")

This approach also closes the reader.

And for completeness, below is a longhand variant using the do syntax.

records = open(BED.Reader, "data.bed") do reader
    return collect(reader)
end
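
For anyone who wants the one-call ergonomics asked for at the top of the thread, the one-liner above can be wrapped in a small helper. This is only a sketch: the name `read_bed` is hypothetical and not part of BED.jl, and it assumes, as described above, that `collect` works on a `BED.Reader`.

```julia
using BED

# Hypothetical convenience wrapper, NOT part of BED.jl's API.
# Reads every record of the BED file at `path` into a vector.
function read_bed(path::AbstractString)
    # The function-first form of `open` closes the reader when `collect`
    # returns, and also if it throws.
    return open(collect, BED.Reader, path)
end
```

With that defined, `records = read_bed("data.bed")` would replace the open/iterate/close boilerplate from the first comment. It loads everything into memory, so it only suits files that fit there.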